April 13, 2026 · 8 min read

inference: treating AI like a systems problem, not magic

this blog is gonna be a quick read


tldr

  • built a full custom inference layer from scratch in Python over four projects.
  • crashed a GPU intentionally to prove sequential processing is a lie.
  • wrote a custom FastAPI server wrapping SGLang, achieving a 10× speedup via API‑level batching.
  • rebuilt the fundamental math of AI (the KV cache) from scratch in PyTorch, yielding a 14× speedup.
  • wired up Prometheus and Grafana to watch the memory mountain build and collapse in real time.
  • all of this to prepare the inference foundation for the Sonar ZK coprocessor on Solana.

prologue: the magic trick is just crazy amounts of engineering

to me, and maybe to many others as well, AI feels like magic. you type a prompt into a box, and seconds later, human‑like text pours out.

but when you start building infra around it, especially things like Sonar (which is my ZK coprocessor) where you need verifiable, deterministic computation, you quickly realize that treating the AI engine like a black box is a recipe for disaster.
if you don't understand how the engine uses memory, how it schedules requests, and why it bottlenecks, your system will collapse the second it gets real traffic.

so i spent the last few days breaking apart the infra that powers things like ChatGPT, Gemini, and whatever LLM you may be using. i wanted to see exactly how they juggle millions of users without their servers melting into slag.

and i decided what better way to ground my newly gained knowledge than to write it all out here (in a bit more formatted way than my twitter post).

what follows is kinda like a deep dive into how i built four sequential projects to break down the magic and understand the smart engineering underneath.


project 1: serving and breaking a local llm

to understand how to fix a system, you first have to break it.

the goal of the first project was simple: deploy a model locally, hit it with a single request to prove it works, and then intentionally overload it.

i started by spinning up vLLM, which is by far the most popular, open‑source inference engine that i'm aware of. i asked it to serve microsoft/phi-2, a relatively small 2.7 billion parameter model.

my laptop runs on EndeavourOS and has an 8 GB GPU. Phi‑2 takes up about 5.4 GB of that just to load its weights. i spun up the server, and it immediately threw a fatal CUDA out‑of‑memory error.

that was lesson number one: the model weights are only half the battle. the inference engine pre‑allocates a big slice of your GPU memory up front to hold context (the KV cache) for future requests. vLLM's gpu_memory_utilization setting defaults to 0.9, so if you don't cap it to what your card actually has free, it will try to claim nearly everything and crash.
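to put rough numbers on that, here's a back‑of‑the‑envelope sketch. it assumes fp16 weights (2 bytes per parameter), decimal gigabytes, and it ignores activations, the CUDA context, and whatever the desktop is already holding, so the real headroom is smaller than what it prints. the function names are mine, not anything from vLLM.

```python
GB = 1e9  # decimal gigabytes, matching the "8 GB card" figure


def weight_bytes(n_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed just to hold the weights (fp16 assumed by default)."""
    return n_params * bytes_per_param


def kv_budget_bytes(vram_bytes: float, utilization: float, weights: float) -> float:
    """What is left for the KV cache inside the engine's share of VRAM.

    Simplification: ignores activations and CUDA context overhead.
    """
    return vram_bytes * utilization - weights


phi2 = weight_bytes(2.7e9)       # microsoft/phi-2
tinyllama = weight_bytes(1.1e9)  # TinyLlama

print(f"phi-2 weights:     {phi2 / GB:.1f} GB")       # 5.4
print(f"tinyllama weights: {tinyllama / GB:.1f} GB")  # 2.2
print(f"phi-2 KV budget:     {kv_budget_bytes(8 * GB, 0.9, phi2) / GB:.1f} GB")       # 1.8
print(f"tinyllama KV budget: {kv_budget_bytes(8 * GB, 0.9, tinyllama) / GB:.1f} GB")  # 5.0
```

on paper Phi‑2 leaves under 2 GB for everything else on an 8 GB card, which is why the smaller model was so much more comfortable.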

so, i switched to a smaller model (TinyLlama, 1.1B parameters) which fit perfectly. i sent a raw curl request. it responded. the engine was working.

then i brought the hammer down.

i wrote a Python script using aiohttp to bomb the server with 50 async requests at the exact same time. i wanted to see how the system handles a traffic spike.
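a minimal sketch of that kind of burst script is below. it is not my exact code: the URL, model name, and payload shape in make_http_sender are assumptions (an OpenAI‑compatible completions route), and the demo at the bottom uses a fake sender so the timing logic runs without a server.

```python
import asyncio
import time


async def timed(send, i):
    """Await one request and return how long it took, in seconds."""
    start = time.perf_counter()
    await send(i)
    return time.perf_counter() - start


async def burst(send, n=50):
    """Launch n requests at the same moment; latencies come back in request order."""
    return await asyncio.gather(*(timed(send, i) for i in range(n)))


def make_http_sender(session, url, model):
    """Build a sender around an aiohttp.ClientSession.

    Assumes `url` is an OpenAI-compatible completions endpoint.
    """
    async def send(i):
        payload = {"model": model, "prompt": f"request {i}: say hi", "max_tokens": 16}
        async with session.post(url, json=payload) as resp:
            await resp.json()
    return send


async def fake_send(i):
    # Stand-in for the network call so the sketch runs offline.
    await asyncio.sleep(0.01)


if __name__ == "__main__":
    latencies = asyncio.run(burst(fake_send))
    print(f"{len(latencies)} requests, slowest took {max(latencies):.3f}s")
```

against a real server you would open an aiohttp.ClientSession, pass make_http_sender(session, url, model) to burst, and look at how the latencies cluster.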

if the server processed them one by one, sequentially, the latency for the 50th request would be astronomical. but that's not what happened.

the first few requests finished in about 0.04 seconds. but then, a massive block of almost 40 requests all finished at the exact same timestamp, exactly 0.65 seconds after they were sent.

that's not a coincidence. that's simply the engine refusing to process them sequentially. instead, the scheduler held them in a queue, packed them into the GPU memory together, and executed them in parallel.
this maximizes overall throughput at the cost of individual request latency.

this is called continuous batching, and it's the only reason production AI doesn't instantly crash.
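here's a toy model of the idea, not vLLM's actual scheduler. it assumes every decode step costs the same no matter how many requests are in the batch (a simplification), and that a request can join the running batch the moment a slot frees up. time is measured in steps, and every request needs at least one token.

```python
from collections import deque


def simulate(requests, max_batch):
    """requests: list of (arrival_step, tokens_needed).

    Returns the step at which each request finishes, in input order.
    """
    waiting = deque(sorted(enumerate(requests), key=lambda item: item[1][0]))
    active = {}       # request id -> tokens still to generate
    finished_at = {}
    step = 0
    while waiting or active:
        # Admit arrived requests into free slots before every step.
        while waiting and waiting[0][1][0] <= step and len(active) < max_batch:
            rid, (_, tokens_needed) = waiting.popleft()
            active[rid] = tokens_needed
        # One decode step produces one token for every active request.
        for rid in list(active):
            active[rid] -= 1
            if active[rid] <= 0:
                finished_at[rid] = step + 1
                del active[rid]
        step += 1
    return [finished_at[rid] for rid in range(len(requests))]


burst = [(0, 10)] * 50  # 50 requests at once, 10 tokens each
print(simulate(burst, max_batch=1)[-1])     # one at a time: last finishes at step 500
print(max(simulate(burst, max_batch=64)))   # batched: everyone finishes at step 10
print(simulate([(0, 3), (0, 1), (0, 2)], max_batch=2))  # → [3, 1, 3]
```

the last line is the "continuous" part: the third request slips into the slot the short one vacated instead of waiting for the whole batch to drain.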


project 2: building a custom inference server

project 1 proved the concept, but vLLM still acted like a generic black box. it gave me an OpenAI‑compatible API, and that was it.

when you integrate an inference layer into a larger system like Sonar, a generic API isn't enough. you need to intercept requests, format them specifically for ZK proofs, and have total programmatic control over how the engine batches them.

so i built a custom server from scratch.

i used FastAPI and embedded SGLang directly as a Python library. this meant the model weights loaded directly into my web server process with no middleman overhead.

i defined strict data models using Pydantic. but here's where things got messy. when i tried to boot the server, the entire thing crashed with a multiprocessing error.

because SGLang spins up multiple background processes (one for the scheduler, one for the worker), and because FastAPI runs its own async event loop, the child processes were recursively trying to initialize their own completely separate inference engines. they were clashing and instantly killing the main process.

i had to restructure the entire server using a modern FastAPI lifespan context manager to guarantee the engine was initialized only once, by the parent process, before any network traffic was accepted.
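the shape of that fix looks roughly like this. StubEngine is a hypothetical stand‑in so the sketch runs without SGLang or a GPU, and the FastAPI wiring is left as a comment because it assumes fastapi and uvicorn are installed. the part that matters is that the engine is built inside the lifespan function, not at import time.

```python
import asyncio
from contextlib import asynccontextmanager


class StubEngine:
    """Hypothetical stand-in for the real inference engine."""
    instances = 0

    def __init__(self):
        StubEngine.instances += 1
        self.alive = True

    def shutdown(self):
        self.alive = False


state = {}


@asynccontextmanager
async def lifespan(app):
    # Runs once, before the server starts accepting traffic.
    state["engine"] = StubEngine()
    try:
        yield
    finally:
        # Runs once on shutdown; release the engine and its resources.
        state.pop("engine").shutdown()


# Wiring, assuming FastAPI and uvicorn are installed:
#   app = FastAPI(lifespan=lifespan)
#   if __name__ == "__main__":
#       uvicorn.run(app, host="127.0.0.1", port=8000)


async def demo():
    """Drive the lifespan by hand, the way the server would."""
    async with lifespan(app=None):
        engine = state["engine"]
        assert engine.alive
    return engine


if __name__ == "__main__":
    engine = asyncio.run(demo())
    print(StubEngine.instances, engine.alive)  # 1 False
```

keeping the launch code under the `__main__` guard matters too: a freshly spawned child process re‑imports the module, and anything at module top level runs again there.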

once that was stable, i wrote two endpoints:

  1. a standard /generate endpoint that takes one prompt.
  2. a custom /batch endpoint that takes an array of prompts.

then i benchmarked them.

  • test 1: fire 20 concurrent requests to the /generate endpoint.
    result: ~5.32 seconds.

  • test 2: fire one single request containing an array of 20 prompts to the /batch endpoint.
    result: ~0.51 seconds.

that's roughly a 10× speedup.

when you rely on the engine to catch 20 separate network requests and group them on the fly, you pay MASSIVE amounts of networking and parsing overhead.

when you batch at the API level and hand the engine a perfectly packaged array, it instantly allocates the exact VRAM needed and blasts it through the GPU in one shot.
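the difference between the two endpoints comes down to how many times the engine gets invoked. the sketch below uses dataclasses as a stand‑in for the Pydantic models and a hypothetical CountingEngine instead of SGLang, so the field names and the generate signature are assumptions. it only counts calls; it says nothing about real timings.

```python
from dataclasses import dataclass


@dataclass
class GenerateRequest:  # stand-in for a Pydantic model
    prompt: str
    max_new_tokens: int = 64


@dataclass
class BatchRequest:  # stand-in for a Pydantic model
    prompts: list
    max_new_tokens: int = 64


class CountingEngine:
    """Hypothetical engine stub that records how often it is invoked."""

    def __init__(self):
        self.calls = 0

    def generate(self, prompts, max_new_tokens):
        self.calls += 1
        return [f"<completion for {p!r}>" for p in prompts]


def handle_generate(engine, req: GenerateRequest) -> str:
    """Body of a /generate-style handler: one prompt per engine call."""
    return engine.generate([req.prompt], req.max_new_tokens)[0]


def handle_batch(engine, req: BatchRequest) -> list:
    """Body of a /batch-style handler: the whole array in one engine call."""
    return engine.generate(req.prompts, req.max_new_tokens)


prompts = [f"prompt {i}" for i in range(20)]

one_by_one = CountingEngine()
for p in prompts:
    handle_generate(one_by_one, GenerateRequest(p))

batched = CountingEngine()
outputs = handle_batch(batched, BatchRequest(prompts))

print(one_by_one.calls, batched.calls, len(outputs))  # 20 1 20
```

twenty HTTP round trips, twenty JSON parses, and twenty scheduler hand‑offs versus one of each.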


project 3: re-implementing kv cache

this is where we get into the raw math.

AI models generate text autoregressively. that means they generate exactly one token (roughly a word, or a piece of one) at a time. naively, to guess the 100th word, the system has to completely re‑read and re‑calculate the massive matrix multiplications for the first 99 words.

all of that math, repeated, every single time it prints a new word.

imagine trying to write a book, but having to read it from the start every time you want to add a single letter. the total work grows at least quadratically with the length of the output.
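you can see the blow‑up just by counting how many tokens get pushed through the layer. this sketch counts token passes only (it assumes each pass costs about the same, and it ignores the attention scores themselves, which add even more work in the naive case). the function names are made up for illustration.

```python
def naive_token_passes(n_new: int, start_len: int = 1) -> int:
    """Tokens processed when every step re-reads the whole sequence so far."""
    return sum(start_len + step for step in range(n_new))


def cached_token_passes(n_new: int, start_len: int = 1) -> int:
    """Tokens processed when each token is only ever processed once."""
    return start_len + (n_new - 1)


n = 1000
print(naive_token_passes(n))   # 500500
print(cached_token_passes(n))  # 1000
```

for 1000 generated tokens that's about 500× more token passes when nothing is reused, and the gap keeps widening as the sequence grows.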

the conclusion? it's insanely slow.

to prove this, i built a naive self‑attention layer from scratch in pure PyTorch. i gave it a dummy sequence and asked it to generate 1000 tokens on the CPU.

it took 1.36 seconds. it physically CRAWLED as the sequence got longer.

the fix for this is what i think is the single most important optimization in all of AI. it's called the KV cache.

instead of re‑calculating the math for the first 99 words, we just save the resulting tensors (the keys and values) in temporary memory. when we need the 100th word, we only calculate the math for the 99th word, grab the saved math for the rest from RAM, and predict.

it's like keeping your finger on the page and using sticky notes instead of starting the book over.

i rewrote my PyTorch implementation to include this cache logic. the layer now only projected the newest token, appended its key and value to the historical cache, and computed attention.
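here's a stripped‑down version of both paths side by side. it is written in NumPy instead of PyTorch so it runs anywhere, uses a single attention head with random weights, and is a sketch of the technique, not my actual layer. the check at the end is the important bit: the cached path has to produce exactly the same outputs as the naive one.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # model / head dimension (arbitrary for the sketch)
Wq, Wk, Wv = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))


def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def naive_step(tokens):
    """Re-project every token so far, then attend from the last one."""
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    scores = q[-1] @ k.T / np.sqrt(D)
    return softmax(scores) @ v


class CachedAttention:
    def __init__(self):
        self.keys = np.empty((0, D))
        self.values = np.empty((0, D))

    def step(self, token):
        """Project only the newest token and append its key/value to the cache."""
        q = token @ Wq
        self.keys = np.vstack([self.keys, token @ Wk])
        self.values = np.vstack([self.values, token @ Wv])
        scores = q @ self.keys.T / np.sqrt(D)
        return softmax(scores) @ self.values


tokens = rng.standard_normal((32, D))
cached = CachedAttention()
naive_out = np.stack([naive_step(tokens[: t + 1]) for t in range(len(tokens))])
cached_out = np.stack([cached.step(tok) for tok in tokens])
print(np.allclose(naive_out, cached_out))  # True
```

np.vstack copies the cache on every step, which is fine for a demo; a real implementation would preallocate the buffer.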

the result? 0.09 seconds.

a 14× speedup just by saving intermediate math.

but like all good things in life, there's a huge trade‑off. saving all those "sticky notes" for every single user eats up MASSIVE amounts of memory.
and this right here is exactly why AI servers need hundreds of gigabytes of expensive VRAM. we're literally trading memory capacity for raw speed.
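the arithmetic behind that is short. every token stores one key and one value vector per layer, so the per‑token cost is 2 × layers × KV heads × head size × bytes per element. the shape below (32 layers, 32 KV heads, head size 128, fp16) is an assumed 7B‑class configuration picked for illustration, not a model i ran in these projects.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache one token occupies: a key and a value per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


per_token = kv_bytes_per_token(n_layers=32, n_kv_heads=32, head_dim=128)
per_request = per_token * 4096  # one user holding a 4096-token context

print(per_token / 2**20)    # 0.5 (MiB per token)
print(per_request / 2**30)  # 2.0 (GiB per request)
```

at half a megabyte per token, a few dozen users with long contexts eat more memory than the weights themselves.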


project 4: tuning continuous batching and observability

so now we know the KV cache takes up memory. and we also know continuous batching tries to pack as many users into that memory as possible.

but if you tune your batching parameters wrong, the system falls apart.
if the cache is constantly at 100%, new requests get stuck in the queue and latency spikes.
if it's constantly at 20%, you're wasting thousands of dollars of idle GPU compute.

so to dumb it down: you cannot manage a system you cannot see.

so i did the most practical thing anyone who calls themselves an engineer would do. i built an observability stack to watch this happen in real time.

i spun up Prometheus to scrape data and Grafana to visualize it, using Docker host networking to cleanly route everything on my local machine.

i wired it to my vLLM server. and because software engineering is never easy (just call me a stupid man), i spent about fifteen to twenty minutes staring at empty graphs because i was using the deprecated metric name vllm:gpu_cache_usage_perc instead of the new vllm:kv_cache_usage_perc (thanks context7).
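in hindsight, the quick way to catch a renamed metric is to read the raw metrics page and search it, instead of trusting a name from memory. a small sketch: it parses Prometheus text exposition output and returns every sample whose name contains a substring. the sample text and its example_ metric names are made up, and the parser assumes lines carry no trailing timestamp.

```python
SAMPLE = """\
# HELP example_requests_running Number of requests currently running.
# TYPE example_requests_running gauge
example_requests_running{model_name="demo"} 3.0
# HELP example_kv_cache_usage_perc Fraction of the KV cache in use.
# TYPE example_kv_cache_usage_perc gauge
example_kv_cache_usage_perc{model_name="demo"} 0.42
"""


def find_metrics(exposition_text: str, needle: str) -> dict:
    """Return {series: value} for samples whose metric name contains `needle`."""
    found = {}
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        # The value is the last space-separated field (no timestamps assumed).
        series, _, value = line.rpartition(" ")
        name = series.split("{", 1)[0]
        if needle in name:
            found[series] = float(value)
    return found


print(find_metrics(SAMPLE, "cache_usage"))
# → {'example_kv_cache_usage_perc{model_name="demo"}': 0.42}
```

pointed at the text from the server's /metrics page, a search for "cache_usage" shows whichever name the running version actually exports.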

once i fixed the metric naming, i wrote an async Python script called sustained_load.py. this script spammed the engine with continuous, staggered requests for exactly 60 seconds.
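the core loop of a script like that is small. this is a sketch of the pattern, not the real sustained_load.py: the interval is an assumed value, and the sender is a fake coroutine so it runs without a server. each request is launched as its own task so slow responses never hold back the next launch.

```python
import asyncio
import time


async def sustained_load(send, duration_s=60.0, interval_s=0.05):
    """Start one request every `interval_s` seconds for `duration_s` seconds.

    Requests overlap: a new one is launched without waiting for earlier ones.
    Returns (requests_sent, requests_failed).
    """
    tasks = []
    deadline = time.monotonic() + duration_s
    i = 0
    while time.monotonic() < deadline:
        tasks.append(asyncio.create_task(send(i)))
        i += 1
        await asyncio.sleep(interval_s)
    # Wait for stragglers; collect exceptions instead of raising on the first.
    results = await asyncio.gather(*tasks, return_exceptions=True)
    failures = sum(isinstance(r, Exception) for r in results)
    return len(results), failures


async def fake_send(i):
    # Stand-in for an HTTP call to the inference server.
    await asyncio.sleep(0.03)


if __name__ == "__main__":
    sent, failed = asyncio.run(sustained_load(fake_send, duration_s=1.0))
    print(f"sent {sent} requests, {failed} failed")
```

the stagger is what keeps the cache under steady pressure: a new request shows up every interval for the whole run.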

then there it was. on the Grafana dashboard, i watched a perfect mountain form.

the KV cache utilization started at zero, spiked aggressively the second the script kicked off, plateaued near the very top as the scheduler maxed out the physical memory, and nosedived instantly back to zero the second the load test finished and the memory flushed.


epilogue: from magic to mechanics

when you type a prompt into ChatGPT today, it isn't just answering you. it's grabbing your prompt, grouping it with thousands of others, holding everyone's context in massive memory banks, and thinking of the next word for everyone simultaneously in one giant matrix multiplication.

before these four projects, AI inference was like magic to me. i would be lying if i said i understood the theory. hell, i didn't even know what inference really was (i'm deadass when i say i thought it had something to do with ray optics).
so obviously, the actual mechanics were obscured.

now tho, i know exactly why memory bandwidth is the true bottleneck of modern AI.
i know why API‑level batching destroys concurrent requests in throughput.
and i know how to monitor the memory pressure of an inference engine in production.

the pipeline has been proven. the engine has been understood.
the clear next step is taking this foundation and wiring it up to a ZK prover for the Sonar coprocessor.

but that's a story for another day.

all the code, benchmarks, and scripts for these four projects are open source. go check em out ^^

the lab (this is just the github repos lol)