From HTTP request to streamed tokens — an interactive visual journey through prefill, decode, KV-cache, sampling, batching, speculative decoding, and the GPU hardware that makes it all possible.
This guide covers what happens after the model is trained: how a prompt becomes a response, why decode is memory-bound, how serving frameworks maximize throughput, and what to monitor in production.
Every LLM API call travels through a deterministic pipeline before a single token is generated. Understanding this end-to-end flow is critical for debugging latency and optimizing throughput.
[Interactive pipeline timeline: per-stage latencies of 0 ms, ~2 ms, ~5 ms, ~10 ms, ~1 ms, ~200–2000 ms (prefill + decode), ~0.5 ms, and continuous streaming.]
User types a prompt and hits send. The client packages the message into an HTTP POST request with the conversation history and sends it to the API endpoint.
Analogy — Restaurant Kitchen: An inference request flows like a restaurant order. The waiter takes your order (API gateway), the host checks your reservation (auth), the kitchen preps ingredients (context assembly + tokenization), the chef cooks (prefill + decode), and the food is plated and delivered course by course (streaming). The kitchen (GPU) is always the bottleneck — everything else takes seconds compared to cooking time.
The total wall-clock time is dominated by the model (prefill + decode). Everything else — auth, tokenization, detokenization — adds single-digit milliseconds. Optimizing inference means optimizing the model step.
Before the model sees a single token, the serving layer assembles the full context: system prompt, conversation history, retrieved documents, and the user's message — all within a strict token budget.
Truncation Strategy
RAG context often consumes the largest share of the token budget. A common production pattern is to over-retrieve, then re-rank and truncate to fit the budget — balancing relevance against context length.
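The over-retrieve, re-rank, truncate pattern can be sketched as follows. This is a minimal illustration: the function names are made up for this example, and the whitespace "tokenizer" stands in for the model's real tokenizer.

```python
# Budget-aware context assembly: over-retrieve, re-rank by relevance,
# then pack documents greedily until the token budget is exhausted.
# All names here are illustrative, not from any particular framework.

def count_tokens(text: str) -> int:
    """Stand-in tokenizer: real systems use the model's own tokenizer."""
    return len(text.split())

def assemble_context(system_prompt, history, docs, user_msg, budget=4096):
    # Fixed parts (system prompt, history, user message) are always kept.
    fixed = [system_prompt] + history + [user_msg]
    remaining = budget - sum(count_tokens(t) for t in fixed)

    # Re-rank retrieved docs by score (highest first), pack until full.
    kept = []
    for doc, score in sorted(docs, key=lambda d: d[1], reverse=True):
        cost = count_tokens(doc)
        if cost <= remaining:
            kept.append(doc)
            remaining -= cost
    return [system_prompt] + kept + history + [user_msg]

context = assemble_context(
    "You are a helpful assistant.",
    ["user: hi", "assistant: hello"],
    docs=[("doc about cats " * 10, 0.2), ("doc about llm serving", 0.9)],
    user_msg="user: how does decode work?",
    budget=40,
)
```

With a 40-token budget, the high-scoring short document fits but the lower-scoring long one is dropped, trading context length for relevance.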
The prefill phase processes the entire input prompt in a single forward pass. All input tokens are processed in parallel — this is where the GPU's massive parallelism shines. It's compute-bound, not memory-bound.
Compute-Bound: GPU tensor cores are the bottleneck (high arithmetic intensity).
KV-Cache Population: Keys and Values for every input token are computed and stored.
TTFT (Time to First Token) = prefill time + scheduling overhead. It scales roughly linearly with prompt length because the full prompt must be processed before the first output token is generated.
Long system prompts and RAG contexts directly increase TTFT. Prompt caching (reusing KV-cache from identical prompt prefixes) can eliminate redundant prefill computation for repeated system prompts.
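The linear relationship between prompt length, prompt caching, and TTFT can be captured in a back-of-envelope model. The prefill rate and scheduling overhead below are illustrative placeholders, not measured values.

```python
# Back-of-envelope TTFT model: prefill is compute-bound, so time grows
# roughly linearly with the number of *uncached* prompt tokens.
# The 10,000 tok/s prefill rate and 20 ms overhead are assumptions.

def estimate_ttft_ms(prompt_tokens, cached_prefix=0,
                     prefill_tok_per_s=10_000, overhead_ms=20):
    uncached = max(prompt_tokens - cached_prefix, 0)
    return uncached / prefill_tok_per_s * 1000 + overhead_ms

cold = estimate_ttft_ms(8000)                       # full 8K-token prefill
warm = estimate_ttft_ms(8000, cached_prefix=6000)   # 6K-token prefix cached
```

The warm case only pays for the 2,000 uncached tokens, which is why prompt caching is so effective for long, repeated system prompts.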
After prefill, the model switches to decode mode: generating one token at a time, autoregressively. Each new token depends on all previous tokens. This phase is memory-bandwidth-bound, not compute-bound.
Typical figures for a 70B model on a single H100, per decode step:
Arithmetic intensity: ~1 FLOP/byte (very low: memory-bound).
Bottleneck: HBM bandwidth (3.35 TB/s on H100).
Decode throughput is almost entirely limited by how fast you can read model weights from GPU memory. A 70B FP16 model at 140 GB, with 3.35 TB/s bandwidth, gives a theoretical ceiling of ~24 tokens/sec per request. Batching multiple requests amortizes this cost.
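The bandwidth ceiling above reduces to one division, sketched here for FP16 and (as an assumed comparison point) FP8, which halves the bytes streamed per token:

```python
# Decode ceiling from the text: every generated token must stream the full
# weight set from HBM, so tokens/sec <= bandwidth / model size.

def decode_ceiling_tok_s(params_b: float, bytes_per_param: float,
                         bandwidth_gb_s: float) -> float:
    model_gb = params_b * bytes_per_param   # e.g. 70 * 2 = 140 GB at FP16
    return bandwidth_gb_s / model_gb

fp16 = decode_ceiling_tok_s(70, 2, 3350)   # single-request ceiling on H100
fp8 = decode_ceiling_tok_s(70, 1, 3350)    # quantization halves bytes read
```

This is also why quantization speeds up decode even when compute is unchanged: fewer bytes per parameter means fewer bytes streamed per token.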
The KV-cache stores precomputed Key and Value tensors so they don't need to be recomputed at every decode step. Without it, generation would be O(n²) per token. With it, it's O(n).
KV-cache per 4,096-token request: 2 (K and V) × 80 layers × 8 KV heads × 128 head-dim × 4,096 tokens × 2 bytes ≈ 1.34 GB, about 1.7% of an 80 GB H100's HBM3.
Without KV-cache: O(n²) attention work per token generated. At seq 4,096: ~16.8M attention ops.
With KV-cache: O(n) per token generated. At seq 4,096: 4,096 attention ops.
Without caching, each new token would recompute attention over all previous tokens from scratch. The KV-cache stores prior K and V projections so each decode step only computes the new token's attention against the cached history.
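A toy single-head decode loop makes this concrete. This is a pedagogical sketch, not a real attention kernel: each new token projects to K and V exactly once, appends them to the cache, and attends over the cached history.

```python
import math
import random

# Toy single-head attention decode loop illustrating the KV-cache.
# Earlier tokens' K/V projections are never recomputed; each step does
# O(current length) attention work, not O(length**2).

random.seed(0)
d = 8  # head dimension

def project(x, w):  # matrix-vector product; w is a d x d weight matrix
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

Wq = [[random.gauss(0, 0.1) for _ in range(d)] for _ in range(d)]
Wk = [[random.gauss(0, 0.1) for _ in range(d)] for _ in range(d)]
Wv = [[random.gauss(0, 0.1) for _ in range(d)] for _ in range(d)]

k_cache, v_cache = [], []
attention_ops = 0  # query-key dot products performed across all steps

def decode_step(x):
    global attention_ops
    q = project(x, Wq)
    k_cache.append(project(x, Wk))   # cache grows by one entry per token
    v_cache.append(project(x, Wv))
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
              for k in k_cache]
    attention_ops += len(k_cache)    # O(n) per step thanks to the cache
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    return [sum(w * vi for w, vi in zip(weights, col))
            for col in zip(*v_cache)]

for _ in range(5):
    out = decode_step([random.gauss(0, 1) for _ in range(d)])
```

After 5 tokens the cache holds 5 K/V pairs and the loop has done 1+2+3+4+5 = 15 attention ops; without the cache it would have redone every projection and score from scratch each step.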
PagedAttention manages KV-cache like virtual memory: allocating fixed-size pages on demand instead of reserving a contiguous block for the maximum sequence length. This eliminates memory fragmentation and enables near-zero waste, increasing batch sizes by 2–4×.
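The allocation idea can be sketched as a toy page pool in the spirit of PagedAttention (the real vLLM implementation is far more involved; page size, pool size, and all names here are illustrative):

```python
# Toy paged KV-cache allocator: each sequence gets fixed-size pages on
# demand from a shared free pool, instead of reserving contiguous memory
# for a maximum length that may never be reached.

PAGE_TOKENS = 16  # tokens per page (illustrative)

class PagedKVCache:
    def __init__(self, num_pages):
        self.free_pages = list(range(num_pages))
        self.page_table = {}   # seq_id -> list of page ids
        self.lengths = {}      # seq_id -> tokens stored

    def append_token(self, seq_id):
        n = self.lengths.get(seq_id, 0)
        if n % PAGE_TOKENS == 0:          # current page full (or none yet)
            if not self.free_pages:
                raise MemoryError("KV pool exhausted")
            self.page_table.setdefault(seq_id, []).append(
                self.free_pages.pop())
        self.lengths[seq_id] = n + 1

    def free(self, seq_id):               # request done: recycle its pages
        self.free_pages += self.page_table.pop(seq_id, [])
        self.lengths.pop(seq_id, None)

pool = PagedKVCache(num_pages=8)
for _ in range(40):                       # 40 tokens -> ceil(40/16) pages
    pool.append_token("req-A")
pages_used = len(pool.page_table["req-A"])
pool.free("req-A")
```

A 40-token sequence consumes exactly 3 pages, and freeing it returns them to the shared pool for the next request, which is what eliminates fragmentation.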
Analogy — Assembly Line with Memory: The KV-cache is like a factory assembly line where each worker remembers every product they've already processed. Without the cache, worker #50 would have to re-inspect all 49 previous products before handling #51. With it, each worker just recalls their notes from the previous products and only inspects the new one. The savings compound: generating the 1000th token is just as fast as the 2nd, instead of 1000× slower.
KV-cache memory is often the binding constraint on batch size. By the formula above, a 70B model at FP16 needs ~1.3 GB of KV-cache per 4K-context request, growing linearly to ~43 GB at the full 128K context. The FP16 weights alone (~140 GB) already exceed a single 80 GB GPU, so the model must be quantized or sharded first; whatever HBM remains after the weights bounds the batch, and at long context lengths you may only fit 2–4 concurrent requests.
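The per-request size follows directly from the formula above; a small calculator makes the scaling with context length explicit (the 70B-class configuration of 80 layers, 8 KV heads via GQA, and head-dim 128 matches the worked example in the text):

```python
# KV-cache size per request, from the formula in the text:
# 2 (K and V) x layers x kv_heads x head_dim x seq_len x bytes_per_value.

def kv_cache_gb(seq_len, layers=80, kv_heads=8, head_dim=128, dtype_bytes=2):
    return 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes / 1e9

at_4k = kv_cache_gb(4_096)      # per-request cache at a 4K context
at_128k = kv_cache_gb(131_072)  # grows linearly with sequence length
```

Because the cost is linear in `seq_len`, a 32× longer context means a 32× larger cache, which is why long-context serving is memory-planning-dominated.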
The model outputs raw logits (unnormalized scores) for every token in its vocabulary. Sampling strategies transform these logits into a probability distribution and then select the next token. Different settings produce dramatically different outputs.
Temperature: controls randomness. Low = deterministic, high = creative.
Top-K: keep only the top K tokens. 0 = disabled.
Top-P: keep tokens until cumulative probability ≥ P.
Min-P: drop tokens with prob < min_p × max_prob.
Repetition penalty: penalize tokens that appeared recently.
Analogy — Rolling Dice with Loaded Sides: Sampling from an LLM is like rolling dice where each side has a different weight. Temperature controls how loaded the dice are: at T=0 it's a certainty (always lands on the heaviest side), at T=1 it's a fair roll weighted by probability, and at T=2 it's nearly uniform — any side could come up. Top-p removes the rarest sides entirely, while top-k limits you to only the k heaviest sides.
Temperature and top-p are the two most impactful parameters. For factual tasks (coding, math), use temperature 0–0.3 with no top-p filtering. For creative writing, temperature 0.8–1.2 with top-p 0.9 gives good diversity without incoherence.
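A minimal sampler shows how these knobs compose: temperature rescales the logits, top-k and top-p prune the distribution, and the survivors are renormalized before drawing. This is a simplified sketch; production kernels fuse these steps on the GPU.

```python
import math
import random

# Logits -> next token: temperature scaling, then top-k / top-p filtering,
# then sampling from the renormalized surviving distribution.

def sample(logits, temperature=1.0, top_k=0, top_p=1.0, seed=None):
    if temperature == 0:                      # greedy decoding: argmax
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    probs = [math.exp(s - m) for s in scaled]  # stable softmax
    total = sum(probs)
    probs = [p / total for p in probs]

    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    if top_k > 0:
        order = order[:top_k]                 # keep k most likely tokens
    if top_p < 1.0:                           # keep the smallest prefix
        kept, cum = [], 0.0                   # with cumulative prob >= P
        for i in order:
            kept.append(i)
            cum += probs[i]
            if cum >= top_p:
                break
        order = kept

    total = sum(probs[i] for i in order)      # renormalize survivors
    r = random.Random(seed).random() * total
    for i in order:
        r -= probs[i]
        if r <= 0:
            return i
    return order[-1]

logits = [5.0, 3.0, 1.0, 0.5]
greedy = sample(logits, temperature=0)        # deterministic: token 0
```

Note that `temperature=0` bypasses sampling entirely, which is why it is the right setting for factual tasks where reproducibility matters.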
In static batching, all requests in a batch must wait for the longest one to finish. Continuous batching (iteration-level scheduling) lets requests leave and enter the batch at every decode step, dramatically improving GPU utilization.
Problem: R1 finishes in 10 steps but waits for R2 (50 steps). GPU cycles wasted on padding.
Solution: When R1 finishes at step 10, R4 immediately joins the batch. No GPU cycles wasted.
Static throughput: ~15 req/s. Continuous throughput: ~45 req/s. Improvement: ~3×.
Analogy — Traffic Controller: Continuous batching is like an air traffic controller managing a busy runway. Instead of waiting for all planes in a group to land before accepting new ones (static batching), the controller slots new arrivals into open landing windows as soon as a previous plane clears the runway. No GPU cycle is wasted waiting for the slowest request in a batch to finish.
vLLM, TensorRT-LLM, and SGLang all implement continuous batching. It is the single most impactful serving optimization — enabling 2–4× throughput improvement with no loss in per-request latency.
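A toy step-level simulation shows why iteration-level scheduling wins. The workload and batch size are illustrative, and the exact speedup depends on the mix of request lengths (a heavy-tailed mix approaches the ~3× figure above):

```python
# Toy decode-step simulator contrasting static and continuous batching.
# Each request needs a given number of decode steps; the GPU runs up to
# batch_size requests per step.

def static_batching_steps(lengths, batch_size):
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])  # batch waits for longest
    return steps

def continuous_batching_steps(lengths, batch_size):
    queue, active, steps = list(lengths), [], 0
    while queue or active:
        while queue and len(active) < batch_size:  # refill freed slots
            active.append(queue.pop(0))            # at every step
        active = [n - 1 for n in active if n > 1]  # one decode step each
        steps += 1
    return steps

lengths = [10, 50, 10, 10, 50, 10, 10, 10]  # two slow requests in the mix
static = static_batching_steps(lengths, batch_size=4)
continuous = continuous_batching_steps(lengths, batch_size=4)
```

In static mode every batch pays for its slowest member (two batches of 50 steps each); in continuous mode short requests drain out and new ones slot in immediately, finishing the same work in far fewer GPU steps.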
Speculative decoding uses a small, fast “draft” model to generate candidate tokens, then the large target model verifies all candidates in a single forward pass. When the draft model guesses correctly, you get multiple tokens for the cost of one.
Draft: a small model (e.g. 7B) generates 5 candidate tokens at ~200 tokens/sec.
Effective speedup: 3.8× (3.8 tokens accepted per verify step).
Why is verification cheap?
Verifying N tokens costs the same as generating 1 token — the target model processes all N in parallel (like prefill), not sequentially.
When does it help most?
When the draft model has high agreement with the target model — e.g., for boilerplate code, formulaic text, or repetitive patterns.
What's the catch?
You need a good draft model. If acceptance rate drops below ~50%, the overhead of running two models exceeds the speedup.
Analogy — Express Lane: Speculative decoding is like the express checkout at a grocery store. A quick cashier (draft model) scans your items rapidly, guessing prices. A supervisor (target model) checks the receipt in one pass — if the guesses are right, you leave immediately. If any price is wrong, only that item gets re-scanned. Most items are common (bread, milk), so the guess is usually right, and you get through 2–3× faster.
Speculative decoding is mathematically lossless — the output distribution is identical to running the target model alone. It's a pure latency optimization: same quality, 2–3× faster for suitable workloads.
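The "3.8 tokens per verify step" figure has a simple closed form if we assume, for illustration, that each draft token is accepted independently with a fixed probability (real acceptance rates vary by position and content):

```python
# Expected tokens emitted per verify step in speculative decoding, under
# the simplifying assumption of an independent per-token acceptance rate.
# With k draft tokens, every accepted prefix token counts, plus one token
# the target model always contributes (a correction or a bonus token):
# E = (1 - accept**(k + 1)) / (1 - accept).

def expected_tokens_per_step(accept: float, k: int) -> float:
    if accept >= 1.0:
        return k + 1
    return (1 - accept ** (k + 1)) / (1 - accept)

high = expected_tokens_per_step(accept=0.8, k=5)  # strong draft agreement
low = expected_tokens_per_step(accept=0.5, k=5)   # near the break-even zone
```

At 80% acceptance the target model emits ~3.7 tokens per pass, close to the 3.8 above; at 50% it drops toward ~2, which is where the overhead of running two models starts to erase the gain.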
Understanding GPU hardware constraints is essential for inference optimization. The NVIDIA H100 is the current workhorse for large-scale LLM serving. Here's why memory bandwidth — not compute — is the bottleneck during decode.
During decode, each token requires reading the entire model weights from HBM. For a 70B parameter model at FP16:
Model size: 70B × 2 bytes = 140 GB
Bandwidth: 3.35 TB/s = 3,350 GB/s
Max tokens/s: 3,350 / 140 ≈ 24 tokens/s
This is the theoretical ceiling for single-request decode on one H100.
Tensor Cores: Matrix Multiply in Hardware
A tensor core multiplies a 4×4 tile of Matrix A by a 4×4 tile of Matrix B and produces the result in a single clock cycle (fused multiply-accumulate).
[Multi-GPU calculator, single-GPU setting: model per GPU 140.0 GB, total bandwidth 3.35 TB/s, max tokens/s (theory) N/A. ⚠ 140.0 GB exceeds the 80 GB per-GPU HBM capacity: the model does not fit.]
Tensor parallelism splits each layer's weight matrices across GPUs, requiring an all-reduce after every layer. NVLink's 900 GB/s bandwidth makes this feasible, but communication overhead means scaling is sub-linear — 8 GPUs give roughly 5–6× the throughput of one.
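The trade-off can be sketched with a small calculator. The 0.7 efficiency factor is an illustrative stand-in for all-reduce overhead, chosen to roughly match the 5–6× figure for 8 GPUs; real scaling depends on interconnect topology and sequence lengths.

```python
# Per-GPU memory and theoretical decode ceiling under tensor parallelism:
# weights split N ways, so each GPU streams 1/N of the model and aggregate
# bandwidth grows with N, minus communication overhead.

def tp_ceiling(model_gb, n_gpus, hbm_gb=80, bw_tb_s=3.35, efficiency=0.7):
    per_gpu = model_gb / n_gpus
    fits = per_gpu < hbm_gb       # in practice, leave headroom for KV-cache
    ideal_tok_s = n_gpus * bw_tb_s * 1000 / model_gb
    return per_gpu, fits, ideal_tok_s * efficiency

per_gpu1, fits1, _ = tp_ceiling(140, n_gpus=1)   # 140 GB: does not fit
per_gpu2, fits2, tok_s2 = tp_ceiling(140, n_gpus=2)  # 70 GB per GPU: fits
```

Doubling GPUs halves the per-GPU weight footprint and (ideally) doubles aggregate bandwidth, but the efficiency factor is why 8 GPUs deliver roughly 5–6× rather than 8×.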
Running LLMs in production requires real-time visibility into latency, throughput, GPU utilization, and queue health. The right metrics and alerts turn a black box into a manageable system.
[Live metrics: TTFT P50 180 ms · Tokens/sec 52.0 · Latency P50 45 ms · Latency P95 210 ms · GPU utilization 82% · Queue depth 12]
TTFT P95 (SLA < 300 ms): 252 ms ✓ OK
TPS P50 (SLA > 40 tok/s): 52.0 tok/s ✓ OK
Error rate (SLA < 0.1%): 0.03% ✓ OK
GPU memory (SLA < 90%): 78% ✓ OK
Queue depth (SLA < 25): 12 ✓ OK
The most important production metric is TTFT P95 — it directly impacts user-perceived responsiveness. If TTFT spikes, check: (1) queue depth (requests waiting for GPU), (2) prompt length (long prefills), or (3) KV-cache eviction (cache thrashing under memory pressure).
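Computing a P95 from a window of latency samples is straightforward; this sketch uses the simple nearest-rank method (monitoring systems typically use streaming quantile estimators instead), with synthetic data to show a tail-driven SLA breach:

```python
# TTFT P95 over a sample window using the nearest-rank method: the value
# at or below which 95% of requests fall.

def p95(samples_ms):
    ordered = sorted(samples_ms)
    rank = max(int(0.95 * len(ordered)) - 1, 0)  # nearest-rank index
    return ordered[rank]

# 100 synthetic TTFT samples: mostly fast, with a slow tail.
samples = [120] * 90 + [400] * 10
ttft_p95 = p95(samples)
alert = ttft_p95 > 300   # breaches the example < 300 ms SLA
```

Note how the median here is a healthy 120 ms while the P95 is 400 ms: tail percentiles surface exactly the queue-depth and long-prefill spikes the text tells you to check for.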
I architect enterprise inference pipelines — optimized serving, multi-GPU deployment, observability, and the infrastructure to run LLMs reliably at scale.
Get In Touch