Deep Dive

The Inference Pipeline

From HTTP request to streamed tokens — an interactive visual journey through prefill, decode, KV-cache, sampling, batching, speculative decoding, and the GPU hardware that makes it all possible.

This guide covers what happens after the model is trained: how a prompt becomes a response, why decode is memory-bound, how serving frameworks maximize throughput, and what to monitor in production.

20 min read · By Hammad Abbasi · Interactive
01

Request Lifecycle

Every LLM API call travels through a deterministic pipeline before a single token is generated. Understanding this end-to-end flow is critical for debugging latency and optimizing throughput.

End-to-end request flow

Browser (0 ms) → API Gateway (~2 ms) → Auth (~5 ms) → Context Assembly (~10 ms) → Tokenize (~1 ms) → Model (Prefill + Decode) (~200–2000 ms) → Detokenize (~0.5 ms) → Stream to Client (continuous)

Browser (0 ms): the user types a prompt and hits send. The client packages the message into an HTTP POST request with the conversation history and sends it to the API endpoint.

Analogy — Restaurant Kitchen: An inference request flows like a restaurant order. The waiter takes your order (API gateway), the host checks your reservation (auth), the kitchen preps ingredients (context assembly + tokenization), the chef cooks (prefill + decode), and the food is plated and delivered course by course (streaming). The kitchen (GPU) is always the bottleneck — everything else takes seconds compared to cooking time.

The total wall-clock time is dominated by the model (prefill + decode). Everything else — auth, tokenization, detokenization — adds single-digit milliseconds. Optimizing inference means optimizing the model step.
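The stage budgets in the diagram above can be tallied in a few lines to see just how dominant the model step is. A minimal Python sketch, using the diagram's illustrative timings (the model figure here is a mid-range 500 ms, not a measurement):

```python
# Illustrative per-stage latency budget (ms), taken from the diagram above.
STAGES = {
    "api_gateway": 2.0,
    "auth": 5.0,
    "context_assembly": 10.0,
    "tokenize": 1.0,
    "model_prefill_decode": 500.0,  # anywhere from ~200 to ~2000 ms
    "detokenize": 0.5,
}

total = sum(STAGES.values())
model_share = STAGES["model_prefill_decode"] / total
print(f"total: {total:.1f} ms, model share: {model_share:.0%}")
```

Even with the model at the low end of its range, it accounts for well over 90% of wall-clock time, which is why the rest of this guide focuses on the model step.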

02

Context Assembly

Before the model sees a single token, the serving layer assembles the full context: system prompt, conversation history, retrieved documents, and the user's message — all within a strict token budget.

Token budget calculator
System Prompt: 500 tokens · Conversation History: 2,000 tokens · RAG Context: 4,000 tokens · User Message: 200 tokens · Reserved Output: 4,096 tokens
✓ 6,700 / 28,672 input tokens used (21,972 remaining)

Truncation Strategy

RAG context often consumes the largest share of the token budget. A common production pattern is to over-retrieve, then re-rank and truncate to fit the budget — balancing relevance against context length.
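The over-retrieve, re-rank, truncate pattern can be sketched in a few lines of Python. The function name `fit_rag_context` and all budget numbers are illustrative, not a specific library's API; chunks are assumed to arrive with a relevance score from a re-ranker:

```python
def fit_rag_context(chunks, system=500, history=2000, user=200,
                    context_window=32768, reserved_output=4096):
    """Greedily keep the highest-scoring RAG chunks that fit the budget.

    `chunks` is a list of (relevance_score, token_count) pairs. The input
    budget is the context window minus reserved output tokens and the
    fixed parts of the prompt (system, history, user message).
    """
    budget = context_window - reserved_output - system - history - user
    kept, used = [], 0
    for score, tokens in sorted(chunks, key=lambda c: -c[0]):
        if used + tokens <= budget:
            kept.append((score, tokens))
            used += tokens
    return kept, used

# Over-retrieved chunks (score, tokens); only what fits survives truncation.
chunks = [(0.9, 1200), (0.8, 900), (0.7, 1500), (0.5, 2000), (0.4, 800)]
kept, used = fit_rag_context(chunks, context_window=8192)
print(len(kept), used)
```

With an 8K window, the budget left for RAG here is only 1,396 tokens, so just the top chunk survives, exactly the relevance-versus-length trade-off described above.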

03

The Prefill Phase

The prefill phase processes the entire input prompt in a single forward pass. All input tokens are processed in parallel — this is where the GPU's massive parallelism shines. It's compute-bound, not memory-bound.

Prefill: parallel processing
All tokens processed simultaneously in one forward pass

Compute-Bound

GPU tensor cores are the bottleneck — high arithmetic intensity

KV-Cache Population

Keys and Values for every input token are computed and stored

TTFT: time to first token
At a 2,048-token prompt, estimated TTFT is ~37 ms. By prompt length: 128 tokens → ~2 ms · 2K → ~37 ms · 8K → ~147 ms · 16K → ~295 ms

TTFT (Time to First Token) = prefill time + scheduling overhead. It scales roughly linearly with prompt length because the full prompt must be processed before the first output token is generated.

Long system prompts and RAG contexts directly increase TTFT. Prompt caching (reusing KV-cache from identical prompt prefixes) can eliminate redundant prefill computation for repeated system prompts.
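The roughly linear scaling above is easy to model. A minimal sketch, assuming a constant per-token prefill cost; the rate of 0.018 ms/token is back-solved from the figures above (2K tokens → ~37 ms) and in practice depends on model size, GPU, and batching:

```python
def estimate_ttft_ms(prompt_tokens, ms_per_token=0.018, overhead_ms=0.0):
    """Linear TTFT model: per-token prefill cost plus scheduling overhead.

    Assumes prefill time scales linearly with prompt length, which holds
    approximately until attention's quadratic term starts to dominate at
    very long contexts.
    """
    return prompt_tokens * ms_per_token + overhead_ms

for n in (128, 2048, 8192, 16384):
    print(f"{n:>6} tokens -> {estimate_ttft_ms(n):6.1f} ms")
```

This reproduces the 2 ms / 37 ms / 147 ms / 295 ms ladder above, and makes the prompt-caching payoff concrete: a cached 8K system prompt removes ~147 ms from every request's TTFT.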

04

The Decode Phase

After prefill, the model switches to decode mode: generating one token at a time, autoregressively. Each new token depends on all previous tokens. This phase is memory-bandwidth-bound, not compute-bound.

Autoregressive generation (animated: tokens are generated one at a time)

Why decode is memory-bound

Tokens / second: ~55. Typical for a 70B model served on H100s with batching and quantization; a single FP16 request is capped near ~24 tok/s by memory bandwidth, as shown below.

Each decode step:

1. Read the entire model weights from HBM (~140 GB for 70B FP16)
2. Read the KV-cache for all previous tokens
3. Run matrix-vector multiplies through every layer for the single new token
4. Write one new KV entry and emit one token

Arithmetic Intensity: ~1 FLOP/byte (very low; memory-bound)

Bottleneck: HBM bandwidth (3.35 TB/s on H100)

Decode throughput is almost entirely limited by how fast you can read model weights from GPU memory. A 70B FP16 model at 140 GB, with 3.35 TB/s bandwidth, gives a theoretical ceiling of ~24 tokens/sec per request. Batching multiple requests amortizes this cost.
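That ceiling is a one-line roofline calculation. A sketch in Python (H100 bandwidth from the source; quantization changes `bytes_per_param`):

```python
def decode_ceiling_tokens_per_s(params_b, bytes_per_param=2, hbm_tb_s=3.35):
    """Roofline ceiling for single-request decode.

    Every decode step must stream all model weights from HBM, so
    tokens/s <= memory bandwidth / model size in bytes.
    params_b is the parameter count in billions.
    """
    model_gb = params_b * bytes_per_param   # 1e9 params * bytes = GB
    return hbm_tb_s * 1000 / model_gb       # GB/s divided by GB

print(round(decode_ceiling_tokens_per_s(70)))                    # FP16: ~24
print(round(decode_ceiling_tokens_per_s(70, bytes_per_param=1)))  # FP8: ~48
```

This also shows why quantization helps decode directly: halving bytes per parameter doubles the bandwidth-bound ceiling.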

Animated — Autoregressive Token Generation: tokens appear one at a time ("The quick brown fox jumps over the lazy …"); each token requires a full forward pass through all layers, using the KV-cache from previous tokens.
05

KV-Cache Deep Dive

The KV-cache stores precomputed Key and Value tensors so they don't need to be recomputed at every decode step. Without it, generation would be O(n²) per token. With it, it's O(n).

KV-Cache size calculator
At a sequence length of 4,096 tokens: 2 × 80 layers × 8 KV heads × 128 head dim × 4,096 tokens × 2 bytes = 1.25 GB of the 80 GB H100 HBM3.

Without cache vs with cache

Without KV-Cache: O(n²) per token generated. At seq 4,096: ~16.8M attention ops.

With KV-Cache: O(n) per token generated. At seq 4,096: 4,096 attention ops.

Without caching, each new token would recompute attention over all previous tokens from scratch. The KV-cache stores prior K and V projections so each decode step only computes the new token's attention against the cached history.
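The op counts above follow directly from the asymptotics. A minimal sketch counting only attention-score computations (a deliberate simplification that ignores the projection matmuls):

```python
def attention_ops(seq_len, cached=True):
    """Attention score computations needed to emit the token at position
    seq_len. With the KV-cache, only the new token attends to seq_len
    cached entries (O(n)); without it, every position's attention over its
    prefix must be recomputed from scratch (O(n^2))."""
    return seq_len if cached else seq_len * seq_len

print(f"{attention_ops(4096, cached=False):,}")  # without cache: ~16.8M
print(f"{attention_ops(4096, cached=True):,}")   # with cache: 4,096
```

The ratio at 4,096 tokens is 4,096×, and it keeps growing with sequence length, which is why the cache is non-negotiable in practice.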

PagedAttention (vLLM): pages P0–P4 allocated to active sequences; remaining page slots free.

PagedAttention manages KV-cache like virtual memory: allocating fixed-size pages on demand instead of reserving a contiguous block for the maximum sequence length. This eliminates memory fragmentation and enables near-zero waste, increasing batch sizes by 2–4×.
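The idea can be illustrated with a toy allocator. This is a sketch of the concept only, not vLLM's actual block manager; the class names and the 16-token page size are assumptions for illustration:

```python
class PagePool:
    """Toy paged KV-cache allocator: pages are grabbed on demand, one at a
    time, instead of reserving max-sequence-length memory up front."""

    def __init__(self, num_pages, page_tokens=16):
        self.page_tokens = page_tokens
        self.free = list(range(num_pages))   # free page ids
        self.tables = {}                      # request id -> (pages, tokens)

    def append_token(self, req_id):
        pages, used = self.tables.get(req_id, ([], 0))
        if used == len(pages) * self.page_tokens:  # current page is full
            pages = pages + [self.free.pop()]      # allocate one new page
        self.tables[req_id] = (pages, used + 1)

pool = PagePool(num_pages=8)
for _ in range(40):                  # 40 tokens -> ceil(40/16) = 3 pages
    pool.append_token("r1")
pages, used = pool.tables["r1"]
print(len(pages), len(pool.free))    # 3 pages in use, 5 still free
```

The key property: a request holding 40 tokens occupies exactly 3 pages, not a max-length reservation, so the 5 free pages remain available to other requests in the batch.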

Analogy — Assembly Line with Memory: The KV-cache is like a factory assembly line where each worker remembers every product they've already processed. Without the cache, worker #50 would have to re-inspect all 49 previous products before handling #51. With it, each worker just recalls their notes from the previous products and only inspects the new one. The savings compound: generating the 1000th token is just as fast as the 2nd, instead of 1000× slower.

KV-cache memory is often the binding constraint on batch size. By the calculator above, a 70B-class model needs ~1.25 GB of FP16 KV-cache per 4K tokens, so a full 128K-token context consumes ~40 GB per request. And since the FP16 weights alone (~140 GB) exceed 80 GB of HBM, the model must be quantized or sharded to fit at all; even then, the memory left after weights may accommodate only a handful of concurrent long-context requests.

06

Sampling Strategies

The model outputs raw logits (unnormalized scores) for every token in its vocabulary. Sampling strategies transform these logits into a probability distribution and then select the next token. Different settings produce dramatically different outputs.

Sampling parameters
Temperature (1.00): controls randomness; low = deterministic, high = creative.
Top-K (0): keep only the top K tokens; 0 = disabled.
Top-P, nucleus (1.00): keep tokens until cumulative probability ≥ P.
Min-P (0.00): drop tokens with prob < min_p × max_prob.
Repetition Penalty (1.00): penalize tokens that appeared recently.

Token probability distribution
Prompt: "The capital of France is"
Paris 85.6% · Lyon 3.9% · the 2.9% · a 2.1% · France 1.6% · Berlin 1.2% · city 1.0% · located 0.8% · known 0.6% · called 0.5%
Active tokens: 10 / 10 · Entropy: 1.01 bits

Analogy — Rolling Dice with Loaded Sides: Sampling from an LLM is like rolling dice where each side has a different weight. Temperature controls how loaded the dice are: at T=0 it's a certainty (always lands on the heaviest side), at T=1 it's a fair roll weighted by probability, and at T=2 it's nearly uniform — any side could come up. Top-p removes the rarest sides entirely, while top-k limits you to only the k heaviest sides.

Temperature and top-p are the two most impactful parameters. For factual tasks (coding, math), use temperature 0–0.3 with no top-p filtering. For creative writing, temperature 0.8–1.2 with top-p 0.9 gives good diversity without incoherence.
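The full logits-to-token path can be sketched in pure Python. A minimal reference implementation of the temperature → top-k → top-p pipeline described above (production samplers vectorize this over the whole vocabulary; the logit values below are made up):

```python
import math
import random

def sample_next(logits, temperature=1.0, top_k=0, top_p=1.0, rng=None):
    """Sample one token: temperature scaling, softmax, top-k, then top-p.

    `logits` maps token -> raw score. temperature=0 is treated as greedy
    decoding (argmax), matching common API conventions.
    """
    if temperature == 0:
        return max(logits, key=logits.get)
    # Softmax with temperature (max-subtraction for numerical stability).
    scaled = {t: l / temperature for t, l in logits.items()}
    m = max(scaled.values())
    exps = {t: math.exp(s - m) for t, s in scaled.items()}
    z = sum(exps.values())
    probs = sorted(((t, e / z) for t, e in exps.items()),
                   key=lambda kv: -kv[1])
    if top_k > 0:                        # keep only the k most likely tokens
        probs = probs[:top_k]
    if top_p < 1.0:                      # nucleus: keep until cum prob >= P
        kept, cum = [], 0.0
        for t, p in probs:
            kept.append((t, p))
            cum += p
            if cum >= top_p:
                break
        probs = kept
    tokens, weights = zip(*probs)        # choices() renormalizes the weights
    return (rng or random).choices(tokens, weights=weights, k=1)[0]

logits = {"Paris": 5.0, "Lyon": 1.9, "the": 1.6, "Berlin": 0.7}
print(sample_next(logits, temperature=0))   # greedy: always "Paris"
```

Note that after top-k/top-p filtering the surviving probabilities are simply renormalized, which is exactly why low top-p makes output more deterministic: the nucleus often collapses to one or two tokens.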

07

Continuous Batching

In static batching, all requests in a batch must wait for the longest one to finish. Continuous batching (iteration-level scheduling) lets requests leave and enter the batch at every decode step, dramatically improving GPU utilization.

Static batching
R1: 10 tokens, then idle 40 steps · R2: 50 tokens · R3: 30 tokens, then idle 20 steps

Problem: R1 finishes in 10 steps but waits for R2 (50 steps). GPU cycles wasted on padding.

Continuous batching
R1: 10 tokens · R2: 50 tokens · R3: 30 tokens

Solution: When R1 finishes at step 10, R4 immediately joins the batch. No GPU cycles wasted.

Static Throughput: ~15 req/s · Continuous Throughput: ~45 req/s · Improvement: ~3×

Analogy — Traffic Controller: Continuous batching is like an air traffic controller managing a busy runway. Instead of waiting for all planes in a group to land before accepting new ones (static batching), the controller slots new arrivals into open landing windows as soon as a previous plane clears the runway. No GPU cycle is wasted waiting for the slowest request in a batch to finish.

vLLM, TensorRT-LLM, and SGLang all implement continuous batching. It is the single most impactful serving optimization — enabling 2–4× throughput improvement with no loss in per-request latency.
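The gap can be reproduced with a toy step-level simulation. This is a sketch of the scheduling idea only (real schedulers also juggle prefill, preemption, and KV-cache memory); token counts extend the example above with an R4 of 40 tokens:

```python
def static_batch_steps(jobs, batch_size):
    """Static batching: each batch of requests runs until its longest
    member finishes; shorter requests pad out the batch, idle."""
    return sum(max(jobs[i:i + batch_size])
               for i in range(0, len(jobs), batch_size))

def continuous_batch_steps(jobs, batch_size):
    """Iteration-level scheduling: every decode step, finished requests
    leave and queued requests immediately take their slots."""
    pending, active, steps = list(jobs), [], 0
    while pending or active:
        while pending and len(active) < batch_size:
            active.append(pending.pop(0))      # refill freed slots
        steps += 1                             # one decode step for all
        active = [j - 1 for j in active if j > 1]
    return steps

jobs = [10, 50, 30, 40]  # remaining output tokens per request
print(static_batch_steps(jobs, batch_size=3))      # 50 + 40 = 90 steps
print(continuous_batch_steps(jobs, batch_size=3))  # R4 reuses R1's slot: 50
```

With batch size 3, static batching needs 90 decode steps to clear the queue while continuous batching needs 50, because R4 slips into R1's freed slot at step 11 instead of waiting for a whole new batch.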

08

Speculative Decoding

Speculative decoding uses a small, fast “draft” model to generate candidate tokens, then the large target model verifies all candidates in a single forward pass. When the draft model guesses correctly, you get multiple tokens for the cost of one.

Draft → Verify → Accept/Reject

Small model (e.g. 7B) generates 5 candidate tokens at ~200 tokens/sec

Tokens: is · a · beautiful · city · in
Speedup calculator
At a 70% acceptance rate: effective speedup ~3.8×, ~3.8 tokens per verify step.

Why it works

Why is verification cheap?

Verifying N tokens costs the same as generating 1 token — the target model processes all N in parallel (like prefill), not sequentially.

When does it help most?

When the draft model has high agreement with the target model — e.g., for boilerplate code, formulaic text, or repetitive patterns.

What's the catch?

You need a good draft model. If acceptance rate drops below ~50%, the overhead of running two models exceeds the speedup.

Analogy — Express Lane: Speculative decoding is like the express checkout at a grocery store. A quick cashier (draft model) scans your items rapidly, guessing prices. A supervisor (target model) checks the receipt in one pass — if the guesses are right, you leave immediately. If any price is wrong, only that item gets re-scanned. Most items are common (bread, milk), so the guess is usually right, and you get through 2–3× faster.

Speculative decoding is mathematically lossless — the output distribution is identical to running the target model alone. It's a pure latency optimization: same quality, 2–3× faster for suitable workloads.
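One common way to model the payoff: assume each draft token is accepted independently with probability alpha, drafting stops at the first rejection, and the target model always contributes one token of its own (the correction or bonus token). Under that model the expected tokens per verify step is (1 − alpha^(k+1)) / (1 − alpha). This is a simplification; the calculator above evidently uses its own internal model, so the numbers differ:

```python
def expected_tokens_per_step(alpha, k):
    """Expected tokens emitted per verify step with k draft tokens,
    assuming i.i.d. acceptance with probability alpha and that the target
    model always supplies one token itself (geometric-series sum)."""
    if alpha == 1.0:
        return k + 1
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# 5 draft tokens at 70% acceptance -> ~2.9 tokens per target forward pass
print(round(expected_tokens_per_step(0.7, 5), 2))
```

Since each verify step costs roughly one normal decode step, the expected tokens per step is also the approximate speedup, ignoring the draft model's own (much smaller) cost.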

09

GPU Architecture for LLMs

Understanding GPU hardware constraints is essential for inference optimization. The NVIDIA H100 is the current workhorse for large-scale LLM serving. Here's why memory bandwidth — not compute — is the bottleneck during decode.

NVIDIA H100 SXM specifications
GPU Memory (HBM3): 80 GB · Memory Bandwidth: 3.35 TB/s · FP16 Tensor Core: 989 TFLOPS · FP8 Tensor Core: 1,979 TFLOPS · NVLink Bandwidth: 900 GB/s
Memory bandwidth bottleneck

During decode, each token requires reading the entire model weights from HBM. For a 70B parameter model at FP16:

Model size: 70B × 2 bytes = 140 GB

Bandwidth: 3.35 TB/s = 3,350 GB/s

Max tokens/s: 3,350 / 140 ≈ 24 tokens/s

This is the theoretical ceiling for single-request decode on one H100.

Tensor Cores: Matrix Multiply in Hardware

A 4×4 tile of Matrix A times a 4×4 tile of Matrix B is computed as a fused multiply-accumulate in a single tensor-core instruction, rather than as dozens of scalar operations.

Tensor parallelism: split model across GPUs
At a GPU count of 1 × H100: model per GPU 140.0 GB, total bandwidth 3.35 TB/s, max tokens/s (theory) N/A. 140.0 GB exceeds the 80 GB per-GPU HBM capacity; the model does not fit.

Tensor parallelism splits each layer's weight matrices across GPUs, requiring an all-reduce after every layer. NVLink's 900 GB/s bandwidth makes this feasible, but communication overhead means scaling is sub-linear — 8 GPUs give roughly 5–6× the throughput of one.
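The fit-and-ceiling check above generalizes to any GPU count. A sketch, assuming perfect weight sharding and ignoring KV-cache, activations, and the all-reduce overhead that makes real scaling sub-linear:

```python
def tensor_parallel_fit(model_gb, gpus, hbm_gb=80.0, hbm_tb_s=3.35):
    """Per-GPU weight shard and aggregate bandwidth roofline under tensor
    parallelism. Each GPU streams only its shard, so the theoretical decode
    ceiling scales with GPU count; returns (per_gpu_gb, fits, tokens_per_s)
    with tokens_per_s None when the shard does not fit in HBM."""
    per_gpu = model_gb / gpus
    fits = per_gpu <= hbm_gb
    ceiling = gpus * hbm_tb_s * 1000 / model_gb if fits else None
    return per_gpu, fits, ceiling

print(tensor_parallel_fit(140, 1))  # 140 GB/GPU: does not fit on one H100
print(tensor_parallel_fit(140, 2))  # 70 GB/GPU: fits, ~48 tok/s ceiling
```

Two GPUs is the minimum for 70B FP16, and the ~48 tok/s figure is the idealized ceiling; the communication overhead discussed above pulls real multi-GPU scaling below this line.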

10

Observability & Production

Running LLMs in production requires real-time visibility into latency, throughput, GPU utilization, and queue health. The right metrics and alerts turn a black box into a manageable system.

Live inference dashboard (simulated)
TTFT (P50): 180 ms · Tokens / sec: 52.0 · Latency P50: 45 ms · Latency P95: 210 ms · GPU Utilization: 82% · Queue Depth: 12

Autoscaling: GPU allocation
At a request volume of 50 req/s: required 2 × H100 at ~45 req/s/GPU with continuous batching.
SLA monitoring & alerts
TTFT P95: SLA < 300 ms, current 252 ms, ✓ OK
TPS P50: SLA > 40 tok/s, current 52.0 tok/s, ✓ OK
Error Rate: SLA < 0.1%, current 0.03%, ✓ OK
GPU Memory: SLA < 90%, current 78%, ✓ OK
Queue Depth: SLA < 25, current 12, ✓ OK

The most important production metric is TTFT P95 — it directly impacts user-perceived responsiveness. If TTFT spikes, check: (1) queue depth (requests waiting for GPU), (2) prompt length (long prefills), or (3) KV-cache eviction (cache thrashing under memory pressure).
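An SLA table like the one above reduces to a small data structure plus one check function. A sketch, with metric names and thresholds mirroring the simulated dashboard (a real system would pull `value` from a metrics backend rather than hard-code it):

```python
# metric -> (current value, comparison direction, SLA threshold)
SLAS = {
    "ttft_p95_ms":    (252.0, "<", 300.0),
    "tps_p50":        (52.0,  ">", 40.0),
    "error_rate_pct": (0.03,  "<", 0.1),
    "gpu_mem_pct":    (78.0,  "<", 90.0),
    "queue_depth":    (12,    "<", 25),
}

def breaches(slas):
    """Return the names of metrics currently violating their SLA."""
    bad = []
    for name, (value, op, limit) in slas.items():
        ok = value < limit if op == "<" else value > limit
        if not ok:
            bad.append(name)
    return bad

print(breaches(SLAS))  # empty list: all green, matching the dashboard
```

Wiring this into an alerting loop is then just a matter of refreshing the values and paging when `breaches()` returns a non-empty list, with TTFT P95 as the metric to watch first.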

Building LLM Infrastructure?

I architect enterprise inference pipelines — optimized serving, multi-GPU deployment, observability, and the infrastructure to run LLMs reliably at scale.

Get In Touch