Deep Dive

How Large Language Models Work

At its core, an LLM is a next-token predictor.

Modern assistants are built on top of that core and then shaped by post-training methods such as instruction tuning, RLHF, Constitutional AI, or DPO — and can be further improved at inference time through techniques like chain-of-thought prompting and additional test-time compute. This is the interactive, visual journey through how all of it works.

25 min read · By Hammad Abbasi · Interactive
01

The AI Family Tree

Before diving into how LLMs work, let's place them in context. LLMs sit at a specific point in a hierarchy of increasingly specialized AI techniques.

Full AI Landscape — Visual Overview (interactive AI taxonomy diagram)

Key Milestones

1958 — Frank Rosenblatt builds the perceptron, a single artificial neuron that can learn to classify inputs.

LLMs sit at a very specific intersection: they are transformer-based deep learning models applied to NLP, trained with massive compute. Deep learning itself is a cross-cutting technique — it powers vision, speech, and robotics too. Understanding where LLMs fit in this landscape helps you see what they are (and aren't) good at.

02

Neural Networks 101

Neural networks are the foundation of deep learning. They are composed of simple units — neurons — connected in layers. Each neuron takes inputs, applies weights, sums them, and passes the result through an activation function.

A Single Neuron
x1 = 1.0 × 0.7 = 0.70
x2 = 0.5 × −0.3 = −0.15
x3 = 0.8 × 0.5 = 0.40
Σ = 0.95 → f(0.95) = 0.95 (ReLU)

inputs × weights → sum → activation function → output
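The diagram's multiply-add-activate pipeline fits in a few lines of NumPy. A minimal sketch using the example's weights (illustrative values, not from any trained model):

```python
import numpy as np

def neuron(x, w, activation=lambda z: np.maximum(0, z)):
    """inputs × weights → sum → activation function → output (ReLU by default)."""
    return activation(np.dot(x, w))

x = np.array([1.0, 0.5, 0.8])   # inputs x1, x2, x3
w = np.array([0.7, -0.3, 0.5])  # weights from the diagram
out = neuron(x, w)              # 0.70 - 0.15 + 0.40 = 0.95 → ReLU → 0.95
```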

Activation Function Explorer
Input: 0.5 → Output: 0.500
relu(x) = max(0, x)

Rectified Linear Unit. Zero for negatives, linear for positives. Simple, fast, and the default for most networks.

Forward Pass — 3-Layer Network
Input → Hidden → Output

Signals flow left to right. Each neuron applies weights + activation. The pattern of activations at the output layer is the network's prediction.

Feedforward Architecture

A single neuron is just multiply-add plus a nonlinearity. The power comes from stacking millions of them in layers — each layer learns increasingly abstract features. A transformer with 70B parameters is just this concept scaled to an extreme.
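The same idea scales by stacking layers. A toy 3-layer forward pass with random weights (layer sizes here are illustrative, chosen only to show the shapes involved):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0, z)

# 3 inputs → 4 hidden units → 2 outputs
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 2)), np.zeros(2)

def forward(x):
    h = relu(x @ W1 + b1)   # hidden layer: weights, sum, activation
    return h @ W2 + b2      # output layer: raw scores (the prediction)

y = forward(np.array([1.0, 0.5, 0.8]))
```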

Under the Hood

Explore Neural Networks in Depth

Perceptrons, backpropagation, loss landscapes, CNNs, RNNs — the full interactive journey through the building blocks of deep learning.

03

Vector Spaces

Neural networks operate on numbers, not words. An embedding maps each word (or token) to a point in a high-dimensional vector space. Words with similar meanings end up near each other.

3D Word Embedding Space — clusters: royalty, animals, cities, code, people
Vector Arithmetic

king - man + woman = ?

Embeddings capture relationships as directions. The "royalty" direction from man→king is the same as woman→queen.

Cosine Similarity

Pick two words and see how similar their embeddings are. Cosine = 1 means identical direction, 0 means orthogonal.

Example: cosine similarity = 0.995 — very similar; these words share strong semantic overlap.

Real LLM embeddings use 4,096–12,288 dimensions, not 3. The 3D projection here is simplified to show the concept. In high-dimensional space, the cluster structure and arithmetic relationships hold even more precisely.
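Both demos can be reproduced in a few lines. The 3-D vectors below are invented for illustration (real embeddings are learned, and far higher-dimensional):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity: 1 = identical direction, 0 = orthogonal."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings, hand-crafted so that man→king and woman→queen share a direction:
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.2, 0.1]),
    "man":   np.array([0.1, 0.7, 0.0]),
    "woman": np.array([0.1, 0.1, 0.0]),
}

# Vector arithmetic: king - man + woman lands nearest queen
v = emb["king"] - emb["man"] + emb["woman"]
best = max(emb, key=lambda word: cosine(v, emb[word]))
```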

Under the Hood

Explore Embeddings in Depth

Word2Vec, contextual embeddings, cosine similarity, positional encodings, vector search, and RAG retrieval.

04

Loss & Gradient Descent

Neural networks learn by minimizing a loss function — a number that measures how wrong the model's predictions are. Gradient descent is the algorithm that adjusts weights to reduce this loss, step by step.

Training Loop — Gradient Descent Visualization

The ball follows the steepest downhill direction (negative gradient) toward the loss minimum. Each step = one weight update.
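The ball's motion is the update rule w ← w − lr · ∇L(w). A minimal sketch on a toy quadratic loss (the loss function and learning rate here are illustrative):

```python
# Toy loss L(w) = (w - 3)^2, whose minimum is at w = 3
def grad(w):
    return 2 * (w - 3)   # dL/dw

w, lr = 0.0, 0.1
for _ in range(100):
    w -= lr * grad(w)    # step in the negative-gradient (downhill) direction
```

After 100 steps w has converged to within a tiny distance of the minimum at 3.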

Learning Rate Comparison

Too low: the model learns, but painfully slowly — it might take 10× longer to converge, wasting compute and money. Too high: the steps overshoot the minimum and the loss oscillates or diverges.

Cross-Entropy Loss — The LLM Training Objective

During training, the model predicts a probability for the correct next token. Cross-entropy loss measures how far off that prediction is. Move the slider to see how the loss changes.

Loss = −log(P_correct) = −log(0.70) ≈ 0.357

The slider runs from P(correct) = 0.01 (wrong, loss ≈ 4.6) to P(correct) = 0.99 (confident, loss ≈ 0.01).

The entire pre-training objective is just cross-entropy loss applied to next-token prediction, summed over trillions of tokens. When the model assigns high probability to the correct next token, loss is low. Gradient descent nudges weights to make correct predictions more likely.
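The slider's number comes straight from the formula:

```python
import math

def cross_entropy(p_correct):
    """Loss for one next-token prediction: -log P(correct token)."""
    return -math.log(p_correct)

loss = cross_entropy(0.70)   # ≈ 0.357, matching the slider example
```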

05

Probability for LLMs

At their core, LLMs are probability machines. They model P(next token | all previous tokens). Understanding probability distributions, conditional probability, and entropy is essential to understanding how LLMs generate text.

Next-Token Probability Distribution
P(next token | "The cat sat on the"):
mat 35.0% · floor 25.0% · bed 15.0% · roof 10.0% · chair 8.0% · table 7.0%
Conditional Probability Chain

LLMs generate text by chaining conditional probabilities: P(sequence) = P(w1) × P(w2|w1) × P(w3|w1,w2) × ...

The (P = 0.12) × cat (P = 0.08) × sat (P = 0.15) × on (P = 0.45) × the (P = 0.72) × mat (P = 0.35)
P(sequence) = 1.633e-4
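The chain as arithmetic (conditional probabilities taken from the example, not from a real model):

```python
import math

# P(sequence) = P(w1) × P(w2|w1) × P(w3|w1,w2) × ...
cond_probs = {"The": 0.12, "cat": 0.08, "sat": 0.15, "on": 0.45, "the": 0.72, "mat": 0.35}
p_seq = math.prod(cond_probs.values())   # ≈ 1.633e-4
```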
Entropy & Perplexity

Adjust the distribution to see how entropy (uncertainty) and perplexity change. Lower entropy = the model is more confident.

Token A 60% · Token B 20% · Token C 15% · Token D 5%
Entropy: 1.53 bits · Perplexity: 2.9

Perplexity ≈ "how many equally likely choices." A perplexity of 1 means the model is perfectly certain. GPT-4 achieves ~10-15 perplexity on typical text.
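The widget's numbers, reproduced (the distribution values come from the sliders above):

```python
import math

def entropy_bits(p):
    """Shannon entropy in bits: -Σ p_i log2(p_i)."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

p = [0.60, 0.20, 0.15, 0.05]   # Token A–D
H = entropy_bits(p)             # ≈ 1.53 bits
perplexity = 2 ** H             # ≈ 2.9 "equally likely choices"
```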

Probability → Generation

Everything an LLM does — from writing code to answering questions — is just repeatedly sampling from a conditional probability distribution. The "intelligence" comes from the quality of that distribution, shaped by trillions of training tokens.

06

The Big Picture

An LLM at its core is a next-token predictor. Modern assistants layer instruction tuning, RLHF, and inference-time techniques on top. When you send a message, this is the end-to-end journey — from your keystrokes to the streamed response.

Inference Pipeline — Live Walkthrough
REQUEST → PREFILL → DECODE (autoregressive loop — repeats per token) → OUTPUT
Response: "The capital of France is Paris" · TTFT: ~350ms

The prefill phase processes the full prompt in parallel (compute-heavy, ~100–500ms). Then the decode loop generates tokens one at a time, each taking ~10–30ms thanks to KV-cache — no reprocessing of earlier tokens. A 500-word response involves ~700 decode iterations, streamed progressively through safety filters to your browser.

07

Training Pipeline

Before an LLM can answer a single question, it goes through an expensive, multi-stage training process. Here is what each stage does and why it matters.

Training Pipeline — Click Any Stage
Phase 1 of 7

Data Collection

Trillions of tokens are gathered from the public web (Common Crawl), digitized books (Books3, Gutenberg), open-source code (GitHub, Stack), Wikipedia, scientific papers, and purpose-built curated datasets. Quality pipelines deduplicate content, filter toxic material, remove PII, and balance domain representation. The resulting corpus for a frontier model (GPT-4 class) is estimated at 10–13 trillion tokens.

10T+ tokens · 100s of sources · 100+ languages

Data Sources Flowing Into Corpus — Web, Books, Code, Papers → 10T+ tokens collected
Data from multiple sources is deduplicated, filtered for quality, and balanced across domains before tokenization.

Under the Hood

Explore the Training Pipeline in Depth

Data pipelines, distributed training, RLHF, DPO, evaluation benchmarks, quantization, and deployment.

08

Tokenization

LLMs don't see words — they see tokens. A tokenizer (BPE / SentencePiece) splits text into sub-word units from a fixed vocabulary. This is the first transformation your query undergoes.

Interactive Tokenizer
INPUT TEXT: "Hello, how are you?" → BPE ENCODE → TOKENS (6 tokens):
"Hello" (ID 9906) · "," (ID 11) · " how" (ID 703) · " are" (ID 527) · " you" (ID 499) · "?" (ID 30)

How BPE Builds Its Vocabulary

1
Start with bytes

Initialize with 256 byte-level tokens

2
Count pairs

Find the most frequent adjacent token pair

3
Merge

Replace all occurrences with a new merged token

4
Repeat

Continue until vocabulary reaches target size
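The four steps above can be sketched as a toy merge loop — pure Python, starting from characters rather than raw bytes for readability:

```python
from collections import Counter

def bpe_merge_step(tokens):
    """One BPE training step: count adjacent pairs, merge the most frequent."""
    pairs = Counter(zip(tokens, tokens[1:]))
    if not pairs:
        return tokens, None
    (a, b), _ = pairs.most_common(1)[0]
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
            merged.append(a + b)   # replace the pair with a new merged token
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged, a + b

tokens = list("low low lowest")
for _ in range(3):                 # repeat until vocab reaches target size
    tokens, new_token = bpe_merge_step(tokens)
```

After three merges the frequent substring "low" has become a single token — exactly how real vocabularies come to contain whole common words.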

Token count directly determines cost, context window usage, and billing. The word "indistinguishable" costs 4 tokens while "the" costs 1. Code and non-English text are often less efficiently tokenized — a key reason multilingual performance varies.

Under the Hood

Explore Tokenization in Depth

BPE step-by-step, SentencePiece, vocabulary tradeoffs, multilingual challenges, and tokenization artifacts.

09

The Transformer

The transformer is the architecture that makes next-token prediction work at scale. Each layer transforms its input through attention and feed-forward operations — and a GPT-4 class model stacks 96–120 of these layers deep. Post-training (instruction tuning, RLHF, DPO) shapes what the model does with these predictions; the transformer is how it makes them.

Live Transformer Forward Pass
Input: The · capital · of · France · is
→ attention + residual → feed-forward + residual · × 96–120 layers deep
Predicted next token: "Paris" (p = 0.72) → "The capital of France is Paris"
Token + Position Embedding — Deep Dive

Each token ID is mapped to a high-dimensional vector (e.g., 12,288 dims for GPT-4 class). A positional encoding is added so the model knows token order. The result is a matrix of shape [sequence_length × hidden_size].

DIMENSIONS
token_embed[vocab_size × d_model] + pos_embed[max_seq_len × d_model]
GPT-4 class: d_model = 12,288 | vocab = 100K | max_seq = 128K
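A scaled-down sketch of the lookup-and-add (dimensions shrunk from the GPT-4-class figures above; the token IDs are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_seq_len, d_model = 1000, 64, 16   # toy sizes, not GPT-4 class
token_embed = rng.normal(size=(vocab_size, d_model))
pos_embed = rng.normal(size=(max_seq_len, d_model))

token_ids = np.array([17, 42, 7, 301, 5])               # hypothetical IDs for 5 tokens
X = token_embed[token_ids] + pos_embed[:len(token_ids)]  # [sequence_length × hidden_size]
```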
Under the Hood

Explore Transformers in Depth

Positional encodings, layer normalization, residual connections, scaling laws, encoder vs decoder, and transformer variants.

10

Self-Attention

Self-attention is the core mechanism that makes transformers powerful. It lets every token in the sequence look at every other token to build context-aware representations.

Attention Computation Flow
Step 1

Each token's embedding is multiplied by three learned weight matrices to produce Query (Q), Key (K), and Value (V) vectors. Q represents "what am I looking for?", K represents "what do I contain?", and V represents "what information do I carry?"

Q = X · W_Q, K = X · W_K, V = X · W_V
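Single-head scaled dot-product attention in NumPy — random weights, no causal mask; a sketch of the math above, not a full transformer layer:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """softmax(Q·Kᵀ/√d_k)·V — each token mixes in the values of tokens it attends to."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # how much each token attends to each other token
    return softmax(scores) @ V        # weighted mix of value vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                         # 5 tokens, d_model = 8
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)
```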
Under the Hood

Explore Attention Mechanisms in Depth

Q/K/V math, attention heatmaps, causal masking, cross-attention, Flash Attention, GQA, and modern variants.

11

Softmax & Probabilities

After the transformer forward pass, the model outputs raw scores (logits) for every token in the vocabulary. Softmax converts these into a probability distribution. Temperature controls the shape.

Softmax Function
P(tokeni) = exp(zi / T) / Σ exp(zj / T)
z = logits  |  T = temperature  |  Output: probability distribution summing to 1

Low temperature (0.1–0.3): The distribution becomes extremely peaked. The highest-logit token dominates. Near-deterministic — great for factual Q&A, code generation.

Temperature = 1.0: The default. Probabilities reflect the model's learned distribution faithfully. Balanced between diversity and coherence.

High temperature (1.5–2.0): The distribution flattens. Lower-probability tokens get a real chance. More creative but risks incoherence.

Before vs After Softmax
Prompt: "The capital of France is"

RAW LOGITS → AFTER SOFTMAX (T=1)
Paris: 8.2 → 94.2%
London: 5.1 → 4.2%
Berlin: 3.8 → 1.2%
Tokyo: 2.4 → 0.3%
the: 1.1 → 0.1%

Top-p (nucleus) sampling picks from the smallest set of tokens whose cumulative probability exceeds a threshold (e.g., 0.9). Top-k keeps only the K most likely tokens. These are applied after temperature scaling and before the final random sample.
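A minimal top-p filter, assuming the probabilities have already been temperature-scaled (the 0.9 threshold and the example distribution are illustrative):

```python
import numpy as np

def top_p_filter(probs, p=0.9):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    order = np.argsort(probs)[::-1]          # token indices, most likely first
    cum = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cum, p) + 1]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()          # renormalize before the random sample

probs = np.array([0.942, 0.042, 0.012, 0.003, 0.001])
filtered = top_p_filter(probs, p=0.9)         # only the top token survives here
token = np.random.default_rng(0).choice(len(probs), p=filtered)
```

With this peaked distribution the first token alone already exceeds 0.9, so the nucleus contains a single candidate; a flatter distribution would keep several.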

12

Next-Token Prediction

This is the heart of what makes an LLM a next-token predictor in action. The model predicts one token, appends it, and runs another forward pass — repeating until done. Post-training (RLHF, DPO) shapes which tokens it prefers; chain-of-thought prompting and test-time compute improve how it reasons through each step.

Autoregressive Generation — Prefill + Decode
Full Prompt Tokens → Embedding Lookup → PREFILL: Parallel Forward Pass → populate KV-Cache → DECODE: Forward Pass → Logits (100K scores) → Softmax + Temperature → Top-k / Top-p → Sample → Safety Filter → Stop Token? No: Append + Update KV-Cache, loop back to decode · Yes: Detokenize → Stream

Autoregressive Generation — Live Demo
SEQUENCE: The · capital · of · France · is

Phase 1: Prefill (runs once)

Full prompt
Embedding lookup
Parallel forward pass
Populate KV-Cache

Compute-bound. All prompt tokens processed simultaneously via GPU parallelism. Determines time-to-first-token (TTFT).

Phase 2: Decode Loop (repeats per token)

Read KV-Cache
Forward pass (1 token)
Logits → Softmax
Sample token
Update KV-Cache
Memory-bandwidth-bound. Each token reads cached K,V — no recomputation. Repeats until STOP token or max_tokens.

KV-Cache: Why It Doesn't Re-compute Everything

Without optimization, generating token N would require computing attention over all N previous tokens — an O(n²) operation that grows quadratically. KV-cache stores the Key and Value matrices from all previous positions in GPU memory. Each new token only computes its own Q, then attends to the cached K and V. This reduces each step to O(n), making generation practical for long sequences.

Without KV-Cache
O(n²) per token

Token 500 reprocesses all 500 tokens. Token 501 reprocesses all 501. Compute grows quadratically.

With KV-Cache
O(n) per token

Token 501 computes only its own Q, reads cached K,V for tokens 1–500. One new row of attention.

Memory Cost

For a 70B-class model using grouped-query attention (80 layers, 8 KV heads, 128 dim/head) with 128K context: KV-cache ≈ 2 × 80 × 8 × 128 × 128K × 2 bytes ≈ 40GB VRAM
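The arithmetic, assuming grouped-query attention with 8 KV heads — a common 70B-class configuration, and an assumption here; caching all query heads instead would multiply the figure by the query/KV head ratio:

```python
# KV-cache size = 2 (K and V) × layers × kv_heads × head_dim × seq_len × bytes/value
layers, kv_heads, head_dim = 80, 8, 128
seq_len, bytes_fp16 = 128 * 1024, 2       # 128K context, fp16 values

kv_cache_bytes = 2 * layers * kv_heads * head_dim * seq_len * bytes_fp16
gb = kv_cache_bytes / 1024**3             # ≈ 40 GB of VRAM just for the cache
```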

13

The Inference Pipeline

Everything comes together here. This is the complete journey from the moment you press "Send" to the first token appearing on screen (time-to-first-token) and every token after.

Inference Architecture — Click Any Component
CLIENT: Browser / Chat UI ← SSE Stream
GATEWAY & PREP: API Gateway (POST /v1/chat) → Context Assembly (query → RAG / Vector DB → chunks) → Tokenizer (BPE) → token IDs
MODEL CORE: Embedding Table (vectors) → Prefill Engine (populate K,V) → KV-Cache → Decode Engine (cached K,V) → Logits → Softmax → Sampler (append token) · Batch Scheduler · GPU Cluster
OUTPUT & DELIVERY: Safety & Guardrails → Detokenizer → SSE event → client · Observability

Full Inference Pipeline with Timing
Request → Prefill → Decode → Output

Time-to-first-token (TTFT) is typically 200–500ms for a frontier model. After that, each subsequent token takes 10–30ms (roughly 30–100 tokens per second). A 500-token response at 50 tokens/sec takes about 10 seconds to fully stream — but feels instant because you see tokens arrive progressively.

Under the Hood

Explore the Inference Pipeline in Depth

Prefill vs decode, KV-cache, sampling strategies, continuous batching, speculative decoding, and GPU architecture.

14

Streaming & Delivery

LLM APIs don't wait for the full response — they stream tokens as they're generated using Server-Sent Events. This is how text appears progressively in ChatGPT and every other LLM interface.

Server-Sent Events — live demo of what you see in the browser as the assistant's reply streams in
Time-to-First-Token (TTFT)

200–500ms. The user sees the first word almost immediately while the model continues generating.

Tokens per Second (TPS)

30–100 TPS for frontier models. Small models can exceed 200 TPS on optimized hardware.
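A minimal sketch of the client side. The "data:" prefix is the SSE wire format; the "[DONE]" terminator is a convention used by some APIs (an assumption here, not part of the SSE standard):

```python
def parse_sse(stream_lines):
    """Minimal SSE consumer: yield the data payload of each event as it arrives."""
    for line in stream_lines:
        if line.startswith("data: "):
            payload = line[len("data: "):]
            if payload == "[DONE]":   # hypothetical end-of-stream marker
                return
            yield payload

# Simulated chunks as they would arrive over the wire:
chunks = ["data: The", "data:  capital", "data:  of", "data: [DONE]"]
text = "".join(parse_sse(chunks))     # tokens concatenate into the visible reply
```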

Building with LLMs?

I architect enterprise AI systems — RAG pipelines, multi-agent orchestration, custom copilots, and the infrastructure to run them reliably at scale.

Get In Touch