At its core, an LLM is a next-token predictor.
Modern assistants are built on top of that core and then shaped by post-training methods such as instruction tuning, RLHF, Constitutional AI, or DPO — and can be further improved at inference time through techniques like chain-of-thought prompting and additional test-time compute. This is the interactive, visual journey through how all of it works.
Before diving into how LLMs work, let's place them in context. LLMs sit at a specific point in a hierarchy of increasingly specialized AI techniques.
Frank Rosenblatt builds the perceptron, a single trainable artificial neuron that learns to classify inputs by adjusting its weights.
LLMs sit at a very specific intersection: they are transformer-based deep learning models applied to NLP, trained with massive compute. Deep learning itself is a cross-cutting technique — it powers vision, speech, and robotics too. Understanding where LLMs fit in this landscape helps you see what they are (and aren't) good at.
Neural networks are the foundation of deep learning. They are composed of simple units — neurons — connected in layers. Each neuron takes inputs, applies weights, sums them, and passes the result through an activation function.
inputs × weights → sum → activation function → output
Rectified Linear Unit. Zero for negatives, linear for positives. Simple, fast, and the default for most networks.
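In code, a neuron is only a few lines. A minimal sketch in pure Python, with toy weights chosen for illustration:

```python
def relu(x):
    # Rectified Linear Unit: zero for negatives, linear for positives.
    return max(0.0, x)

def neuron(inputs, weights, bias):
    # inputs × weights → sum → activation → output
    s = sum(i * w for i, w in zip(inputs, weights)) + bias
    return relu(s)

out = neuron([1.0, 2.0], [0.5, -0.25], 0.1)  # 0.5 - 0.5 + 0.1 = 0.1
```

A whole network is just many of these, with each layer's outputs feeding the next layer's inputs.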
Signals flow left to right. Each neuron applies weights + activation. The pattern of activations at the output layer is the network's prediction.
A single neuron is just multiply-add plus a nonlinearity. The power comes from stacking millions of them in layers — each layer learns increasingly abstract features. A transformer with 70B parameters is just this concept scaled to an extreme.
Perceptrons, backpropagation, loss landscapes, CNNs, RNNs — the full interactive journey through the building blocks of deep learning.
Neural networks operate on numbers, not words. An embedding maps each word (or token) to a point in a high-dimensional vector space. Words with similar meanings end up near each other. Drag to rotate, scroll to zoom.
Embeddings capture relationships as directions. The "royalty" direction from man→king is approximately the same as woman→queen.
Pick two words and see how similar their embeddings are. Cosine = 1 means identical direction, 0 means orthogonal.
Very similar — these words share strong semantic overlap.
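Cosine similarity is a short computation over dot products. A sketch with made-up 3-D vectors (not from a real model), chosen so the king − man + woman analogy works out exactly:

```python
import math

def cosine(a, b):
    # Cosine similarity: 1 = identical direction, 0 = orthogonal.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy 3-D "embeddings", purely illustrative.
king  = [0.9, 0.8, 0.1]
queen = [0.9, 0.2, 0.8]
man   = [0.1, 0.9, 0.1]
woman = [0.1, 0.3, 0.8]

# king - man + woman lands on queen (by construction here).
analogy = [k - m + w for k, m, w in zip(king, man, woman)]
cosine(analogy, queen)  # ≈ 1.0
```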
Real LLM embeddings use 4,096–12,288 dimensions, not 3; the 3D view here is a simplified projection. The cluster structure and arithmetic relationships live in the full high-dimensional space, which has far more room to encode distinct directions of meaning than any 3D view can show.
Word2Vec, contextual embeddings, cosine similarity, positional encodings, vector search, and RAG retrieval.
Neural networks learn by minimizing a loss function — a number that measures how wrong the model's predictions are. Gradient descent is the algorithm that adjusts weights to reduce this loss, step by step.
The ball follows the steepest downhill direction (negative gradient) toward the loss minimum. Each step = one weight update.
With too small a learning rate, the model learns, but painfully slowly. It might take 10× as long to converge, wasting compute and money.
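The update rule itself is tiny. A sketch minimizing the 1-D loss (w − 3)², with toy numbers, showing how a too-small learning rate slows convergence:

```python
def grad_descent(lr, steps=100, w=0.0):
    # Minimize loss(w) = (w - 3)^2; its gradient is 2 * (w - 3).
    for _ in range(steps):
        w -= lr * 2 * (w - 3)  # step in the negative gradient direction
    return w

w_good = grad_descent(lr=0.1)    # converges to ~3
w_slow = grad_descent(lr=0.001)  # still far from 3 after the same 100 steps
```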
During training, the model predicts a probability for the correct next token. Cross-entropy loss measures how far off that prediction is. Move the slider to see how the loss changes.
The entire pre-training objective is just cross-entropy loss applied to next-token prediction, summed over trillions of tokens. When the model assigns high probability to the correct next token, loss is low. Gradient descent nudges weights to make correct predictions more likely.
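In code, the per-token loss is just the negative log of the probability the model assigned to the correct next token:

```python
import math

def cross_entropy(p_correct):
    # Loss for one next-token prediction: -log P(correct token).
    return -math.log(p_correct)

cross_entropy(0.9)   # confident and right: small loss (~0.105)
cross_entropy(0.01)  # assigned 1% to the right token: large loss (~4.6)
```

Pre-training sums this quantity over every token in the corpus; gradient descent pushes it down.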
At their core, LLMs are probability machines. They model P(next token | all previous tokens). Understanding probability distributions, conditional probability, and entropy is essential to understanding how LLMs generate text.
LLMs generate text by chaining conditional probabilities: P(sequence) = P(w1) × P(w2|w1) × P(w3|w1,w2) × ...
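A sketch of that chain rule with made-up conditional probabilities:

```python
# P("the cat sat") from toy conditional probabilities (illustrative numbers).
p_w1 = 0.20              # P("the")
p_w2_given_w1 = 0.05     # P("cat" | "the")
p_w3_given_w1w2 = 0.30   # P("sat" | "the cat")

p_sequence = p_w1 * p_w2_given_w1 * p_w3_given_w1w2  # 0.003
```

Because these products shrink fast, real systems work with sums of log-probabilities instead.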
Adjust the distribution to see how entropy (uncertainty) and perplexity change. Lower entropy = the model is more confident.
Perplexity ≈ "how many equally likely choices." A perplexity of 1 means the model is perfectly certain. Frontier models are estimated to reach perplexity around 10–15 on typical text.
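Both quantities fall out of the distribution directly. A sketch in Python:

```python
import math

def entropy_bits(probs):
    # Shannon entropy in bits: expected surprise of the distribution.
    return -sum(p * math.log2(p) for p in probs if p > 0)

def perplexity(probs):
    # 2^entropy: the "effective number of equally likely choices."
    return 2 ** entropy_bits(probs)

perplexity([0.25, 0.25, 0.25, 0.25])  # 4 equal choices → 4.0
perplexity([1.0])                     # perfectly certain → 1.0
```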
Everything an LLM does — from writing code to answering questions — is just repeatedly sampling from a conditional probability distribution. The "intelligence" comes from the quality of that distribution, shaped by trillions of training tokens.
An LLM at its core is a next-token predictor. Modern assistants layer instruction tuning, RLHF, and inference-time techniques on top. When you send a message, this is the end-to-end journey — from your keystrokes to the streamed response.
The prefill phase processes the full prompt in parallel (compute-heavy, ~100–500ms). Then the decode loop generates tokens one at a time, each taking ~10–30ms thanks to KV-cache — no reprocessing of earlier tokens. A 500-word response involves ~700 decode iterations, streamed progressively through safety filters to your browser.
Before an LLM can answer a single question, it goes through an expensive, multi-stage training process. Here is what each stage does and why it matters.
Trillions of tokens are gathered from the public web (Common Crawl), digitized books (Books3, Gutenberg), open-source code (GitHub, The Stack), Wikipedia, scientific papers, and purpose-built curated datasets. Quality pipelines deduplicate content, filter toxic material, remove PII, and balance domain representation. The resulting corpus for a frontier model (GPT-4 class) is estimated at 10–13 trillion tokens.
Data from multiple sources is deduplicated, filtered for quality, and balanced across domains before tokenization.
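A minimal sketch of the deduplication step, using exact content hashes (real pipelines also do near-duplicate detection with techniques such as MinHash):

```python
import hashlib

def dedup(documents):
    # Keep the first copy of each document, drop exact duplicates.
    seen, kept = set(), []
    for doc in documents:
        h = hashlib.sha256(doc.encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            kept.append(doc)
    return kept

dedup(["a page", "a page", "another page"])  # ["a page", "another page"]
```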
Data pipelines, distributed training, RLHF, DPO, evaluation benchmarks, quantization, and deployment.
LLMs don't see words — they see tokens. A tokenizer (BPE / SentencePiece) splits text into sub-word units from a fixed vocabulary. This is the first transformation your query undergoes.
Initialize with 256 byte-level tokens
Find the most frequent adjacent token pair
Replace all occurrences with a new merged token
Continue until vocabulary reaches target size
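The steps above can be sketched in a few lines of Python, training on a toy corpus (character-level start instead of raw bytes, for readability):

```python
from collections import Counter

def most_frequent_pair(tokens):
    # Count every adjacent pair and return the most common one.
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0]

def merge(tokens, pair):
    # Replace every occurrence of `pair` with a single merged token.
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

tokens = list("low lower lowest")
for _ in range(3):                      # three merge steps
    tokens = merge(tokens, most_frequent_pair(tokens))
# "low" is now a single vocabulary token
```

A real tokenizer runs tens of thousands of merges over a huge corpus, then stores the merge table so the same splits can be replayed at inference time.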
Token count directly determines cost, context window usage, and billing. The word "indistinguishable" costs 4 tokens while "the" costs 1. Code and non-English text are often less efficiently tokenized — a key reason multilingual performance varies.
BPE step-by-step, SentencePiece, vocabulary tradeoffs, multilingual challenges, and tokenization artifacts.
The transformer is the architecture that makes next-token prediction work at scale. Each layer transforms its input through attention and feed-forward operations — and a GPT-4 class model stacks 96–120 of these layers deep. Post-training (instruction tuning, RLHF, DPO) shapes what the model does with these predictions; the transformer is how it makes them.
Each token ID is mapped to a high-dimensional vector (e.g., 12,288 dims for GPT-4 class). A positional encoding is added so the model knows token order. The result is a matrix of shape [sequence_length × hidden_size].
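A minimal sketch of this step, using the sinusoidal positional encodings from the original transformer paper (GPT-class models typically learn positional embeddings instead) and a made-up 4-dimensional embedding table:

```python
import math

def positional_encoding(seq_len, hidden):
    # Sinusoidal encodings: each position gets a unique pattern of waves.
    pe = [[0.0] * hidden for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, hidden, 2):
            angle = pos / (10000 ** (i / hidden))
            pe[pos][i] = math.sin(angle)
            if i + 1 < hidden:
                pe[pos][i + 1] = math.cos(angle)
    return pe

# Toy lookup table standing in for the learned embedding matrix.
embed = {7: [0.1, 0.2, 0.3, 0.4], 42: [0.5, 0.6, 0.7, 0.8]}
token_ids = [7, 42]
pe = positional_encoding(len(token_ids), 4)
x = [[e + p for e, p in zip(embed[t], pe[pos])]
     for pos, t in enumerate(token_ids)]   # shape [seq_len × hidden]
```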
Positional encodings, layer normalization, residual connections, scaling laws, encoder vs decoder, and transformer variants.
Self-attention is the core mechanism that makes transformers powerful. It lets every token in the sequence look at every other token to build context-aware representations.
Each token's embedding is multiplied by three learned weight matrices to produce Query (Q), Key (K), and Value (V) vectors. Q represents "what am I looking for?", K represents "what do I contain?", and V represents "what information do I carry?"
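Putting Q, K, and V together gives scaled dot-product attention. A single-head, unmasked sketch in pure Python with toy 2-D vectors:

```python
import math

def softmax(xs):
    # Numerically stable softmax: subtract the max before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    # Scaled dot-product attention for one head, no causal mask.
    d = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in K]            # how well each key matches the query
        weights = softmax(scores)        # attention weights sum to 1
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])  # weighted mix of values
    return out

Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
out = attention(Q, K, V)  # the query matches the first key, so the output leans toward V[0]
```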
Q/K/V math, attention heatmaps, causal masking, cross-attention, Flash Attention, GQA, and modern variants.
After the transformer forward pass, the model outputs raw scores (logits) for every token in the vocabulary. Softmax converts these into a probability distribution. Temperature controls the shape.
Low temperature (0.1–0.3): The distribution becomes extremely peaked. The highest-logit token dominates. Near-deterministic — great for factual Q&A, code generation.
Temperature = 1.0: The default. Probabilities reflect the model's learned distribution faithfully. Balanced between diversity and coherence.
High temperature (1.5–2.0): The distribution flattens. Lower-probability tokens get a real chance. More creative but risks incoherence.
Top-p (nucleus) sampling picks from the smallest set of tokens whose cumulative probability exceeds a threshold (e.g., 0.9). Top-k keeps only the K most likely tokens. These are applied after temperature scaling and before the final random sample.
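A sketch of temperature scaling followed by top-p filtering (toy logits; a real sampler would then draw randomly from the filtered distribution):

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    # Divide logits by temperature, then apply a stable softmax.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, p=0.9):
    # Keep the smallest set of tokens whose cumulative probability >= p,
    # then renormalize so the kept probabilities sum to 1.
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

logits = [4.0, 2.0, 1.0, 0.5]
probs = softmax_with_temperature(logits, temperature=0.5)  # low T sharpens the peak
nucleus = top_p_filter(probs, p=0.9)
```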
This is the heart of what makes an LLM a next-token predictor in action. The model predicts one token, appends it, and runs another forward pass — repeating until done. Post-training (RLHF, DPO) shapes which tokens it prefers; chain-of-thought prompting and test-time compute improve how it reasons through each step.
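The loop itself is short. A sketch where a canned stand-in replaces the real forward pass (the `next_token_probs` function below is hypothetical, not a real model API):

```python
REPLY = ["Hello", ",", " world", "<eos>"]

def next_token_probs(tokens):
    # A real model returns a distribution over ~100k vocabulary entries;
    # this toy puts all probability on the next canned token.
    return {REPLY[len(tokens)]: 1.0}

def generate(max_tokens=16):
    tokens = []
    for _ in range(max_tokens):
        probs = next_token_probs(tokens)   # one forward pass per token
        token = max(probs, key=probs.get)  # greedy sampling for simplicity
        if token == "<eos>":               # stop at the end-of-sequence token
            break
        tokens.append(token)               # append, then loop again
    return "".join(tokens)

generate()  # "Hello, world"
```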
Click any node in the chart to see details
Compute-bound. All prompt tokens processed simultaneously via GPU parallelism. Determines time-to-first-token (TTFT).
Without optimization, generating token N would require computing attention over all N previous tokens — an O(n²) operation that grows quadratically. KV-cache stores the Key and Value matrices from all previous positions in GPU memory. Each new token only computes its own Q, then attends to the cached K and V. This reduces each step to O(n), making generation practical for long sequences.
Token 500 reprocesses all 500 tokens. Token 501 reprocesses all 501. Compute grows quadratically.
Token 501 computes only its own Q, reads cached K,V for tokens 1–500. One new row of attention.
For a 70B-class model (80 layers, head dim 128, grouped-query attention with 8 KV heads) at 128K context: KV-cache ≈ 2 × 80 × 8 × 128 × 128K × 2 bytes ≈ 40GB VRAM. With full multi-head attention (128 KV heads) the same cache would be 16× larger.
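The arithmetic as a helper function, assuming grouped-query attention with 8 KV heads (the design choice that keeps a 70B model's cache near 40GB rather than hundreds of GB):

```python
def kv_cache_bytes(layers, kv_heads, head_dim, context, bytes_per_value=2):
    # The leading 2 accounts for storing both K and V at every layer
    # and position; bytes_per_value=2 assumes fp16/bf16.
    return 2 * layers * kv_heads * head_dim * context * bytes_per_value

size = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, context=128 * 1024)
size / 2**30  # 40.0 GiB
```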
Everything comes together here. This is the complete journey from the moment you press "Send" to the first token appearing on screen (time-to-first-token) and every token after.
Click any component to explore the architecture
Time-to-first-token (TTFT) is typically 200–500ms for a frontier model. After that, each subsequent token takes 10–30ms, i.e. roughly 30–100 tokens per second. A 500-token response at 50 tokens/sec takes about 10 seconds to fully stream, but feels instant because you see tokens arrive progressively.
Prefill vs decode, KV-cache, sampling strategies, continuous batching, speculative decoding, and GPU architecture.
LLM APIs don't wait for the full response — they stream tokens as they're generated using Server-Sent Events. This is how text appears progressively in ChatGPT and every other LLM interface.
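The wire format is simple: each event is a `data:` line, and events are separated by blank lines. A minimal parser sketch (plain-text payloads for illustration; real APIs send JSON payloads, and the `[DONE]` sentinel is an OpenAI-style convention, not part of the SSE standard):

```python
def parse_sse(stream_text):
    # Collect the payload of every "data: " line until the end sentinel.
    tokens = []
    for line in stream_text.splitlines():
        if line.startswith("data: "):
            payload = line[len("data: "):]
            if payload == "[DONE]":   # end-of-stream marker used by some APIs
                break
            tokens.append(payload)
    return tokens

raw = "data: Hel\n\ndata: lo\n\ndata: [DONE]\n\n"
parse_sse(raw)  # ["Hel", "lo"]
```

A streaming client renders each payload as it arrives, which is why text appears word by word.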
200–500ms. The user sees the first word almost immediately while the model continues generating.
30–100 TPS for frontier models. Small models can exceed 200 TPS on optimized hardware.