UNDER THE HOOD

Transformers

The architecture behind every modern AI system. From positional encodings and multi-head attention to scaling laws and encoder-decoder variants — an interactive deep dive into how transformers work.

01

Why Transformers?

Before transformers, recurrent neural networks (RNNs) processed language one token at a time — each step depending on the previous hidden state. This sequential bottleneck meant training was slow and long-range dependencies were lost. The transformer solved this with a radical idea: process all tokens in parallel.

RNN — Sequential Processing
Animated demo: "The cat sat on the mat" is consumed one token at a time ("Processing token 1 of 6 …"), taking O(n) sequential steps.
Key Differences
  • Processing: RNNs work sequentially, token by token; transformers process all tokens at once.
  • Long-range dependencies: RNN signals degrade over distance (vanishing gradients); transformers attend directly between any token pair.
  • Training speed: RNNs are slow because time steps can't be parallelized; transformers exploit GPU parallelism over the full sequence.
  • Scalability: RNNs are hard to scale beyond ~100M parameters; transformers scale to trillions.
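The contrast is easy to see in code. A minimal numpy sketch with illustrative shapes and random values (not a real model): the RNN update must loop over time steps, while attention relates every pair of positions in a single matrix product.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 6, 8
x = rng.standard_normal((seq_len, d))    # six token vectors, e.g. "The cat sat on the mat"

# RNN-style: each hidden state depends on the previous one -> O(n) sequential steps.
W = rng.standard_normal((d, d)) * 0.1
h = np.zeros(d)
for t in range(seq_len):                 # this loop cannot be parallelized across t
    h = np.tanh(W @ h + x[t])

# Attention-style: one matrix product scores all token pairs simultaneously.
scores = x @ x.T                         # (seq_len, seq_len), computed in parallel
print(scores.shape)                      # (6, 6)
```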

Analogy — Orchestra Conductor: An RNN is like a musician reading sheet music one note at a time, passing the melody to the next player. A transformer is like an orchestra conductor who sees the entire score at once — every instrument (token) can see and harmonize with every other instrument simultaneously. The conductor (attention) decides which sections are most relevant to the current passage, allowing complex harmonies that sequential reading could never capture.

"Attention Is All You Need" (Vaswani et al., 2017) showed that self-attention alone — without any recurrence or convolutions — could achieve state-of-the-art translation quality. This single paper spawned GPT, BERT, and every modern LLM. The key unlock was parallelism: transformers turned training from a sequential bottleneck into a massively parallel matrix multiplication that GPUs excel at.

02

The Architecture Overview

A transformer is a stack of identical blocks, each containing the same five operations. GPT-2 uses 12 blocks, GPT-3 uses 96, and LLaMA 2 70B uses 80. The architecture is remarkably simple — its power comes from scale and the attention mechanism.

Transformer Block Stack
Diagram: Input Embeddings → Block 1 → Block 2 → Block 3 → Block 4 → Block 5 → Block 6 → Output Logits.
Layer Norm

Normalizes activations across the feature dimension before each sublayer. Stabilizes training by ensuring consistent input magnitudes, enabling deeper networks and higher learning rates.

Block diagram: Layer Norm → Multi-Head Attention → Add & Norm → Layer Norm → Feed-Forward Network → Add & Norm.
Animated — Data Flowing Through a Transformer Block: Input (token embeddings) → LayerNorm (normalize features) → Attention (Q·Kᵀ → softmax → V) → Add & Norm (residual + normalize) → FFN (expand → GELU → project) → Add & Norm (residual + output).

The entire transformer block is applied identically N times — only the learned weights differ between blocks. Earlier blocks tend to learn syntactic patterns (grammar, word order), while later blocks learn semantic relationships (meaning, reasoning). This emergent specialization happens purely from training — it's not designed in.
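The five operations fit in a few lines of numpy. Below is a minimal sketch of one pre-norm block with random stand-in weights, a single attention head, and a ReLU in place of GELU (real blocks use multi-head attention, GELU, and learned norm parameters):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # normalize each token vector across the feature dimension
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(z):
    z = z - z.max(-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

def block(x, Wq, Wk, Wv, Wo, W1, W2):
    # sublayer 1: pre-norm -> self-attention -> residual add
    h = layer_norm(x)
    q, k, v = h @ Wq, h @ Wk, h @ Wv
    att = softmax(q @ k.T / np.sqrt(k.shape[-1])) @ v
    x = x + att @ Wo
    # sublayer 2: pre-norm -> feed-forward (expand -> nonlinearity -> project) -> residual add
    h = layer_norm(x)
    x = x + np.maximum(0.0, h @ W1) @ W2   # ReLU stand-in for GELU
    return x

rng = np.random.default_rng(0)
d, n = 16, 5
x = rng.standard_normal((n, d))
Ws = [rng.standard_normal(s) * 0.1 for s in
      [(d, d), (d, d), (d, d), (d, d), (d, 4 * d), (4 * d, d)]]
out = block(x, *Ws)
print(out.shape)   # (5, 16): same shape in and out, which is why blocks stack
```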

03

Input Embeddings

Before a transformer can process text, each token must become a vector. The token embedding captures semantic meaning (what the word means), while the positional encoding captures position (where it appears). These two vectors are combined to produce the input to the first transformer block.

Token → Embedding → Position → Input
Sentence: "The cat sat"
Tokens & IDs: The (ID 464), cat (ID 2368), sat (ID 3290)
Embedding + position vectors (d_model = 4):
  The → [0.12, 0.55, 0.78, 1.33]
  cat → [0.17, 0.77, 0.56, 0.88]
  sat → [1.35, -1.23, 0.11, 1.67]
Positional Encoding Schemes

Fixed sin/cos functions at different frequencies. Position i, dimension d: PE(i,2d) = sin(i/10000^(2d/d_model)), PE(i,2d+1) = cos(i/10000^(2d/d_model)). Used in the original transformer — no learned parameters.

Sinusoidal encodings for d_model = 4:
  Position 0 ("The"): [0.00, 1.00, 0.00, 1.00]
  Position 1 ("cat"): [0.84, 0.54, 0.01, 1.00]
  Position 2 ("sat"): [0.91, -0.42, 0.02, 1.00]
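The sinusoidal scheme takes only a few lines to implement. This sketch reproduces the d_model = 4 values shown above:

```python
import numpy as np

def positional_encoding(n_pos, d_model):
    # PE(i, 2d) = sin(i / 10000^(2d/d_model)), PE(i, 2d+1) = cos(...)
    pe = np.zeros((n_pos, d_model))
    pos = np.arange(n_pos)[:, None]
    div = 10000 ** (np.arange(0, d_model, 2) / d_model)
    pe[:, 0::2] = np.sin(pos / div)
    pe[:, 1::2] = np.cos(pos / div)
    return pe

pe = positional_encoding(3, 4)
print(pe.round(2))
# Rows match the table above: position 1 begins sin(1) ≈ 0.84, cos(1) ≈ 0.54.
```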

Analogy — Library Index System: Token embeddings are like a book's content — they tell you what the book is about. Positional encodings are like the shelf number and catalog index — they tell you where it belongs. Without the index, a library is just a pile of books. Without positional encodings, a transformer is just a bag of words with no sense of order.

Positional encoding is essential because self-attention is permutation-invariant — without it, "the cat sat on the mat" and "mat the on sat cat the" would produce identical representations. RoPE has become the dominant scheme because it encodes relative position through the attention dot product, enabling better length generalization.

04

Layer Normalization

Layer normalization rescales activations to have zero mean and unit variance across the feature dimension. This seemingly simple operation is critical — without it, activations can explode or vanish as data passes through dozens of transformer blocks, making training unstable or impossible.

Before → After Normalization
Before (raw activations): [3.2, -1.5, 0.8, 4.1, -0.3, 2.7, -2.1, 1.4] (mean = 1.04, var = 4.38)
After (normalized): [1.03, -1.21, -0.11, 1.46, -0.64, 0.79, -1.50, 0.17] (mean ≈ 0.00, var ≈ 1.00)
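A minimal implementation (the learned gain and bias parameters are omitted for clarity) reproduces the example above:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # subtract the mean, divide by the standard deviation (eps avoids /0)
    return (x - x.mean()) / np.sqrt(x.var() + eps)

x = np.array([3.2, -1.5, 0.8, 4.1, -0.3, 2.7, -2.1, 1.4])
y = layer_norm(x)
print(y.round(2))   # matches the "After" row above; mean ≈ 0, variance ≈ 1
```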
Pre-Norm (GPT-2, LLaMA)
  x' = LayerNorm(x)
  y = x + Attention(x')
  y' = LayerNorm(y)
  out = y + FFN(y')

Pre-norm places normalization before each sublayer. This produces more stable gradients during training, enabling higher learning rates and faster convergence. Nearly all modern LLMs use pre-norm.

The switch from post-norm to pre-norm was one of the most impactful yet least discussed changes in transformer history. GPT-2 adopted pre-norm and found it dramatically improved training stability — enabling the scaling push that led to GPT-3, LLaMA, and all modern LLMs. RMSNorm (a simplified variant) is now standard in LLaMA and Mistral.

05

Multi-Head Attention

Multi-head attention is the core mechanism that gives transformers their power. Instead of computing a single attention function, the model runs multiple attention heads in parallel, each learning to focus on different types of relationships — syntax, semantics, position, and more.

Multi-Head Attention Flow: Input → Q, K, V projections → Split Heads → Attention (per head) → Concat → Linear.
How it works
Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V

The dot product between Q and K measures how relevant each token is to every other token. Dividing by √d_k prevents the softmax from saturating when dimensions are large.

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) · W_O

Each head operates on a d_model/h dimensional subspace. With 8 heads and d_model=512, each head works in 64 dimensions — cheap individually but powerful together.
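The full pipeline can be sketched in numpy with random stand-in weights: project, split into heads, attend in each subspace, concatenate, and project back.

```python
import numpy as np

def softmax(z):
    z = z - z.max(-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads):
    n, d = x.shape
    d_head = d // n_heads                                 # each head gets d_model/h dims
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    split = lambda t: t.reshape(n, n_heads, d_head).transpose(1, 0, 2)
    q, k, v = split(q), split(k), split(v)                # (h, n, d_head) each
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # (h, n, n) per-head scores
    heads = softmax(scores) @ v                           # (h, n, d_head) head outputs
    concat = heads.transpose(1, 0, 2).reshape(n, d)       # merge heads back to (n, d)
    return concat @ Wo                                    # final linear projection

rng = np.random.default_rng(0)
n, d, h = 5, 16, 4
x = rng.standard_normal((n, d))
Wq, Wk, Wv, Wo = (rng.standard_normal((d, d)) * 0.1 for _ in range(4))
out = multi_head_attention(x, Wq, Wk, Wv, Wo, h)
print(out.shape)   # (5, 16)
```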

Analogy — Team of Specialists: Multi-head attention is like a consulting team analyzing a document. One specialist focuses on grammar (subject-verb agreement), another on meaning (who does “it” refer to?), a third on structure (is this a question or a statement?), and a fourth on long-range context (what was discussed three paragraphs ago?). Each specialist writes a report (head output), and they're all merged into one comprehensive analysis (concatenation + projection).

In GPT-3's 175B parameter model, each of the 96 layers has 96 attention heads, each working in 128 dimensions. That's 9,216 heads total, each independently learning what to attend to. Research has shown that different heads specialize: some track syntax, some resolve pronouns, some handle long-range dependencies — and you can even identify individual "induction heads" responsible for in-context learning.

06

Feed-Forward Network

After attention gathers context from other tokens, the feed-forward network processes each token independently. It consists of two linear transformations with a GELU activation between them, expanding to 4× the model dimension before projecting back. This is where the model stores and retrieves factual knowledge.

FFN(x) = W₂ · GELU(W₁ · x + b₁) + b₂
  Input (d = 4): [0.8, -0.3, 0.5, -0.7]
  → Hidden (d = 16, a 4× expansion), GELU applied elementwise
  → Output (d = 4): [2.4, -0.0, 1.7, 0.2]

The FFN accounts for roughly 2/3 of a transformer's parameters. In GPT-3 (d_model=12288), the FFN expands to 49,152 dimensions — each layer's FFN alone has ~1.2 billion parameters. Research suggests these layers function as a massive key-value memory: the first layer's weights encode patterns (keys) and the second layer stores the associated knowledge (values).
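The whole sublayer is just two matrix multiplications. A sketch with random stand-in weights and the common tanh approximation of GELU:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU, as used in GPT-2
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def ffn(x, W1, b1, W2, b2):
    # expand to 4x the model dimension, apply GELU, project back
    return gelu(x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d, n = 4, 3
W1, b1 = rng.standard_normal((d, 4 * d)), np.zeros(4 * d)
W2, b2 = rng.standard_normal((4 * d, d)), np.zeros(d)
out = ffn(rng.standard_normal((n, d)), W1, b1, W2, b2)
print(out.shape)   # (3, 4): each token is processed independently
```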

07

Residual Connections

Residual connections (skip connections) add the input of a sublayer directly to its output: output = x + Sublayer(x). This creates a "gradient highway" that lets gradients flow unimpeded through dozens of layers, enabling the extreme depth that modern transformers require.
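The effect can be checked numerically. For y = x + f(x), the Jacobian is I + f'(x), so even when the sublayer's own gradient nearly vanishes, the identity path keeps the overall gradient healthy. A finite-difference sketch with illustrative values:

```python
import numpy as np

def f(x, W):                               # a sublayer with deliberately tiny weights
    return np.tanh(x @ W)

rng = np.random.default_rng(0)
d = 4
W = rng.standard_normal((d, d)) * 1e-3     # near-vanishing sublayer gradient
x = rng.standard_normal(d)

eps = 1e-6
jac = np.zeros((d, d))
for i in range(d):                         # finite-difference Jacobian of y = x + f(x)
    e = np.zeros(d); e[i] = eps
    jac[:, i] = ((x + e + f(x + e, W)) - (x + f(x, W))) / eps

print(np.round(jac, 2))                    # ~identity matrix: the skip path dominates
```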

Skip Connection Visualization
Input → Attention → x + Sublayer(x) (skip path around attention) → FFN → x + Sublayer(x) (skip path around FFN) → Output
Gradient Magnitude by Layer

With residual connections, gradients maintain healthy magnitudes across all layers. The skip path provides an unimpeded route for gradient flow, regardless of depth.

Illustrative gradient magnitudes (layer 1 → layer 8): 1.03, 0.91, 0.79, 0.72, 0.69, 0.65, 0.56, 0.44.

Analogy — Building with Scaffolding: Residual connections are like the scaffolding on a tall building under construction. Each floor (layer) does its work — adding walls, wiring, plumbing (attention, FFN) — but the scaffolding (skip connection) ensures that workers and materials (gradients) can reach any floor directly, without needing to climb through every floor below. Remove the scaffolding from a 96-story building and nothing above the 10th floor gets built.

He et al. (2015) introduced residual connections for image classification with ResNet, enabling 152-layer networks for the first time. The transformer adopted this idea and pushed it further — LLaMA 2 70B has 80 blocks, each with two residual connections, creating 160 skip paths. Without them, vanishing gradients would make a network this deep practically untrainable.

08

Scaling Laws

Transformer performance follows remarkably predictable power laws: loss decreases as a smooth function of parameters, data, and compute. The Chinchilla paper showed that many models were undertrained — given a compute budget, there's an optimal balance between model size and training tokens.

Parameters vs Loss
Plot: loss versus parameter count on a log scale (10^9 to 10^13), with the Chinchilla-optimal frontier and reference points for GPT-2, GPT-3, Chinchilla, LLaMA 70B, and GPT-4 (est.).
Compute Budget Explorer
Interactive: slide a training budget from 10^20 FLOPs (small) to 10^26 (frontier) to see the compute-optimal model size and training-token count.
Key Reference Points
  • GPT-3: 175B params, 300B tokens
  • Chinchilla: 70B params, 1.4T tokens
  • LLaMA 2: 70B params, 2T tokens
  • GPT-4 (est.): ~1.8T params, ~13T tokens

Chinchilla (Hoffmann et al., 2022) changed how labs train models. GPT-3 used 175B parameters but only 300B training tokens — Chinchilla showed that a 70B model trained on 1.4T tokens outperforms it at the same compute cost. The lesson: don't just make models bigger — train them on more data. LLaMA took this further, training even longer to optimize for inference cost rather than training compute.
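The back-of-envelope version of the result rests on two rules of thumb: training cost C ≈ 6·N·D FLOPs, and a compute-optimal token count D ≈ 20·N. The paper fits these relationships empirically; the sketch below is only the rounded heuristic.

```python
# Chinchilla-style heuristic (assumptions: C ≈ 6·N·D training FLOPs,
# compute-optimal D ≈ 20·N; both are rough fits, not exact laws).
def chinchilla_optimal(flops):
    n_params = (flops / (6 * 20)) ** 0.5   # solve C = 6·N·(20·N) for N
    n_tokens = 20 * n_params
    return n_params, n_tokens

n, d = chinchilla_optimal(5.76e23)         # roughly Chinchilla's training budget
print(f"{n / 1e9:.0f}B params, {d / 1e12:.1f}T tokens")   # 69B params, 1.4T tokens
```

The heuristic lands close to the paper's actual recipe (70B parameters, 1.4T tokens).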

09

Encoder vs Decoder

Not all transformers are the same. The three major architectural variants — encoder-only, decoder-only, and encoder-decoder — differ in how they mask attention, and this determines what tasks they excel at. The decoder-only architecture won the LLM era because generation is the hardest and most general capability.

BERT — Encoder-Only

BERT sees all tokens simultaneously — every token can attend to every other token, in both directions. This makes it excellent at understanding tasks where you have the full input available, like classification, named entity recognition, and semantic search.

Attention Type: Bidirectional (full)
Typical Use Cases
  • Text classification
  • Named entity recognition
  • Semantic search / embeddings
  • Question answering (extractive)
Attention Mask Pattern
A 5×5 grid over "The cat sat on the" with every cell active: every token attends to every other token (bidirectional).

Analogy — Reading vs Writing: An encoder (BERT) is like a reader who sees the entire page at once — great for comprehension and search, but can't write new text. A decoder (GPT) is like a writer who crafts one word at a time, never peeking ahead — this constraint forces it to truly understand language in order to predict what comes next. An encoder-decoder (T5) is like a translator: read the full source text, then write the translation word by word.

The decoder-only architecture (GPT) has become dominant not because it's theoretically superior, but because generation subsumes understanding — a model that can generate coherent text necessarily understands it. BERT-style encoders still excel at embedding and retrieval tasks (search, RAG), while encoder-decoder models remain strong for structured transformations like translation.
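Structurally, the only difference between the two attention patterns is the mask. A sketch:

```python
import numpy as np

n = 5  # "The cat sat on the"
# Encoder (BERT-style): full attention, every token sees every other.
bidirectional = np.ones((n, n), dtype=bool)
# Decoder (GPT-style): causal mask, token i may only attend to positions <= i.
causal = np.tril(np.ones((n, n), dtype=bool))

print(causal.astype(int))   # lower-triangular: row i has ones up to column i
```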

10

Transformer Variants

The transformer architecture has escaped the domain of text. The same attention mechanism that powers GPT-4 now processes images, audio, protein sequences, and video — strong evidence that self-attention is a general-purpose computational primitive for sequential and structured data.

Vision Transformer (ViT)
Computer Vision

Splits an image into fixed-size patches (e.g. 16×16 pixels), flattens each patch into a vector, and processes the sequence of patch embeddings through a standard transformer encoder. Proved that pure attention — without any convolutions — achieves state-of-the-art image classification when trained on enough data.

Key Innovation
Treats image patches as tokens
Notable Models
ViT, DeiT, CLIP, SAM
Architecture Pattern
Image → 16×16 Patches → Patch Embeddings
+ Positional Encoding
Transformer Block 1
Transformer Block 2
Transformer Block 3 ... ×N
Class Token → Image Classification
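The patch-embedding step is essentially a reshape. A sketch of patchify for a standard 224×224 RGB input (the learned linear projection to d_model is omitted):

```python
import numpy as np

def patchify(img, patch=16):
    # split an (H, W, C) image into non-overlapping patch×patch tiles
    # and flatten each tile into one vector: the ViT's "tokens"
    h, w, c = img.shape
    img = img.reshape(h // patch, patch, w // patch, patch, c)
    return img.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)

img = np.zeros((224, 224, 3))   # a standard 224×224 RGB input
patches = patchify(img)
print(patches.shape)            # (196, 768): 14×14 patches, each a 768-dim vector
```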

The transformer's universality is its most remarkable property. The same architecture that predicts the next word also folds proteins, generates images, transcribes speech, and plays games. This suggests that self-attention isn't just good at language — it's a fundamental computational primitive for learning patterns in any structured data.