Why Transformers?
Before transformers, recurrent neural networks (RNNs) processed language one token at a time — each step depending on the previous hidden state. This sequential bottleneck meant training was slow and long-range dependencies were lost. The transformer solved this with a radical idea: process all tokens in parallel.
Analogy — Orchestra Conductor: An RNN is like a musician reading sheet music one note at a time, passing the melody to the next player. A transformer is like an orchestra conductor who sees the entire score at once — every instrument (token) can see and harmonize with every other instrument simultaneously. The conductor (attention) decides which sections are most relevant to the current passage, allowing complex harmonies that sequential reading could never capture.
"Attention Is All You Need" (Vaswani et al., 2017) showed that self-attention alone — without any recurrence or convolutions — could achieve state-of-the-art translation quality. This single paper spawned GPT, BERT, and every modern LLM. The key unlock was parallelism: transformers turned training from a sequential bottleneck into a massively parallel matrix multiplication that GPUs excel at.
The Architecture Overview
A transformer is a stack of identical blocks, each containing the same sequence of operations: layer normalization, multi-head attention, a feed-forward network, and residual connections around each sublayer. GPT-2 uses 12 blocks, GPT-3 uses 96, and LLaMA 2 70B uses 80. The architecture is remarkably simple — its power comes from scale and the attention mechanism.
Layer normalization normalizes activations across the feature dimension before each sublayer. It stabilizes training by ensuring consistent input magnitudes, enabling deeper networks and higher learning rates.
The entire transformer block is applied identically N times — only the learned weights differ between blocks. Earlier blocks tend to learn syntactic patterns (grammar, word order), while later blocks learn semantic relationships (meaning, reasoning). This emergent specialization happens purely from training — it's not designed in.
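The repetition described above can be sketched in NumPy. This is a toy block, not any particular model: single-head attention and ReLU stand in for multi-head attention and GELU, and all sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, seq, n_blocks = 64, 10, 3   # toy sizes; real models are far larger

def norm(x):
    # LayerNorm without learned scale/shift, for brevity
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + 1e-5)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def attn(x, W):
    # Single-head self-attention, for brevity
    Q, K, V = x @ W["q"], x @ W["k"], x @ W["v"]
    return softmax(Q @ K.T / np.sqrt(d)) @ V @ W["o"]

def ffn(x, W):
    # ReLU stands in for GELU here
    return np.maximum(0, x @ W["up"]) @ W["down"]

def block(x, W):
    # Pre-norm residual pattern: x + Sublayer(norm(x)), twice per block
    x = x + attn(norm(x), W)
    x = x + ffn(norm(x), W)
    return x

# The identical block is applied N times -- only the weights differ per block.
shapes = [("q", (d, d)), ("k", (d, d)), ("v", (d, d)),
          ("o", (d, d)), ("up", (d, 4 * d)), ("down", (4 * d, d))]
weights = [{k: rng.standard_normal(s) * 0.02 for k, s in shapes}
           for _ in range(n_blocks)]

x = rng.standard_normal((seq, d))
for W in weights:
    x = block(x, W)
assert x.shape == (seq, d)   # shape is preserved through every block
```

Because every block maps a `(seq, d)` array to a `(seq, d)` array, stacking more of them is purely a matter of adding weights, which is what makes depth scaling so mechanical.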
Input Embeddings
Before a transformer can process text, each token must become a vector. The token embedding captures semantic meaning (what the word means), while the positional encoding captures position (where it appears). These two vectors are combined to produce the input to the first transformer block.
Sinusoidal encoding uses fixed sin/cos functions at different frequencies. For position pos and pair index i: PE(pos, 2i) = sin(pos/10000^(2i/d_model)), PE(pos, 2i+1) = cos(pos/10000^(2i/d_model)). Used in the original transformer — no learned parameters.
Analogy — Library Index System: Token embeddings are like a book's content — they tell you what the book is about. Positional encodings are like the shelf number and catalog index — they tell you where it belongs. Without the index, a library is just a pile of books. Without positional encodings, a transformer is just a bag of words with no sense of order.
Positional encoding is essential because self-attention is permutation-invariant — without it, "the cat sat on the mat" and "mat the on sat cat the" would produce identical representations. RoPE (rotary position embedding) has become the dominant scheme because it encodes relative position through the attention dot product, enabling better length generalization.
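A minimal NumPy sketch of the sinusoidal scheme (sizes are illustrative):

```python
import numpy as np

def sinusoidal_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Fixed sin/cos positional encodings, shape (seq_len, d_model)."""
    pos = np.arange(seq_len)[:, None]        # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]     # (1, d_model/2): pair index
    freq = 1.0 / 10000 ** (2 * i / d_model)  # one frequency per sin/cos pair
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(pos * freq)         # even dimensions: sine
    pe[:, 1::2] = np.cos(pos * freq)         # odd dimensions: cosine
    return pe

pe = sinusoidal_encoding(seq_len=50, d_model=64)
# Position 0 gives sin(0)=0 in even dims, cos(0)=1 in odd dims:
assert np.allclose(pe[0, 0::2], 0.0) and np.allclose(pe[0, 1::2], 1.0)
```

In the original transformer this matrix is simply added to the token embeddings, so each input vector carries both what the token means and where it sits.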
Layer Normalization
Layer normalization rescales activations to have zero mean and unit variance across the feature dimension. This seemingly simple operation is critical — without it, activations can explode or vanish as data passes through dozens of transformer blocks, making training unstable or impossible.
Pre-norm places normalization before each sublayer. This produces more stable gradients during training, enabling higher learning rates and faster convergence. Nearly all modern LLMs use pre-norm.
The switch from post-norm to pre-norm was one of the most impactful yet least discussed changes in transformer history. GPT-2 adopted pre-norm and found it dramatically improved training stability — enabling the scaling push that led to GPT-3, LLaMA, and all modern LLMs. RMSNorm (a simplified variant) is now standard in LLaMA and Mistral.
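Both normalizations mentioned above can be sketched in a few lines of NumPy (the `eps` value is a typical default, not a prescription):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize across the feature (last) dimension: zero mean, unit variance,
    # then apply learned scale (gamma) and shift (beta).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def rms_norm(x, gamma, eps=1e-5):
    # RMSNorm (LLaMA, Mistral): skip mean subtraction, rescale by the
    # root-mean-square of the features. Cheaper, works just as well.
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    return gamma * x / rms

x = np.random.randn(2, 8) * 5 + 3            # deliberately wild activations
d = x.shape[-1]
y = layer_norm(x, np.ones(d), np.zeros(d))
assert np.allclose(y.mean(-1), 0, atol=1e-6)  # zero mean per token
assert np.allclose(y.var(-1), 1, atol=1e-3)   # ~unit variance per token
```

The pre-norm arrangement simply calls one of these on the input *before* the attention or FFN sublayer, with the residual added afterwards.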
Multi-Head Attention
Multi-head attention is the core mechanism that gives transformers their power. Instead of computing a single attention function, the model runs multiple attention heads in parallel, each learning to focus on different types of relationships — syntax, semantics, position, and more.
The dot product between Q and K measures how relevant each token is to every other token. Dividing by √d_k prevents the softmax from saturating when dimensions are large.
Each head operates on a d_model/h dimensional subspace. With 8 heads and d_model=512, each head works in 64 dimensions — cheap individually but powerful together.
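A NumPy sketch of scaled dot-product attention and the head split described above (the weight shapes and 0.02 init scale are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product: softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)
    return softmax(scores) @ V

def multi_head(x, Wq, Wk, Wv, Wo, h):
    # Project, split d_model into h subspaces, attend per head, merge.
    seq, d_model = x.shape
    d_h = d_model // h
    def split(W):  # (seq, d_model) -> (h, seq, d_h)
        return (x @ W).reshape(seq, h, d_h).transpose(1, 0, 2)
    heads = attention(split(Wq), split(Wk), split(Wv))     # (h, seq, d_h)
    concat = heads.transpose(1, 0, 2).reshape(seq, d_model)
    return concat @ Wo

rng = np.random.default_rng(0)
d, h, seq = 512, 8, 10
x = rng.standard_normal((seq, d))
Ws = [rng.standard_normal((d, d)) * 0.02 for _ in range(4)]
out = multi_head(x, *Ws, h=h)
assert out.shape == (seq, d)   # each head worked in 512/8 = 64 dimensions
```

Note that the head split is just a reshape: the heads cost no more compute than one full-width attention, which is why running 8 (or 96) of them in parallel is cheap.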
Analogy — Team of Specialists: Multi-head attention is like a consulting team analyzing a document. One specialist focuses on grammar (subject-verb agreement), another on meaning (who does “it” refer to?), a third on structure (is this a question or a statement?), and a fourth on long-range context (what was discussed three paragraphs ago?). Each specialist writes a report (head output), and they're all merged into one comprehensive analysis (concatenation + projection).
In GPT-3's 175B parameter model, each of the 96 layers has 96 attention heads, each working in 128 dimensions. That's 9,216 heads total, each independently learning what to attend to. Research has shown that different heads specialize: some track syntax, some resolve pronouns, some handle long-range dependencies — and you can even identify individual "induction heads" responsible for in-context learning.
Feed-Forward Network
After attention gathers context from other tokens, the feed-forward network processes each token independently. It consists of two linear transformations with a GELU activation between them, expanding to 4× the model dimension before projecting back. This is where the model stores and retrieves factual knowledge.
The FFN accounts for roughly 2/3 of a transformer's parameters. In GPT-3 (d_model=12288), the FFN expands to 49,152 dimensions — each layer's FFN alone has ~1.2 billion parameters. Research suggests these layers function as a massive key-value memory: the first layer's weights encode patterns (keys) and the second layer stores the associated knowledge (values).
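A NumPy sketch of the FFN, plus a parameter-count check against the GPT-3 dimensions quoted above (the tanh form of GELU shown here is the approximation used in GPT-2):

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def ffn(x, W1, b1, W2, b2):
    # Expand to 4*d_model, apply GELU, project back.
    # Applied per token: no information moves between positions here.
    return gelu(x @ W1 + b1) @ W2 + b2

d_model = 512
W1 = np.random.randn(d_model, 4 * d_model) * 0.02
W2 = np.random.randn(4 * d_model, d_model) * 0.02
b1, b2 = np.zeros(4 * d_model), np.zeros(d_model)
x = np.random.randn(10, d_model)
assert ffn(x, W1, b1, W2, b2).shape == (10, d_model)

# Parameter-count check for the GPT-3 dimensions quoted above:
d, d_ff = 12288, 49152
params = d * d_ff + d_ff + d_ff * d + d   # two weight matrices + biases
assert params > 1.2e9                     # ~1.21 billion per layer
```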
Residual Connections
Residual connections (skip connections) add the input of a sublayer directly to its output: output = x + Sublayer(x). This creates a "gradient highway" that lets gradients flow unimpeded through dozens of layers, enabling the extreme depth that modern transformers require.
With residual connections, gradients maintain healthy magnitudes across all layers. The skip path provides an unimpeded route for gradient flow, regardless of depth.
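The identity path is easy to demonstrate with a deliberately degenerate sketch: even a sublayer that contributes nothing leaves the signal intact after many stacked blocks.

```python
import numpy as np

def residual(x, sublayer):
    # The skip path: even if the sublayer's output is tiny (or zero, as it
    # often is early in training), the input passes through unchanged.
    return x + sublayer(x)

x = np.random.randn(4, 16)
dead = lambda v: np.zeros_like(v)          # a sublayer contributing nothing
assert np.allclose(residual(x, dead), x)   # identity is preserved

# Stack 96 "dead" blocks: the signal still arrives intact at the top.
y = x
for _ in range(96):
    y = residual(y, dead)
assert np.allclose(y, x)
```

The same identity path works in reverse for gradients: the derivative of `x + Sublayer(x)` with respect to `x` always contains an identity term, so the gradient can never be wiped out by the sublayer alone.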
Analogy — Building with Scaffolding: Residual connections are like the scaffolding on a tall building under construction. Each floor (layer) does its work — adding walls, wiring, plumbing (attention, FFN) — but the scaffolding (skip connection) ensures that workers and materials (gradients) can reach any floor directly, without needing to climb through every floor below. Remove the scaffolding from a 96-story building and nothing above the 10th floor gets built.
He et al. (2015) introduced residual connections for image classification with ResNet, enabling 152-layer networks for the first time. The transformer adopted this idea and pushed it further — LLaMA 2 70B has 80 blocks, each with two residual connections, creating 160 skip paths. Without them, training a network this deep would be practically impossible due to vanishing gradients.
Scaling Laws
Transformer performance follows remarkably predictable power laws: loss decreases as a smooth function of parameters, data, and compute. The Chinchilla paper showed that many models were undertrained — given a compute budget, there's an optimal balance between model size and training tokens.
Chinchilla (Hoffmann et al., 2022) changed how labs train models. GPT-3 used 175B parameters but only 300B training tokens — far below the compute-optimal ratio. Chinchilla, a 70B model trained on 1.4T tokens, outperformed much larger models, including GPT-3 and the compute-matched 280B Gopher. The lesson: don't just make models bigger — train them on more data. LLaMA took this further, training even longer to optimize for inference cost rather than training compute.
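A back-of-the-envelope check of the numbers above. The 20-tokens-per-parameter rule and the C ≈ 6·N·D FLOPs estimate are commonly used approximations of the paper's fitted scaling laws, not exact figures:

```python
import math

def optimal_tokens(n_params: float) -> float:
    # Chinchilla rule of thumb: ~20 training tokens per parameter.
    return 20 * n_params

def training_flops(n_params: float, n_tokens: float) -> float:
    # Standard estimate: ~6 FLOPs per parameter per training token.
    return 6 * n_params * n_tokens

# GPT-3: 175B params on 300B tokens -- far below the ~3.5T-token optimum.
assert math.isclose(optimal_tokens(175e9), 3.5e12)

# Chinchilla: 70B params on 1.4T tokens -- right at the 20:1 ratio.
assert math.isclose(optimal_tokens(70e9), 1.4e12)
```

Under these approximations, a fixed compute budget C ≈ 6·N·D pins down a whole curve of (N, D) trade-offs, and Chinchilla's contribution was locating the optimum along it.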
Encoder vs Decoder
Not all transformers are the same. The three major architectural variants — encoder-only, decoder-only, and encoder-decoder — differ in how they mask attention, and this determines what tasks they excel at. The decoder-only architecture won the LLM era because generation is the hardest and most general capability.
BERT sees all tokens simultaneously — every token can attend to every other token, in both directions. This makes it excellent at understanding tasks where you have the full input available, like classification, named entity recognition, and semantic search.
- Text classification
- Named entity recognition
- Semantic search / embeddings
- Question answering (extractive)
Analogy — Reading vs Writing: An encoder (BERT) is like a reader who sees the entire page at once — great for comprehension and search, but can't write new text. A decoder (GPT) is like a writer who crafts one word at a time, never peeking ahead — this constraint forces it to truly understand language in order to predict what comes next. An encoder-decoder (T5) is like a translator: read the full source text, then write the translation word by word.
The decoder-only architecture (GPT) has become dominant not because it's theoretically superior, but because generation subsumes understanding — a model that can generate coherent text necessarily understands it. BERT-style encoders still excel at embedding and retrieval tasks (search, RAG), while encoder-decoder models remain strong for structured transformations like translation.
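The masking difference between the variants comes down to one matrix. A NumPy sketch (sizes are illustrative):

```python
import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    # Upper-triangular -inf mask: token i may attend only to positions <= i.
    future = np.triu(np.ones((seq_len, seq_len)), k=1)
    return np.where(future == 1, -np.inf, 0.0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

scores = np.random.randn(5, 5)                 # raw attention scores
decoder_w = softmax(scores + causal_mask(5))   # GPT-style: no peeking ahead
encoder_w = softmax(scores)                    # BERT-style: fully visible

assert np.allclose(np.triu(decoder_w, k=1), 0)      # future positions zeroed
assert not np.allclose(np.triu(encoder_w, k=1), 0)  # encoder sees everything
```

Adding -inf before the softmax drives those weights to exactly zero, so the causal constraint costs nothing extra at training time: all positions are still computed in parallel.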
Transformer Variants
The transformer architecture has escaped the domain of text. The same attention mechanism that powers GPT-4 now processes images, audio, protein sequences, and video — proving that self-attention is a universal computation primitive for sequential and structured data.
The Vision Transformer (ViT) splits an image into fixed-size patches (e.g. 16×16 pixels), flattens each patch into a vector, and processes the sequence of patch embeddings through a standard transformer encoder. It proved that pure attention — without any convolutions — achieves state-of-the-art image classification when trained on enough data.
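The patch-splitting step can be sketched with a couple of reshapes, using standard ViT-Base numbers (224×224 input, 16×16 patches):

```python
import numpy as np

def patchify(img: np.ndarray, p: int = 16) -> np.ndarray:
    # (H, W, C) image -> sequence of flattened p*p patches, each a "token".
    H, W, C = img.shape
    assert H % p == 0 and W % p == 0
    patches = img.reshape(H // p, p, W // p, p, C).swapaxes(1, 2)
    return patches.reshape(-1, p * p * C)       # (num_patches, patch_dim)

img = np.random.rand(224, 224, 3)               # a typical ViT input size
seq = patchify(img)
assert seq.shape == (196, 768)                  # 14*14 patches, 16*16*3 dims
```

After this step the patch vectors are linearly projected and given positional embeddings, and everything downstream is an ordinary transformer encoder: the architecture never knows it is looking at pixels rather than words.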
The transformer's universality is its most remarkable property. The same architecture that predicts the next word also folds proteins, generates images, transcribes speech, and plays games. This suggests that self-attention isn't just good at language — it's a fundamental computational primitive for learning patterns in any structured data.