Why Transformers?
Before transformers, recurrent neural networks (RNNs) processed language one token at a time — each step depending on the previous hidden state. This sequential bottleneck meant training was slow and long-range dependencies were lost. The transformer solved this with a radical idea: process all tokens in parallel.
Analogy — Orchestra Conductor: An RNN is like a musician reading sheet music one note at a time, passing the melody to the next player. A transformer is like an orchestra conductor who sees the entire score at once — every instrument (token) can see and harmonize with every other instrument simultaneously. The conductor (attention) decides which sections are most relevant to the current passage, allowing complex harmonies that sequential reading could never capture.
"Attention Is All You Need" (Vaswani et al., 2017) showed that self-attention alone — without any recurrence or convolutions — could achieve state-of-the-art translation quality. This single paper spawned GPT, BERT, and every modern LLM. The key unlock was parallelism: transformers turned training from a sequential bottleneck into a massively parallel matrix multiplication that GPUs excel at.
The Architecture Overview
A transformer is a stack of identical blocks, each containing the same sequence of operations: layer normalization, multi-head attention, a feed-forward network, and residual connections around each sublayer. GPT-2 uses 12 blocks, GPT-3 uses 96, and LLaMA 2 70B uses 80. The architecture is remarkably simple — its power comes from scale and the attention mechanism.
Layer normalization normalizes activations across the feature dimension before each sublayer. It stabilizes training by ensuring consistent input magnitudes, enabling deeper networks and higher learning rates.
The entire transformer block is applied identically N times — only the learned weights differ between blocks. Earlier blocks tend to learn syntactic patterns (grammar, word order), while later blocks learn semantic relationships (meaning, reasoning). This emergent specialization happens purely from training — it's not designed in.
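The repetition described above can be sketched in NumPy. This is a toy block, not any particular model: single-head attention and ReLU stand in for multi-head attention and GELU, and all sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, seq, n_blocks = 64, 10, 3   # toy sizes; real models are far larger

def norm(x):
    # LayerNorm without learned scale/shift, for brevity
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + 1e-5)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def attn(x, W):
    # Single-head self-attention, for brevity
    Q, K, V = x @ W["q"], x @ W["k"], x @ W["v"]
    return softmax(Q @ K.T / np.sqrt(d)) @ V @ W["o"]

def ffn(x, W):
    # ReLU stands in for GELU here
    return np.maximum(0, x @ W["up"]) @ W["down"]

def block(x, W):
    # Pre-norm residual pattern: x + Sublayer(norm(x)), twice per block
    x = x + attn(norm(x), W)
    x = x + ffn(norm(x), W)
    return x

# The identical block is applied N times -- only the weights differ per block.
shapes = [("q", (d, d)), ("k", (d, d)), ("v", (d, d)),
          ("o", (d, d)), ("up", (d, 4 * d)), ("down", (4 * d, d))]
weights = [{k: rng.standard_normal(s) * 0.02 for k, s in shapes}
           for _ in range(n_blocks)]

x = rng.standard_normal((seq, d))
for W in weights:
    x = block(x, W)
assert x.shape == (seq, d)   # shape is preserved through every block
```

Because every block maps a `(seq, d)` array to a `(seq, d)` array, stacking more of them is purely a matter of adding weights, which is what makes depth scaling so mechanical.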
Input Embeddings
Before a transformer can process text, each token must become a vector. The token embedding captures semantic meaning (what the word means), while the positional encoding captures position (where it appears). These two vectors are combined to produce the input to the first transformer block.
Sinusoidal encoding uses fixed sin/cos functions at different frequencies. For position pos and pair index i: PE(pos, 2i) = sin(pos/10000^(2i/d_model)), PE(pos, 2i+1) = cos(pos/10000^(2i/d_model)). Used in the original transformer — no learned parameters.
Analogy — Library Index System: Token embeddings are like a book's content — they tell you what the book is about. Positional encodings are like the shelf number and catalog index — they tell you where it belongs. Without the index, a library is just a pile of books. Without positional encodings, a transformer is just a bag of words with no sense of order.
Positional encoding is essential because self-attention is permutation-invariant — without it, "the cat sat on the mat" and "mat the on sat cat the" would produce identical representations. RoPE (rotary position embedding) has become the dominant scheme because it encodes relative position through the attention dot product, enabling better length generalization.
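A minimal NumPy sketch of the sinusoidal scheme (sizes are illustrative):

```python
import numpy as np

def sinusoidal_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Fixed sin/cos positional encodings, shape (seq_len, d_model)."""
    pos = np.arange(seq_len)[:, None]        # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]     # (1, d_model/2): pair index
    freq = 1.0 / 10000 ** (2 * i / d_model)  # one frequency per sin/cos pair
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(pos * freq)         # even dimensions: sine
    pe[:, 1::2] = np.cos(pos * freq)         # odd dimensions: cosine
    return pe

pe = sinusoidal_encoding(seq_len=50, d_model=64)
# Position 0 gives sin(0)=0 in even dims, cos(0)=1 in odd dims:
assert np.allclose(pe[0, 0::2], 0.0) and np.allclose(pe[0, 1::2], 1.0)
```

In the original transformer this matrix is simply added to the token embeddings, so each input vector carries both what the token means and where it sits.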
Layer Normalization
Layer normalization rescales activations to have zero mean and unit variance across the feature dimension. This seemingly simple operation is critical — without it, activations can explode or vanish as data passes through dozens of transformer blocks, making training unstable or impossible.
Pre-norm places normalization before each sublayer. This produces more stable gradients during training, enabling higher learning rates and faster convergence. Nearly all modern LLMs use pre-norm.
The switch from post-norm to pre-norm was one of the most impactful yet least discussed changes in transformer history. GPT-2 adopted pre-norm and found it dramatically improved training stability — enabling the scaling push that led to GPT-3, LLaMA, and all modern LLMs. RMSNorm (a simplified variant) is now standard in LLaMA and Mistral.
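Both normalizations mentioned above can be sketched in a few lines of NumPy (the `eps` value is a typical default, not a prescription):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize across the feature (last) dimension: zero mean, unit variance,
    # then apply learned scale (gamma) and shift (beta).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def rms_norm(x, gamma, eps=1e-5):
    # RMSNorm (LLaMA, Mistral): skip mean subtraction, rescale by the
    # root-mean-square of the features. Cheaper, works just as well.
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    return gamma * x / rms

x = np.random.randn(2, 8) * 5 + 3            # deliberately wild activations
d = x.shape[-1]
y = layer_norm(x, np.ones(d), np.zeros(d))
assert np.allclose(y.mean(-1), 0, atol=1e-6)  # zero mean per token
assert np.allclose(y.var(-1), 1, atol=1e-3)   # ~unit variance per token
```

The pre-norm arrangement simply calls one of these on the input *before* the attention or FFN sublayer, with the residual added afterwards.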
Multi-Head Attention
Multi-head attention is the core mechanism that gives transformers their power. Instead of computing a single attention function, the model runs multiple attention heads in parallel, each learning to focus on different types of relationships — syntax, semantics, position, and more.
The dot product between Q and K measures how relevant each token is to every other token. Dividing by √d_k prevents the softmax from saturating when dimensions are large.
Each head operates on a d_model/h dimensional subspace. With 8 heads and d_model=512, each head works in 64 dimensions — cheap individually but powerful together.
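A NumPy sketch of scaled dot-product attention and the head split described above (the weight shapes and 0.02 init scale are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product: softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)
    return softmax(scores) @ V

def multi_head(x, Wq, Wk, Wv, Wo, h):
    # Project, split d_model into h subspaces, attend per head, merge.
    seq, d_model = x.shape
    d_h = d_model // h
    def split(W):  # (seq, d_model) -> (h, seq, d_h)
        return (x @ W).reshape(seq, h, d_h).transpose(1, 0, 2)
    heads = attention(split(Wq), split(Wk), split(Wv))     # (h, seq, d_h)
    concat = heads.transpose(1, 0, 2).reshape(seq, d_model)
    return concat @ Wo

rng = np.random.default_rng(0)
d, h, seq = 512, 8, 10
x = rng.standard_normal((seq, d))
Ws = [rng.standard_normal((d, d)) * 0.02 for _ in range(4)]
out = multi_head(x, *Ws, h=h)
assert out.shape == (seq, d)   # each head worked in 512/8 = 64 dimensions
```

Note that the head split is just a reshape: the heads cost no more compute than one full-width attention, which is why running 8 (or 96) of them in parallel is cheap.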
Analogy — Team of Specialists: Multi-head attention is like a consulting team analyzing a document. One specialist focuses on grammar (subject-verb agreement), another on meaning (who does “it” refer to?), a third on structure (is this a question or a statement?), and a fourth on long-range context (what was discussed three paragraphs ago?). Each specialist writes a report (head output), and they're all merged into one comprehensive analysis (concatenation + projection).
In GPT-3's 175B parameter model, each of the 96 layers has 96 attention heads, each working in 128 dimensions. That's 9,216 heads total, each independently learning what to attend to. Research has shown that different heads specialize: some track syntax, some resolve pronouns, some handle long-range dependencies — and you can even identify individual "induction heads" responsible for in-context learning.
Feed-Forward Network
After attention gathers context from other tokens, the feed-forward network processes each token independently. It consists of two linear transformations with a GELU activation between them, expanding to 4× the model dimension before projecting back. This is where the model stores and retrieves factual knowledge.
The FFN accounts for roughly 2/3 of a transformer's parameters. In GPT-3 (d_model=12288), the FFN expands to 49,152 dimensions — each layer's FFN alone has ~1.2 billion parameters. Research suggests these layers function as a massive key-value memory: the first layer's weights encode patterns (keys) and the second layer stores the associated knowledge (values).
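A NumPy sketch of the FFN, plus a parameter-count check against the GPT-3 dimensions quoted above (the tanh form of GELU shown here is the approximation used in GPT-2):

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def ffn(x, W1, b1, W2, b2):
    # Expand to 4*d_model, apply GELU, project back.
    # Applied per token: no information moves between positions here.
    return gelu(x @ W1 + b1) @ W2 + b2

d_model = 512
W1 = np.random.randn(d_model, 4 * d_model) * 0.02
W2 = np.random.randn(4 * d_model, d_model) * 0.02
b1, b2 = np.zeros(4 * d_model), np.zeros(d_model)
x = np.random.randn(10, d_model)
assert ffn(x, W1, b1, W2, b2).shape == (10, d_model)

# Parameter-count check for the GPT-3 dimensions quoted above:
d, d_ff = 12288, 49152
params = d * d_ff + d_ff + d_ff * d + d   # two weight matrices + biases
assert params > 1.2e9                     # ~1.21 billion per layer
```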
Residual Connections
Residual connections (skip connections) add the input of a sublayer directly to its output: output = x + Sublayer(x). This creates a "gradient highway" that lets gradients flow unimpeded through dozens of layers, enabling the extreme depth that modern transformers require.
With residual connections, gradients maintain healthy magnitudes across all layers. The skip path provides an unimpeded route for gradient flow, regardless of depth.
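The identity path is easy to demonstrate with a deliberately degenerate sketch: even a sublayer that contributes nothing leaves the signal intact after many stacked blocks.

```python
import numpy as np

def residual(x, sublayer):
    # The skip path: even if the sublayer's output is tiny (or zero, as it
    # often is early in training), the input passes through unchanged.
    return x + sublayer(x)

x = np.random.randn(4, 16)
dead = lambda v: np.zeros_like(v)          # a sublayer contributing nothing
assert np.allclose(residual(x, dead), x)   # identity is preserved

# Stack 96 "dead" blocks: the signal still arrives intact at the top.
y = x
for _ in range(96):
    y = residual(y, dead)
assert np.allclose(y, x)
```

The same identity path works in reverse for gradients: the derivative of `x + Sublayer(x)` with respect to `x` always contains an identity term, so the gradient can never be wiped out by the sublayer alone.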
Analogy — Building with Scaffolding: Residual connections are like the scaffolding on a tall building under construction. Each floor (layer) does its work — adding walls, wiring, plumbing (attention, FFN) — but the scaffolding (skip connection) ensures that workers and materials (gradients) can reach any floor directly, without needing to climb through every floor below. Remove the scaffolding from a 96-story building and nothing above the 10th floor gets built.
He et al. (2015) introduced residual connections for image classification with ResNet, enabling 152-layer networks for the first time. The transformer adopted this idea and pushed it further — LLaMA 2 70B has 80 blocks, each with two residual connections, creating 160 skip paths. Without them, training a network this deep would be practically impossible due to vanishing gradients.
Scaling Laws
Transformer performance follows remarkably predictable power laws: loss decreases as a smooth function of parameters, data, and compute. The Chinchilla paper showed that many models were undertrained — given a compute budget, there's an optimal balance between model size and training tokens.
Chinchilla (Hoffmann et al., 2022) changed how labs train models. GPT-3 used 175B parameters but only 300B training tokens — far below the compute-optimal ratio. Chinchilla, a 70B model trained on 1.4T tokens, outperformed much larger models, including GPT-3 and the compute-matched 280B Gopher. The lesson: don't just make models bigger — train them on more data. LLaMA took this further, training even longer to optimize for inference cost rather than training compute.
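A back-of-the-envelope check of the numbers above. The 20-tokens-per-parameter rule and the C ≈ 6·N·D FLOPs estimate are commonly used approximations of the paper's fitted scaling laws, not exact figures:

```python
import math

def optimal_tokens(n_params: float) -> float:
    # Chinchilla rule of thumb: ~20 training tokens per parameter.
    return 20 * n_params

def training_flops(n_params: float, n_tokens: float) -> float:
    # Standard estimate: ~6 FLOPs per parameter per training token.
    return 6 * n_params * n_tokens

# GPT-3: 175B params on 300B tokens -- far below the ~3.5T-token optimum.
assert math.isclose(optimal_tokens(175e9), 3.5e12)

# Chinchilla: 70B params on 1.4T tokens -- right at the 20:1 ratio.
assert math.isclose(optimal_tokens(70e9), 1.4e12)
```

Under these approximations, a fixed compute budget C ≈ 6·N·D pins down a whole curve of (N, D) trade-offs, and Chinchilla's contribution was locating the optimum along it.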
Encoder vs Decoder
Not all transformers are the same. The three major architectural variants — encoder-only, decoder-only, and encoder-decoder — differ in how they mask attention, and this determines what tasks they excel at. The decoder-only architecture won the LLM era because generation is the hardest and most general capability.
BERT sees all tokens simultaneously — every token can attend to every other token, in both directions. This makes it excellent at understanding tasks where you have the full input available, like classification, named entity recognition, and semantic search.
- Text classification
- Named entity recognition
- Semantic search / embeddings
- Question answering (extractive)
Analogy — Reading vs Writing: An encoder (BERT) is like a reader who sees the entire page at once — great for comprehension and search, but can't write new text. A decoder (GPT) is like a writer who crafts one word at a time, never peeking ahead — this constraint forces it to truly understand language in order to predict what comes next. An encoder-decoder (T5) is like a translator: read the full source text, then write the translation word by word.
The decoder-only architecture (GPT) has become dominant not because it's theoretically superior, but because generation subsumes understanding — a model that can generate coherent text necessarily understands it. BERT-style encoders still excel at embedding and retrieval tasks (search, RAG), while encoder-decoder models remain strong for structured transformations like translation.
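The masking difference between the variants comes down to one matrix. A NumPy sketch (sizes are illustrative):

```python
import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    # Upper-triangular -inf mask: token i may attend only to positions <= i.
    future = np.triu(np.ones((seq_len, seq_len)), k=1)
    return np.where(future == 1, -np.inf, 0.0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

scores = np.random.randn(5, 5)                 # raw attention scores
decoder_w = softmax(scores + causal_mask(5))   # GPT-style: no peeking ahead
encoder_w = softmax(scores)                    # BERT-style: fully visible

assert np.allclose(np.triu(decoder_w, k=1), 0)      # future positions zeroed
assert not np.allclose(np.triu(encoder_w, k=1), 0)  # encoder sees everything
```

Adding -inf before the softmax drives those weights to exactly zero, so the causal constraint costs nothing extra at training time: all positions are still computed in parallel.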
Transformer Variants
The transformer architecture has escaped the domain of text. The same attention mechanism that powers GPT-4 now processes images, audio, protein sequences, and video — proving that self-attention is a universal computation primitive for sequential and structured data.
The Vision Transformer (ViT) splits an image into fixed-size patches (e.g. 16×16 pixels), flattens each patch into a vector, and processes the sequence of patch embeddings through a standard transformer encoder. It proved that pure attention — without any convolutions — achieves state-of-the-art image classification when trained on enough data.
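The patch-splitting step can be sketched with a couple of reshapes, using standard ViT-Base numbers (224×224 input, 16×16 patches):

```python
import numpy as np

def patchify(img: np.ndarray, p: int = 16) -> np.ndarray:
    # (H, W, C) image -> sequence of flattened p*p patches, each a "token".
    H, W, C = img.shape
    assert H % p == 0 and W % p == 0
    patches = img.reshape(H // p, p, W // p, p, C).swapaxes(1, 2)
    return patches.reshape(-1, p * p * C)       # (num_patches, patch_dim)

img = np.random.rand(224, 224, 3)               # a typical ViT input size
seq = patchify(img)
assert seq.shape == (196, 768)                  # 14*14 patches, 16*16*3 dims
```

After this step the patch vectors are linearly projected and given positional embeddings, and everything downstream is an ordinary transformer encoder: the architecture never knows it is looking at pixels rather than words.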
The transformer's universality is its most remarkable property. The same architecture that predicts the next word also folds proteins, generates images, transcribes speech, and plays games. This suggests that self-attention isn't just good at language — it's a fundamental computational primitive for learning patterns in any structured data.