Attention is the core innovation that makes transformers work — the mechanism that lets models decide which parts of the input matter most for each prediction.
From basic Q/K/V projections to Flash Attention and modern variants like GQA, this interactive guide covers the math, the intuition, and the engineering behind attention at scale.
Attention is how a model decides which words matter most when predicting the next one. Think of it as a spotlight — for every token, the model shines a beam across the entire input and decides where to focus.
Click a word to see which other words it “attends” to. Brighter = stronger attention weight.
Try clicking “it” — notice how it attends strongly to “cat” (weight 0.42). The model has learned that “it” refers to “cat” through the attention mechanism, without any explicit coreference annotation.
Analogy — Cocktail Party: Attention works like your brain at a cocktail party. Dozens of conversations happen simultaneously (all tokens), but you unconsciously tune into the one that's most relevant to you — your name, an interesting topic, a friend's voice. The “attention weights” are how strongly you tune into each conversation. High weight = focused listening; low weight = background noise you mostly ignore.
Every attention head creates three vectors from each token: a Query (what am I looking for?), a Key (what do I contain?), and a Value (what information do I carry?). These projections are the foundation of the attention computation.
Token embeddings enter the attention layer
4 tokens, each with a 3-dimensional embedding
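As a concrete sketch of these projections (a minimal NumPy example matching the demo's shapes — 4 tokens, 3-dimensional embeddings — with random matrices standing in for learned weights):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup: 4 tokens, each with a 3-dimensional embedding.
n_tokens, d_model, d_k = 4, 3, 3
X = rng.normal(size=(n_tokens, d_model))   # token embeddings

# Projection matrices; in a real model these are learned, here random.
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q = X @ W_q   # Query: what is each token looking for?
K = X @ W_k   # Key:   what does each token contain?
V = X @ W_v   # Value: what information does each token carry?

print(Q.shape, K.shape, V.shape)  # (4, 3) (4, 3) (4, 3)
```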
The Full Formula
Attention(Q, K, V) = softmax(Q·Kᵀ / √d_k) · V
Analogy — Detective Connecting Clues: Q/K/V works like a detective investigating a case. The Query is the question you're asking (“Who had motive?”). The Key is the label on each evidence folder (“alibi,” “fingerprint,” “motive”). The Value is the actual content inside each folder. The detective (attention) matches questions to folder labels (Q·K), then reads the contents (V) of the most relevant folders, weighted by how strong each match is.
The weight matrices Wq, Wk, Wv are the only learned parameters in attention. During training, the model learns what makes a good query, what makes a good key, and what information to store in values — all through backpropagation.
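Putting the formula and the projections together, scaled dot-product attention is only a few lines. This is an illustrative NumPy sketch, not any particular library's implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # similarity of every query to every key
    weights = softmax(scores, axis=-1)   # each row becomes a probability distribution
    return weights @ V, weights          # weighted sum of values, plus the weights

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 3)) for _ in range(3))
out, weights = attention(Q, K, V)
assert np.allclose(weights.sum(axis=-1), 1.0)  # every row sums to 1
```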
The attention score matrix shows how much each token attends to every other token. Each row is a probability distribution — it sums to 1.0 and represents where that token is “looking.”
Row 3 (“sat”) attends most strongly to “cat” with weight 0.30. Each row is independently normalized by softmax — tokens compete for attention within each row.
Attention(Q, K, V) = softmax(mask(Q·Kᵀ / √d_k)) · V
Softmax converts raw attention logits into a probability distribution. Temperature controls how “sharp” or “flat” that distribution is — lower temperature means the model is more decisive, higher temperature means more uniform attention.
Raw Logits
After Softmax (T=1.00)
Analogy — Magnifying Glass with Adjustable Focus: Temperature is like the focus ring on a magnifying glass. At low temperature (T=0.1), the lens is focused to a pinpoint — only one token gets nearly all the attention. At high temperature (T=5.0), the lens is defocused and light spreads evenly across everything. The 1/√d_k scaling in standard attention is like choosing the default focus that works well for the lens size (dimension).
At T=0.1 the highest logit (3.2 for “a”) dominates — softmax outputs nearly 1.0 for that token. At T=5.0 the distribution flattens and all tokens get roughly equal weight. Standard attention uses T=1.0, but the scaling by 1/√d_k serves a similar purpose.
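The effect is easy to verify numerically. In this sketch the logit values are illustrative (3.2 matches the highest logit mentioned above; the rest are made up):

```python
import numpy as np

def softmax_T(logits, T=1.0):
    z = logits / T                 # lower T stretches gaps between logits apart
    z = z - z.max()                # numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([3.2, 1.1, 0.4, -0.5])  # illustrative raw attention logits

for T in (0.1, 1.0, 5.0):
    print(f"T={T}:", np.round(softmax_T(logits, T), 3))
```

At T=0.1 the first entry absorbs essentially all the probability mass; at T=5.0 the four weights are nearly uniform.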
Instead of one attention function, transformers run multiple attention heads in parallel. Each head independently learns what to attend to — one might focus on syntax, another on position, another on coreference. Their outputs are concatenated and projected.
In GPT-3, there are 96 attention heads per layer across 96 layers. Each head operates on a 128-dimensional subspace (12288 / 96 = 128). Research shows that different heads reliably specialize — some for syntax, some for coreference, some for positional patterns — without being explicitly trained to do so.
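The split-compute-concatenate pattern can be sketched in NumPy. The dimensions here are tiny and illustrative (4 heads on a 16-dimensional model, not GPT-3's 96 heads on 12288), and the weights are random rather than learned:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, n_heads, W_q, W_k, W_v, W_o):
    n, d_model = X.shape
    d_head = d_model // n_heads
    # Project once, then split the feature dimension into per-head subspaces.
    Q = (X @ W_q).reshape(n, n_heads, d_head).transpose(1, 0, 2)
    K = (X @ W_k).reshape(n, n_heads, d_head).transpose(1, 0, 2)
    V = (X @ W_v).reshape(n, n_heads, d_head).transpose(1, 0, 2)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, n, n)
    out = softmax(scores) @ V                            # each head attends independently
    out = out.transpose(1, 0, 2).reshape(n, d_model)     # concatenate head outputs
    return out @ W_o                                     # final output projection

rng = np.random.default_rng(0)
d_model, n = 16, 5
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
y = multi_head_attention(rng.normal(size=(n, d_model)), 4, W_q, W_k, W_v, W_o)
print(y.shape)  # (5, 16)
```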
In autoregressive models like GPT, each token can only attend to tokens that came before it (and itself). This is enforced by masking — setting future positions to −∞ before softmax, which zeroes them out.
During generation, the model predicts one token at a time. It would be “cheating” if it could see future tokens. The mask enforces this causal constraint during both training and inference.
With the mask on, token 0 (“I”) can only attend to itself → weight 1.00. Token 4 (“!”) can attend to all 5 tokens, so its weights are spread across the full context.
Analogy — Reading a Mystery Novel: Causal masking forces the model to read like a human reads a mystery — one page at a time, never peeking ahead. When predicting what comes after “The butler entered the,” the model cannot see that the next word is “kitchen.” This constraint is what makes autoregressive generation possible: the model must genuinely predict each word using only what came before, just like you'd guess the murderer before the last chapter.
Toggle the mask off to see how the attention weights redistribute. Notice that early tokens (like “I”) change the most — with the mask on they can only attend to the few positions at or before their own (the first token only to itself), but without it they can look at all positions.
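The −∞ trick is simple to implement: fill the upper triangle of the score matrix with −∞, and softmax turns those entries into exact zeros. A minimal NumPy sketch:

```python
import numpy as np

def causal_attention_weights(scores):
    n = scores.shape[0]
    # Upper triangle above the diagonal = future positions; set to -inf.
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    masked = np.where(future, -np.inf, scores)
    e = np.exp(masked - masked.max(axis=-1, keepdims=True))  # exp(-inf) = 0
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
W = causal_attention_weights(rng.normal(size=(5, 5)))
print(np.round(W, 2))
# Row 0 is [1, 0, 0, 0, 0]: the first token can only attend to itself,
# while the last row spreads its weight across all five positions.
```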
In encoder-decoder models, the decoder needs to look at the encoder's output to ground its predictions. Cross-attention does this: the Query comes from the decoder, while Keys and Values come from the encoder.
Encoder Output (French)
Provides K and V
Decoder (English) — Click a token
Provides Q
Decoder token “cat” attending to encoder outputs:
Use Cases
Click “cat” in the decoder — it attends 0.75 to “chat” (the French translation). Cross-attention creates the bridge between the two sequences, letting the decoder selectively read from the encoder's representation.
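Structurally, cross-attention is the same computation with the Q and K/V inputs decoupled. A hedged NumPy sketch (random weights, toy sequence lengths standing in for the French/English example):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(decoder_h, encoder_h, W_q, W_k, W_v):
    Q = decoder_h @ W_q            # queries come from the decoder
    K = encoder_h @ W_k            # keys come from the encoder
    V = encoder_h @ W_v            # values come from the encoder
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V     # each decoder token reads from the encoder

rng = np.random.default_rng(0)
enc = rng.normal(size=(4, 8))      # e.g. 4 encoder (French) token states
dec = rng.normal(size=(3, 8))      # e.g. 3 decoder (English) token states
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
out = cross_attention(dec, enc, W_q, W_k, W_v)
print(out.shape)  # (3, 8): one encoder read-out per decoder token
```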
Standard attention materializes the full N×N score matrix in GPU memory (HBM), which is O(N²). Flash Attention tiles the computation so that it fits in fast on-chip SRAM, dramatically reducing memory reads/writes while producing mathematically identical results.
O(N²) — stores full 4096×4096 attention matrix
O(N) — only Q/K/V tiles in SRAM at a time
High Bandwidth Memory
~2 TB/s bandwidth, ~80 GB capacity (A100)
Standard attention reads Q and K from HBM, writes the full N×N score matrix back to HBM, reads it again to apply softmax and multiply by V, then writes the output. This memory traffic dominates runtime.
On-Chip Static RAM
~19 TB/s bandwidth, ~20 MB capacity (A100)
Flash Attention tiles Q, K, V into blocks that fit in SRAM. Computes attention within each tile, accumulates output — never materializes the full N×N matrix.
Flash Attention doesn't approximate anything — it computes the exact same output as standard attention. The speedup comes entirely from being IO-aware: minimizing reads and writes to slow HBM by keeping data in fast SRAM through a careful tiling and recomputation strategy.
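The key enabler is the online softmax: you can stream over K/V tiles, keeping only a running max and running denominator per row, and rescale previously accumulated partial outputs as the max updates. This NumPy sketch shows the numerics only (real Flash Attention is a fused GPU kernel; there is no actual SRAM management here):

```python
import numpy as np

def flash_attention(Q, K, V, block=2):
    """Tiled attention via online softmax. Produces the exact same output
    as softmax(Q K^T / sqrt(d)) V without forming the full N x N matrix."""
    n, d = Q.shape
    out = np.zeros((n, d))
    row_max = np.full(n, -np.inf)   # running max of each row's logits
    row_sum = np.zeros(n)           # running softmax denominator
    scale = 1.0 / np.sqrt(d)
    for j in range(0, n, block):                  # stream over K/V tiles
        Kj, Vj = K[j:j+block], V[j:j+block]
        S = Q @ Kj.T * scale                      # (n, block) partial scores
        new_max = np.maximum(row_max, S.max(axis=1))
        correction = np.exp(row_max - new_max)    # rescale earlier partials
        P = np.exp(S - new_max[:, None])
        row_sum = row_sum * correction + P.sum(axis=1)
        out = out * correction[:, None] + P @ Vj
        row_max = new_max
    return out / row_sum[:, None]                 # normalize once at the end

# Check exactness against the naive materialized version.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(6, 4)) for _ in range(3))
S = Q @ K.T / np.sqrt(Q.shape[-1])
P = np.exp(S - S.max(axis=1, keepdims=True))
ref = (P / P.sum(axis=1, keepdims=True)) @ V
assert np.allclose(flash_attention(Q, K, V, block=2), ref)
```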
Not all attention is the same. Different architectures use different attention patterns to balance quality and efficiency. Research also reveals that trained heads develop recognizable patterns — like induction heads that enable in-context learning.
Each token attends to a fixed window of nearby tokens. Used in Longformer and BigBird for efficient long-context processing.
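A sliding-window pattern is just a different mask. This sketch builds a causal local mask where each token sees itself plus the previous window − 1 tokens (a simplification of what Longformer/BigBird do, which also add global tokens):

```python
import numpy as np

def sliding_window_mask(n, window):
    # Token i may attend to tokens j with i - window < j <= i (causal, local).
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (j > i - window)

print(sliding_window_mask(6, 3).astype(int))
# Each row has at most 3 ones, hugging the diagonal: O(N * window)
# nonzero scores instead of O(N^2).
```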
Induction heads are one of the most important discoveries in mechanistic interpretability. They allow models to “copy” from context — if the model sees “A B ... A”, an induction head learns to predict “B” will follow. This is a core mechanism behind in-context learning.
Standard multi-head attention is expensive at inference time because every head maintains its own KV-cache. Modern variants like MQA and GQA reduce this cost by sharing Key and Value heads across multiple Query heads.
The original design. Every attention head has its own Q, K, and V projections.
Query Heads (32)
KV Heads (32)
Each Q head has its own dedicated KV head
d_model=4096, d_k=128, 32 layers, fp16 — per-sequence KV-cache size:
Also Worth Knowing
At 128K sequence length, MHA requires 64.0 GB of KV-cache per sequence, while GQA needs only 16.0 GB — a 4× reduction. This is why nearly all modern LLMs (Llama 3, Mistral, Gemma 2) use GQA.
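The arithmetic behind those numbers: 2 tensors (K and V) × layers × KV heads × head dimension × 2 bytes (fp16), per token. A small sketch using the configuration above (the 8-group GQA figure is inferred from the stated 4× reduction):

```python
def kv_cache_gb(seq_len, n_layers=32, n_kv_heads=32, d_k=128, bytes_per=2):
    # K and V, per layer, per KV head, d_k values each, fp16 = 2 bytes.
    return 2 * n_layers * n_kv_heads * d_k * bytes_per * seq_len / 1024**3

seq = 128 * 1024
print(kv_cache_gb(seq, n_kv_heads=32))  # MHA: 64.0 GB
print(kv_cache_gb(seq, n_kv_heads=8))   # GQA with 8 KV groups: 16.0 GB
print(kv_cache_gb(seq, n_kv_heads=1))   # MQA: 2.0 GB
```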
I architect enterprise AI systems — RAG pipelines, multi-agent orchestration, custom copilots, and the infrastructure to run them reliably at scale.