Under the Hood

Attention Mechanisms

Attention is the core innovation that makes transformers work — the mechanism that lets models decide which parts of the input matter most for each prediction.

From basic Q/K/V projections to Flash Attention and modern variants like GQA, this interactive guide covers the math, the intuition, and the engineering behind attention at scale.

20 min read · By Hammad Abbasi · Interactive
01

What Is Attention?

Attention is how a model decides which words matter most when predicting the next one. Think of it as a spotlight — for every token, the model shines a beam across the entire input and decides where to focus.

Interactive — Click any word to see its attention

Click a word to see which other words it “attends” to. Brighter = stronger attention weight.

“it” attends to:
The 0.04 · cat 0.42 · sat 0.08 · on 0.02 · the 0.03 · mat 0.05 · because 0.12 · it 0.10 · was 0.08 · tired 0.06

Try clicking “it” — notice how it attends strongly to “cat” (weight 0.42). The model has learned that “it” refers to “cat” through the attention mechanism, without any explicit coreference annotation.

Analogy — Cocktail Party: Attention works like your brain at a cocktail party. Dozens of conversations happen simultaneously (all tokens), but you unconsciously tune into the one that's most relevant to you — your name, an interesting topic, a friend's voice. The “attention weights” are how strongly you tune into each conversation. High weight = focused listening; low weight = background noise you mostly ignore.

02

Query, Key, Value

Every attention head creates three vectors from each token: a Query (what am I looking for?), a Key (what do I contain?), and a Value (what information do I carry?). These projections are the foundation of the attention computation.

Step-by-step Q/K/V computation

Token embeddings enter the attention layer

4 tokens, each with a 3-dimensional embedding

X (4×3) — Input
[1.00  0.50  0.30]
[0.20  0.80  0.60]
[0.70  0.10  0.90]
[0.40  0.60  0.20]

The Full Formula

Attention(Q, K, V) = softmax(Q·Kᵀ / √d_k) · V

Analogy — Detective Connecting Clues: Q/K/V works like a detective investigating a case. The Query is the question you're asking (“Who had motive?”). The Key is the label on each evidence folder (“alibi,” “fingerprint,” “motive”). The Value is the actual content inside each folder. The detective (attention) matches questions to folder labels (Q·K), then reads the contents (V) of the most relevant folders, weighted by how strong each match is.

The weight matrices Wq, Wk, Wv are the only learned parameters in attention. During training, the model learns what makes a good query, what makes a good key, and what information to store in values — all through backpropagation.
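The projection-and-attend pipeline can be sketched in a few lines of NumPy. The input X is the 4×3 matrix from the walkthrough above; the weight matrices here are random stand-ins for the learned Wq, Wk, Wv (in a real model they come from training).

```python
import numpy as np

# Input X from the walkthrough: 4 tokens, each a 3-dimensional embedding.
X = np.array([[1.00, 0.50, 0.30],
              [0.20, 0.80, 0.60],
              [0.70, 0.10, 0.90],
              [0.40, 0.60, 0.20]])

d_k = 3                                   # toy-sized head dimension
rng = np.random.default_rng(0)            # random stand-ins for learned weights
Wq, Wk, Wv = (rng.normal(size=(3, d_k)) for _ in range(3))

Q, K, V = X @ Wq, X @ Wk, X @ Wv          # every token gets a query, key, value

scores = Q @ K.T / np.sqrt(d_k)           # how well each query matches each key
e = np.exp(scores - scores.max(-1, keepdims=True))
weights = e / e.sum(-1, keepdims=True)    # softmax: each row sums to 1
out = weights @ V                         # weighted sum of values, shape (4, 3)
```

Note that the only trainable pieces are Wq, Wk, Wv — everything after the projections is fixed arithmetic.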

03

Attention Scores

The attention score matrix shows how much each token attends to every other token. Each row is a probability distribution — it sums to 1.0 and represents where that token is “looking.”

Interactive heatmap — Click a row to inspect

        The   cat   sat   on    the   mat
The     0.25  0.20  0.15  0.10  0.18  0.12
cat     0.18  0.28  0.14  0.08  0.12  0.20
sat     0.10  0.30  0.18  0.22  0.08  0.12
on      0.06  0.10  0.24  0.15  0.08  0.37
the     0.15  0.12  0.10  0.10  0.28  0.25
mat     0.08  0.12  0.18  0.28  0.16  0.18

(Cell shading in the interactive version runs from low through medium to high attention.)

Row 3 (“sat”) attends most strongly to “cat” with weight 0.30. Each row is independently normalized by softmax — tokens compete for attention within each row.

Animated — How Attention Scores Are Computed

1. Q·Kᵀ — dot product of queries and keys
2. ÷ √d_k — scale by dimension
3. Mask — apply causal mask (−∞)
4. Softmax — normalize to probabilities
5. × V — weighted sum of values

Attention(Q, K, V) = softmax(mask(Q·Kᵀ / √d_k)) · V
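The five steps can be written as one small function. This is a minimal NumPy sketch with random Q, K, V; the max-subtraction inside the softmax is a standard numerical-stability trick, not an extra conceptual step.

```python
import numpy as np

def attention(Q, K, V, causal=False):
    """Score, scale, (mask), softmax, weighted sum — the pipeline above."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # steps 1-2: dot product, scale
    if causal:                                 # step 3: -inf above the diagonal
        n = scores.shape[0]
        keep = np.tril(np.ones((n, n), dtype=bool))
        scores = np.where(keep, scores, -np.inf)
    e = np.exp(scores - scores.max(-1, keepdims=True))
    weights = e / e.sum(-1, keepdims=True)     # step 4: softmax per row
    return weights @ V, weights                # step 5: weighted sum of values

rng = np.random.default_rng(1)
Q, K, V = (rng.normal(size=(6, 8)) for _ in range(3))
out, w = attention(Q, K, V, causal=True)       # w rows each sum to 1
```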

04

Softmax & Temperature

Softmax converts raw attention logits into a probability distribution. Temperature controls how “sharp” or “flat” that distribution is — lower temperature means the model is more decisive, higher temperature means more uniform attention.

Interactive — Drag the temperature slider
Temperature: 1.00 — slider range: 0.1 (sharp) · 1.0 (standard) · 5.0 (uniform)

Raw logits: The 2.0 · cat 1.0 · sat 0.1 · on −0.5 · a 3.2 · mat 0.8

After softmax (T = 1.00): The 0.192 · cat 0.070 · sat 0.029 · on 0.016 · a 0.636 · mat 0.058

Analogy — Magnifying Glass with Adjustable Focus: Temperature is like the focus ring on a magnifying glass. At low temperature (T=0.1), the lens is focused to a pinpoint — only one token gets nearly all the attention. At high temperature (T=5.0), the lens is defocused and light spreads evenly across everything. The 1/√dk scaling in standard attention is like choosing the default focus that works well for the lens size (dimension).

At T=0.1 the highest logit (3.2 for “a”) dominates — softmax outputs nearly 1.0 for that token. At T=5.0 the distribution flattens and all tokens get roughly equal weight. Standard attention uses T=1.0, but the scaling by 1/√d_k serves a similar purpose.
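You can reproduce the demo's numbers directly. This sketch applies temperature by dividing the logits before the softmax, using the same six logits shown above.

```python
import numpy as np

def softmax(logits, T=1.0):
    z = np.asarray(logits) / T         # temperature scales the logits
    e = np.exp(z - z.max())            # subtract max for numerical stability
    return e / e.sum()

# Logits from the demo: "The cat sat on a mat"
logits = [2.0, 1.0, 0.1, -0.5, 3.2, 0.8]

print(softmax(logits, T=1.0).round(3))  # [0.192 0.07  0.029 0.016 0.636 0.058]
print(softmax(logits, T=0.1))           # "a" takes nearly all the mass
print(softmax(logits, T=5.0))           # close to uniform
```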

05

Multi-Head Attention

Instead of one attention function, transformers run multiple attention heads in parallel. Each head independently learns what to attend to — one might focus on syntax, another on position, another on coreference. Their outputs are concatenated and projected.

Auto-cycling — Click a head to inspect

Head 0: Positional — attends to nearby tokens; captures local context and word order

        The   cat   sat   on    the   mat
The     0.35  0.25  0.15  0.10  0.08  0.07
cat     0.22  0.30  0.25  0.10  0.07  0.06
sat     0.10  0.25  0.30  0.20  0.08  0.07
on      0.08  0.10  0.22  0.30  0.20  0.10
the     0.06  0.07  0.10  0.22  0.35  0.20
mat     0.05  0.06  0.08  0.12  0.25  0.44

In GPT-3, there are 96 attention heads per layer across 96 layers. Each head operates on a 128-dimensional subspace (12288 / 96 = 128). Research shows that different heads reliably specialize — some for syntax, some for coreference, some for positional patterns — without being explicitly trained to do so.
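The split-compute-concatenate pattern can be sketched in NumPy. Dimensions here are toy-sized (16 model dims, 4 heads) but mirror GPT-3's 12288 / 96 = 128 split; the weights are random stand-ins for learned parameters.

```python
import numpy as np

d_model, n_heads, seq = 16, 4, 6
d_k = d_model // n_heads                       # 4 dims per head (cf. 12288/96=128)

rng = np.random.default_rng(0)
X = rng.normal(size=(seq, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) for _ in range(4))

def split_heads(M):                            # (seq, d_model) -> (heads, seq, d_k)
    return M.reshape(seq, n_heads, d_k).transpose(1, 0, 2)

Q, K, V = split_heads(X @ Wq), split_heads(X @ Wk), split_heads(X @ Wv)

scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # (heads, seq, seq)
e = np.exp(scores - scores.max(-1, keepdims=True))
weights = e / e.sum(-1, keepdims=True)             # each head: its own pattern
heads = weights @ V                                # (heads, seq, d_k)

out = heads.transpose(1, 0, 2).reshape(seq, d_model) @ Wo   # concat + project
```

Each head sees only its own d_k-dimensional slice, which is why heads are free to specialize independently.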

06

Masked (Causal) Attention

In autoregressive models like GPT, each token can only attend to tokens that came before it (and itself). This is enforced by masking — setting future positions to −∞ before softmax, which zeroes them out.

Interactive — Toggle the causal mask
Causal mask: ON

        I     love  this  game  !
I       1.00  −∞    −∞    −∞    −∞
love    0.23  0.77  −∞    −∞    −∞
this    0.16  0.21  0.63  −∞    −∞
game    0.08  0.11  0.19  0.62  −∞
!       0.09  0.07  0.13  0.15  0.56

Why mask?

During generation, the model predicts one token at a time. It would be “cheating” if it could see future tokens. The mask enforces this causal constraint during both training and inference.

Effect on weights

With the mask on, token 0 (“I”) can only attend to itself → weight 1.00. Token 4 (“!”) can attend to all 5 tokens, so its weights are spread across the full context.

Analogy — Reading a Mystery Novel: Causal masking forces the model to read like a human reads a mystery — one page at a time, never peeking ahead. When predicting what comes after “The butler entered the,” the model cannot see that the next word is “kitchen.” This constraint is what makes autoregressive generation possible: the model must genuinely predict each word using only what came before, just like you'd guess the murderer before the last chapter.

Toggle the mask off to see how the attention weights redistribute. Early tokens change the most — with the mask on, token 0 (“I”) can attend only to itself, but without the mask every token is free to look at all positions.
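The mask toggle amounts to one line of code: set every strictly-future score to −∞ before the softmax, so those positions come out as exactly zero. A minimal sketch with random scores for a 5-token sequence:

```python
import numpy as np

def attn_weights(scores, causal=True):
    s = scores.astype(float).copy()
    if causal:
        n = s.shape[0]
        s[np.triu_indices(n, k=1)] = -np.inf   # mask strictly-future positions
    e = np.exp(s - s.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.normal(size=(5, 5))               # 5 tokens: "I love this game !"

w_on = attn_weights(scores, causal=True)
print(w_on[0])                                 # token 0 attends only to itself:
                                               # [1. 0. 0. 0. 0.]
w_off = attn_weights(scores, causal=False)     # weights spread over all 5 tokens
```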

07

Cross-Attention

In encoder-decoder models, the decoder needs to look at the encoder's output to ground its predictions. Cross-attention does this: the Query comes from the decoder, while Keys and Values come from the encoder.

Encoder → Decoder flow

Encoder Output (French): Le · chat · dort — provides K and V

Decoder (English) — Click a token — provides Q

Cross-attention weights — decoder token “cat” attending to encoder outputs:
Le 0.10 · chat 0.75 · dort 0.15

Use Cases

  • Translation: decoder attends to source language tokens
  • Image captioning: decoder attends to image patch embeddings
  • Speech recognition: decoder attends to audio features

Click “cat” in the decoder — it attends 0.75 to “chat” (the French translation). Cross-attention creates the bridge between the two sequences, letting the decoder selectively read from the encoder's representation.
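Cross-attention is the same computation as self-attention with one change of wiring: Q is projected from the decoder states, K and V from the encoder output. A minimal sketch with random stand-in values (3 encoder tokens, 4 decoder tokens):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
enc = rng.normal(size=(3, d))     # encoder output, e.g. "Le chat dort"
dec = rng.normal(size=(4, d))     # decoder states for the English side

Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Q = dec @ Wq                      # queries come from the DECODER
K, V = enc @ Wk, enc @ Wv         # keys and values come from the ENCODER

scores = Q @ K.T / np.sqrt(d)     # (4 decoder tokens) x (3 encoder tokens)
e = np.exp(scores - scores.max(-1, keepdims=True))
weights = e / e.sum(-1, keepdims=True)
out = weights @ V                 # each decoder token reads from the source
```

No causal mask is needed here: the entire source sequence is already known, so every decoder token may look at all encoder positions.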

08

Flash Attention

Standard attention materializes the full N×N score matrix in GPU memory (HBM), which is O(N²). Flash Attention tiles the computation so that it fits in fast on-chip SRAM, dramatically reducing memory reads/writes while producing mathematically identical results.

Memory comparison — sequence length: 4,096

Standard Attention: 32 MB — O(N²), stores the full 4096×4096 attention matrix
Flash Attention: 3 MB — O(N), only Q/K/V tiles live in SRAM at a time
→ 10.7× memory reduction
Hardware perspective

HBM (High Bandwidth Memory): ~2 TB/s bandwidth, ~80 GB capacity (A100)
Standard attention reads Q and K, writes the N×N scores to HBM, reads them back, and writes the output. This IO dominates runtime.

SRAM (On-Chip Static RAM): ~19 TB/s bandwidth, ~20 MB capacity (A100)
Flash Attention tiles Q, K, V into blocks that fit in SRAM, computes attention within each tile, and accumulates the output — never materializing the full N×N matrix.

Flash Attention doesn't approximate anything — it computes the exact same output as standard attention. The speedup comes entirely from being IO-aware: minimizing reads and writes to slow HBM by keeping data in fast SRAM through a careful tiling and recomputation strategy.
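The core trick — an online softmax over K/V tiles — can be shown in miniature. This NumPy sketch is a simplified model of the algorithm (no GPU, no parallelism), but it demonstrates the key property: the tiled version never forms the full N×N matrix, yet its output matches standard attention exactly.

```python
import numpy as np

def standard_attention(Q, K, V):
    s = Q @ K.T / np.sqrt(Q.shape[-1])          # full N x N score matrix
    e = np.exp(s - s.max(-1, keepdims=True))
    return (e / e.sum(-1, keepdims=True)) @ V

def tiled_attention(Q, K, V, block=32):
    d = Q.shape[-1]
    m = np.full(Q.shape[0], -np.inf)            # running row max
    l = np.zeros(Q.shape[0])                    # running softmax denominator
    acc = np.zeros_like(Q)                      # running weighted sum of V
    for j in range(0, K.shape[0], block):       # one K/V tile at a time
        s = Q @ K[j:j+block].T / np.sqrt(d)     # scores for this tile only
        m_new = np.maximum(m, s.max(-1))
        scale = np.exp(m - m_new)               # rescale earlier accumulators
        p = np.exp(s - m_new[:, None])
        l = l * scale + p.sum(-1)
        acc = acc * scale[:, None] + p @ V[j:j+block]
        m = m_new
    return acc / l[:, None]                     # finish the softmax at the end

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(128, 16)) for _ in range(3))
assert np.allclose(standard_attention(Q, K, V), tiled_attention(Q, K, V))
```

The real kernel also fuses the backward-pass recomputation and tiles Q as well; this sketch covers only the forward online-softmax accumulation.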

09

Attention Patterns

Not all attention is the same. Different architectures use different attention patterns to balance quality and efficiency. Research also reveals that trained heads develop recognizable patterns — like induction heads that enable in-context learning.

Pattern types — Auto-cycling

Each token attends to a fixed window of nearby tokens. Used in Longformer and BigBird for efficient long-context processing.

[Grid: 8×8 attention pattern over tokens t0–t7, shaded from no attention through weak to strong — each token attends only to a window of nearby positions]
Induction heads are one of the most important discoveries in mechanistic interpretability. They allow models to “copy” from context — if the model sees “A B ... A”, an induction head learns to predict “B” will follow. This is a core mechanism behind in-context learning.

10

Beyond Standard Attention

Standard multi-head attention is expensive at inference time because every head maintains its own KV-cache. Modern variants like MQA and GQA reduce this cost by sharing Key and Value heads across multiple Query heads.

Variant comparison — Auto-cycling

Multi-Head (MHA)

The original design. Every attention head has its own Q, K, and V projections.

Query heads: 32 · KV heads: 32 — each Q head has its own dedicated KV head

KV-Cache calculator — sequence length: 4,096 (slider: 1K to 128K)

d_model = 4096, d_k = 128, 32 layers, fp16 — per-sequence KV-cache size:

Multi-Head (MHA): 2.0 GB · Multi-Query (MQA): 64 MB · Grouped-Query (GQA): 512 MB

Also Worth Knowing

  • Sliding Window (Mistral): limits context to a fixed window, evicting old KV entries
  • Sparse Attention: combines local windows with global tokens for sub-quadratic cost
  • Ring Attention: distributes long sequences across devices with overlapping communication

At 128K sequence length, MHA requires 64.0 GB of KV-cache per sequence, while GQA needs only 16.0 GB — a 4× reduction. This is why nearly all modern LLMs (Llama 3, Mistral, Gemma 2) use GQA.
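The calculator's numbers follow from one multiplication. This sketch uses the article's configuration (d_k = 128, 32 layers, fp16); the GQA figure implies 8 KV heads, which is inferred from the 512 MB value rather than stated explicitly.

```python
# KV-cache bytes per sequence: K and V (x2), cached at every layer,
# for every KV head, d_k values per head, one entry per token, fp16 (2 bytes).
def kv_cache_bytes(seq_len, n_kv_heads, n_layers=32, d_k=128, bytes_per=2):
    return 2 * n_layers * n_kv_heads * d_k * seq_len * bytes_per

GiB = 1024**3
print(kv_cache_bytes(4096, n_kv_heads=32) / GiB)    # MHA: 2.0 GB
print(kv_cache_bytes(4096, n_kv_heads=1) / GiB)     # MQA: 0.0625 GB = 64 MB
print(kv_cache_bytes(4096, n_kv_heads=8) / GiB)     # GQA: 0.5 GB = 512 MB
print(kv_cache_bytes(131072, n_kv_heads=32) / GiB)  # MHA at 128K: 64.0 GB
```

Note that the cache scales linearly with the number of KV heads, which is exactly the knob MQA and GQA turn down.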

Building with Attention?

I architect enterprise AI systems — RAG pipelines, multi-agent orchestration, custom copilots, and the infrastructure to run them reliably at scale.

Get In Touch