UNDER THE HOOD

Probability & Sampling

The mathematical backbone of language generation. From probability distributions and Bayes' theorem to softmax, sampling strategies, and why LLMs sometimes hallucinate with high confidence.

25 min read · By Hammad Abbasi · Interactive
01

Probability Foundations

Before we can understand how language models choose the next token, we need the language of probability. Every prediction an LLM makes is a probability distribution over its vocabulary — tens of thousands of possible next tokens, each with a number between 0 and 1 that says how likely it is.

discrete probability distribution

A fair six-sided die: faces 1–6, each with P(x) = 16.7%.

P(x) sums to: 1.000 · E[X] = 3.50 · Distribution: Uniform
probability mass function

A Probability Mass Function (PMF) assigns a probability to each possible outcome of a discrete random variable. For a fair six-sided die, each face has P = 1/6. For an LLM, the “die” has 50,000+ faces — one per vocabulary token.

PMF PROPERTIES

  • 0 ≤ P(x) ≤ 1 for all x
  • Σ P(x) = 1
  • P(A ∪ B) = P(A) + P(B) for disjoint A, B

EXPECTED VALUE

E[X] = Σ x · P(x)

The “center of mass” of the distribution. For a fair die it's 3.5 — no single face, but the long-run average.
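As a quick check, the die's PMF and its expected value can be computed in a few lines of Python (variable names are my own):

```python
# PMF of a fair six-sided die: each face has probability 1/6.
pmf = {face: 1 / 6 for face in range(1, 7)}

# PMF property: probabilities sum to 1.
total = sum(pmf.values())

# Expected value: E[X] = sum of x * P(x) over all outcomes.
expected_value = sum(x * p for x, p in pmf.items())
print(expected_value)  # 3.5 — the long-run average roll
```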

Analogy — Weighted Dice: Think of an LLM's vocabulary as a massive weighted die. Each face is a token, and the weights are the model's learned probabilities. “Rolling” this die is sampling — and how we read the result (greedy, top-k, top-p) determines the text we get.

02

Conditional Probability & Bayes

Language models are fundamentally conditional probability machines. The probability of the next token depends on everything that came before it. Bayes' theorem gives us the mathematical framework for updating beliefs given new evidence — the same logic LLMs use at every step.

bayes' theorem calculator

  • P(Disease) — base rate: 1.0%
  • P(+ | Disease) — sensitivity: 95.0%
  • P(+ | No Disease) — false positive rate: 5.0%
  • P(Disease | Positive Test): 16.1%

Even with a 95% sensitive test, a 1% base rate means most positives are false alarms.

probability tree
bayes' formula

P(A|B) = P(B|A) · P(A) / P(B)

  • P(A) — Prior: what we believe before seeing evidence
  • P(B|A) — Likelihood: how likely the evidence is if A is true
  • P(A|B) — Posterior: updated belief after seeing evidence

Analogy — Medical Test Accuracy: A rare disease (1% prevalence) with a 95% accurate test still produces mostly false positives. The same intuition applies to LLMs: a token might have high conditional probability in context, but that doesn't mean it's factually correct. Context (prior) matters enormously.
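The calculator's 16.1% result follows directly from Bayes' formula. A minimal sketch (function and argument names are my own):

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(A | evidence) via Bayes' theorem, for a binary hypothesis.

    P(+) is expanded by the law of total probability:
    P(+) = P(+|Disease) P(Disease) + P(+|No Disease) P(No Disease)
    """
    p_evidence = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / p_evidence

# The medical-test numbers from the calculator above:
p = posterior(prior=0.01, sensitivity=0.95, false_positive_rate=0.05)
print(f"{p:.1%}")  # 16.1%
```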

03

The Chain Rule for Sequences

A language model computes the probability of a sentence by breaking it into a chain of conditional probabilities — each token's likelihood given everything before it. This is the chain rule of probability applied to sequences.

chain rule decomposition

Click tokens to reveal conditional probabilities:

P(“The cat sat on the mat”) = P(The) × P(cat|The) × P(sat|The cat) × P(on|The cat sat) × P(the|The cat sat on) × P(mat|The cat sat on the)

The · cat · sat · on · the · mat

Joint probability: the running product of the conditionals above

Joint probabilities of sentences are astronomically small — that's why models work in log-space. Adding log-probabilities is numerically stable: log P(sentence) = Σ log P(token_i | context). This is also why perplexity (Section 05) uses the geometric mean rather than the raw product.
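A sketch of the log-space trick, using made-up conditional probabilities (the real per-token values depend on the model):

```python
import math

# Hypothetical conditionals for a six-token sentence, one per chain-rule factor.
cond_probs = [0.2, 0.05, 0.1, 0.3, 0.6, 0.4]

# Multiplying many small probabilities underflows for long sequences;
# summing log-probabilities is numerically stable.
log_p = sum(math.log(p) for p in cond_probs)
joint = math.exp(log_p)
print(joint)  # 7.2e-05 — already tiny after six tokens
```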

04

Information Theory

Information theory gives us the tools to measure uncertainty, compare distributions, and quantify how well a model has learned the true data distribution. These metrics — entropy, cross-entropy, and KL divergence — are the foundation of LLM training objectives.

distribution P (true)

  • P(A) = 0.250 · P(B) = 0.250 · P(C) = 0.250 · P(D) = 0.250

distribution Q (model)

  • Q(A) = 0.400 · Q(B) = 0.300 · Q(C) = 0.200 · Q(D) = 0.100

metrics

  • H(P) — Entropy: 2.000 — intrinsic uncertainty of P
  • H(P,Q) — Cross-Entropy: 2.176 — cost of using Q to encode P
  • KL(P‖Q) — Divergence: 0.176 — extra bits wasted by Q vs P

H(P,Q) = H(P) + KL(P‖Q) → 2.000 + 0.176 = 2.176

Analogy — Information as Surprise: Information is the “surprise value” of an event. A coin flip (entropy = 1 bit) is mildly surprising. A fair 6-sided die (entropy ≈ 2.58 bits) is more surprising. A heavily loaded die (low entropy) is predictable and boring. LLM training minimizes cross-entropy — teaching the model to be as unsurprised by real text as possible.
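The three metrics above can be verified directly from the P and Q tables:

```python
import math

P = [0.25, 0.25, 0.25, 0.25]  # true distribution
Q = [0.40, 0.30, 0.20, 0.10]  # model distribution

# Entropy: intrinsic uncertainty of P, in bits.
entropy = -sum(p * math.log2(p) for p in P)

# Cross-entropy: cost of encoding samples from P using Q's code.
cross_entropy = -sum(p * math.log2(q) for p, q in zip(P, Q))

# KL divergence: the extra bits wasted, H(P,Q) - H(P).
kl = cross_entropy - entropy

print(round(entropy, 3), round(cross_entropy, 3), round(kl, 3))
# 2.0 2.176 0.176
```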

05

Perplexity

Perplexity is the standard metric for evaluating language models. It measures how “surprised” a model is by a sequence of text — lower perplexity means the model assigns higher probabilities to the tokens it actually sees.

perplexity calculator

  • P(“The” | context) = 0.10
  • P(“quick” | context) = 0.30
  • P(“brown” | context) = 0.50
  • P(“fox” | context) = 0.20
  • P(“jumps” | context) = 0.40

Avg log₂ P = −1.941 · Perplexity = 3.8 · Formula: 2^(−avg log₂ P)
model comparison

  • GPT-2 (117M): 29.4
  • GPT-2 (1.5B): 18.3
  • GPT-3 (175B): 10.8
  • GPT-4 class: 5.5

Lower perplexity = better model. Each generation roughly halves perplexity.
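The perplexity calculation is short enough to verify by hand, here with the five illustrative token probabilities from the calculator:

```python
import math

# Per-token probabilities the model assigned to the observed tokens.
probs = [0.10, 0.30, 0.50, 0.20, 0.40]

# Average log2 probability, then perplexity = 2^(-avg).
avg_log2 = sum(math.log2(p) for p in probs) / len(probs)
perplexity = 2 ** (-avg_log2)
print(round(avg_log2, 3), round(perplexity, 1))  # -1.941 3.8
```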

Analogy — Average Choices: Perplexity of 10 means the model is, on average, as confused as if it had to pick uniformly from 10 equally likely options at each step. A perplexity of 1 would mean perfect prediction. Real models achieve perplexities in the 5–30 range depending on the dataset and model size.

06

Softmax In Depth

The softmax function is the bridge between raw model outputs (logits) and a valid probability distribution. It takes a vector of arbitrary real numbers and transforms them into positive values that sum to 1, preserving relative ordering while amplifying differences.

softmax formula

softmax(zᵢ) = exp(zᵢ / T) / Σⱼ exp(zⱼ / T)

where T is the temperature and z is the logit vector

temperature control
Temperature (T)1.00
0.1 — Confident1.0 — Neutral2.0 — Creative
logits (raw) → softmax output (T = 1.0)

Token    Logit    P(token)
the      5.2      58.6%
a        4.1      19.5%
his      3.8      14.4%
their    2.5      3.9%
my       1.9      2.2%
one      0.8      0.7%
no       0.3      0.4%
every    −0.5     0.2%

Sum: 1.000

In a real model, the vocabulary has 50,000+ tokens. The vast majority receive near-zero probability after softmax. Only a handful of tokens at the top of the distribution matter for generation.

Analogy — Volume Knob: Temperature is like a volume knob on a radio. At T=0.1, you turn the dial to “max contrast” — the top token dominates and output is nearly deterministic. At T=2.0, the signal spreads out across many tokens and the output becomes creative (or noisy). T=1.0 is the model's native voice.
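A minimal temperature-scaled softmax, using the eight logits from the table above (the max-subtraction step is a standard numerical-stability trick, not shown in the formula):

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max before exp to avoid overflow
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [5.2, 4.1, 3.8, 2.5, 1.9, 0.8, 0.3, -0.5]
probs = softmax(logits)
print(f"{probs[0]:.1%}")  # 58.6% — the top token "the"

# Lower temperature sharpens the distribution toward the top token:
sharp = softmax(logits, temperature=0.5)
```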

07

Sampling Strategies

Once softmax gives us a probability distribution, we need to decide how to pick from it. Different sampling strategies trade off between quality, diversity, and creativity — each with distinct characteristics suited to different use cases.

prompt: "The capital of France is ___"

  • Paris: 35.0% (SELECTED)
  • London: 18.0%
  • Berlin: 12.0%
  • Rome: 9.0%
  • Madrid: 7.0%
  • Tokyo: 5.0%
  • Seoul: 4.0%
  • Cairo: 3.0%
  • Lima: 2.5%
  • Oslo: 2.0%
  • Baku: 1.5%
  • Doha: 1.0%

Eligible tokens: 1 / 12

Greedy decoding always picks the most likely token — deterministic but repetitive. Top-k and top-p filter the distribution before sampling, balancing quality and diversity. Min-p is a newer strategy that scales the cutoff relative to the top token, adapting naturally to confident vs uncertain distributions.
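A sketch of greedy, top-k, and top-p selection over the candidate distribution above (helper names are my own; real implementations work on logits over full vocabularies):

```python
import random

def top_k_filter(dist, k):
    """Keep only the k most likely tokens, then renormalize."""
    top = sorted(dist.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {t: p / total for t, p in top}

def top_p_filter(dist, p_threshold):
    """Keep the smallest set whose cumulative mass reaches p_threshold."""
    kept, cum = {}, 0.0
    for t, p in sorted(dist.items(), key=lambda kv: kv[1], reverse=True):
        kept[t] = p
        cum += p
        if cum >= p_threshold:
            break
    total = sum(kept.values())
    return {t: p / total for t, p in kept.items()}

dist = {"Paris": 0.35, "London": 0.18, "Berlin": 0.12, "Rome": 0.09,
        "Madrid": 0.07, "Tokyo": 0.05, "Seoul": 0.04, "Cairo": 0.03,
        "Lima": 0.025, "Oslo": 0.02, "Baku": 0.015, "Doha": 0.01}

greedy = max(dist, key=dist.get)          # always "Paris"
nucleus = top_p_filter(dist, 0.9)         # top-p (nucleus) candidate set
sampled = random.choices(list(nucleus), weights=nucleus.values())[0]
print(greedy, sampled)
```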

08

Temperature & Penalties

Temperature and penalties are the knobs that shape the probability distribution before sampling. They operate on the raw logit vector — adjusting, suppressing, and redistributing probability mass to control the character of generated text.

temperature (T = 1.00)

Divides all logits before softmax. Low = sharp, high = flat.

repetition penalty (Rep = 1.00)

Divides logits of previously seen tokens.

frequency penalty (Freq = 0.00)

Subtracts penalty × occurrence count from logits.

logit modification pipeline

Token    Raw Logit    Occurrences    Modified    P(token)
the      5.2          3              5.20        41.2%
cat      4.8          2              4.80        27.6%
sat      4.1          0              4.10        13.7%
on       3.6          1              3.60        8.3%
mat      3.0          0              3.00        4.6%
floor    2.4          0              2.40        2.5%
and      1.8          1              1.80        1.4%
ran      1.2          0              1.20        0.8%

Analogy — Mixing Board: Think of these parameters as knobs on a mixing board. Temperature is the master volume — it affects every channel equally. Repetition penalty is a compressor that specifically attenuates channels that have already been loud. Frequency penalty is a gate that progressively silences repeat offenders.
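A sketch of the whole pipeline under one common convention for the repetition penalty (divide positive logits of seen tokens, multiply negative ones); exact penalty semantics vary across libraries, so treat this as illustrative:

```python
import math

def adjust_logits(logits, counts, temperature=1.0, rep_penalty=1.0, freq_penalty=0.0):
    """Apply repetition penalty, frequency penalty, then temperature."""
    out = {}
    for tok, z in logits.items():
        n = counts.get(tok, 0)
        if n > 0:  # repetition penalty hits any previously seen token
            z = z / rep_penalty if z > 0 else z * rep_penalty
        z -= freq_penalty * n          # frequency penalty scales with count
        out[tok] = z / temperature     # temperature scales everything
    return out

def softmax(logits):
    m = max(logits.values())
    exps = {t: math.exp(z - m) for t, z in logits.items()}
    total = sum(exps.values())
    return {t: e / total for t, e in exps.items()}

logits = {"the": 5.2, "cat": 4.8, "sat": 4.1, "on": 3.6,
          "mat": 3.0, "floor": 2.4, "and": 1.8, "ran": 1.2}
counts = {"the": 3, "cat": 2, "on": 1, "and": 1}

# With all knobs at their defaults, this reproduces the table above.
base = softmax(adjust_logits(logits, counts))
print(f"{base['the']:.1%}")  # 41.2%

# A repetition penalty of 1.3 suppresses the already-seen tokens:
penalized = softmax(adjust_logits(logits, counts, rep_penalty=1.3))
```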

10

Calibration & Hallucination

A model's softmax probability is not a measure of factual truth — it's a measure of statistical pattern match. High confidence and high accuracy are not the same thing, and this gap is at the heart of the hallucination problem in modern LLMs.

calibration curve

[Interactive chart: Model Confidence (x-axis) vs Actual Accuracy (y-axis), comparing perfect calibration against a typical LLM]

OVERCONFIDENCE GAP

Confidence    Gap
60%           11%
70%           17%
80%           22%
90%           26%
95%           26%
Gap between confidence and actual accuracy widens at high confidence levels.
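A toy calibration check in the spirit of expected calibration error (ECE); the accuracies are illustrative values chosen so that the gaps match the table above, not results from a real evaluation:

```python
# Each bucket pairs the model's stated confidence with measured accuracy.
buckets = [
    (0.60, 0.49),
    (0.70, 0.53),
    (0.80, 0.58),
    (0.90, 0.64),
    (0.95, 0.69),
]

for conf, acc in buckets:
    print(f"confidence {conf:.0%} -> accuracy {acc:.0%} (gap {conf - acc:.0%})")

# Expected calibration error: average confidence-accuracy gap.
# (Real ECE weights each bucket by its sample count; unweighted here.)
ece = sum(conf - acc for conf, acc in buckets) / len(buckets)
print(round(ece, 3))
```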

confidence vs correctness

Click a claim to reveal whether it's actually correct:

why this matters

SOFTMAX ≠ TRUTH

Softmax probabilities measure pattern match likelihood against training data, not factual accuracy. A model can be 99% confident about something completely false.

TRAINING BIAS

Models learn from internet text where common misconceptions appear frequently. Popular myths get high probability because they appear often, not because they're true.

MITIGATION

RAG (retrieval-augmented generation), chain-of-thought prompting, and calibration fine-tuning can reduce but not eliminate the overconfidence problem.

Analogy — Confident Student: Imagine a student who always raises their hand first and speaks with total conviction — but is wrong 30% of the time. That's an LLM with poor calibration. The confidence in their voice (softmax probability) doesn't correlate with actual knowledge. The most dangerous errors are the ones delivered with the highest confidence.

Building with LLMs?

I architect enterprise AI systems — RAG pipelines, multi-agent orchestration, custom copilots, and the infrastructure to run them reliably at scale.

Get In Touch