Probability Foundations
Before we can understand how language models choose the next token, we need the language of probability. Every prediction an LLM makes is a probability distribution over its vocabulary — tens of thousands of possible next tokens, each with a number between 0 and 1 that says how likely it is.
A Probability Mass Function (PMF) assigns a probability to each possible outcome of a discrete random variable. For a fair six-sided die, each face has P = 1/6. For an LLM, the “die” has 50,000+ faces — one per vocabulary token.
PMF PROPERTIES
- 0 ≤ P(x) ≤ 1 for all x
- Σ P(x) = 1
- P(A ∪ B) = P(A) + P(B) for disjoint A, B
EXPECTED VALUE
E[X] = Σ x · P(x)
The “center of mass” of the distribution. For a fair die it's 3.5 — no single face, but the long-run average.
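The die example can be checked in a few lines. A minimal sketch in Python (variable names are illustrative):

```python
# Fair six-sided die from the text: each face has P = 1/6.
die_pmf = {face: 1 / 6 for face in range(1, 7)}

# PMF properties: every probability lies in [0, 1] and they sum to 1.
assert all(0.0 <= p <= 1.0 for p in die_pmf.values())
assert abs(sum(die_pmf.values()) - 1.0) < 1e-9

# Expected value: E[X] = sum of x * P(x), the "center of mass".
expected = sum(x * p for x, p in die_pmf.items())
print(expected)  # 3.5, even though no face shows 3.5
```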
Analogy — Weighted Dice: Think of an LLM's vocabulary as a massive weighted die. Each face is a token, and the weights are the model's learned probabilities. “Rolling” this die is sampling — and how we read the result (greedy, top-k, top-p) determines the text we get.
Conditional Probability & Bayes
Language models are fundamentally conditional probability machines. The probability of the next token depends on everything that came before it. Bayes' theorem gives us the mathematical framework for updating beliefs given new evidence — the same logic LLMs use at every step.
Even with a 95% sensitive test, a 1% base rate means most positives are false alarms.
P(A|B) = P(B|A) · P(A) / P(B)
Analogy — Medical Test Accuracy: A rare disease (1% prevalence) with a 95% accurate test still produces mostly false positives. The same intuition applies to LLMs: a token might have high conditional probability in context, but that doesn't mean it's factually correct. Context (prior) matters enormously.
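The arithmetic behind the analogy is worth seeing once. A sketch, assuming the "95% accurate" test means 95% sensitivity and 95% specificity (that split is an assumption; the text gives only a single accuracy figure):

```python
prevalence = 0.01    # P(disease), the 1% base rate
sensitivity = 0.95   # P(positive | disease)
specificity = 0.95   # P(negative | no disease), assumed equal to sensitivity

# Law of total probability: P(positive) over both sick and healthy people.
p_positive = (sensitivity * prevalence
              + (1 - specificity) * (1 - prevalence))

# Bayes' theorem: P(disease | positive) = P(positive | disease) * P(disease) / P(positive)
p_disease_given_positive = sensitivity * prevalence / p_positive
print(f"{p_disease_given_positive:.1%}")  # ~16%: most positives are false alarms
```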
The Chain Rule for Sequences
A language model computes the probability of a sentence by breaking it into a chain of conditional probabilities — each token's likelihood given everything before it. This is the chain rule of probability applied to sequences.
P(“The cat sat on the mat”) = P(The) × P(cat|The) × P(sat|The cat) × P(on|The cat sat) × P(the|The cat sat on) × P(mat|The cat sat on the)
Joint probabilities of sentences are astronomically small — that's why models work in log-space. Adding log-probabilities is numerically stable: log P(sentence) = Σ log P(token_i | context). This is also why perplexity (Section 05) uses the geometric mean rather than the raw product.
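A quick sketch of why log-space matters, using made-up per-token probabilities:

```python
import math

# 600 made-up conditional probabilities, standing in for a long passage.
token_probs = [0.2, 0.1, 0.05, 0.3, 0.4, 0.25] * 100

# Raw product: underflows to exactly 0.0 in double precision.
product = 1.0
for p in token_probs:
    product *= p
print(product)  # 0.0

# Sum of logs: a perfectly representable number (about -1041).
log_prob = sum(math.log(p) for p in token_probs)
print(log_prob)
```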
Information Theory
Information theory gives us the tools to measure uncertainty, compare distributions, and quantify how well a model has learned the true data distribution. These metrics — entropy, cross-entropy, and KL divergence — are the foundation of LLM training objectives.
Entropy H(P): intrinsic uncertainty of P
Cross-entropy H(P,Q): cost of using Q to encode P
KL divergence KL(P‖Q): extra bits wasted by Q vs P
H(P,Q) = H(P) + KL(P‖Q) = 2.000 + 0.176 = 2.176
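One pair of distributions that reproduces the figures above: P uniform over four outcomes (so H(P) = 2 bits) and Q = [0.4, 0.3, 0.2, 0.1]. This particular P and Q are an assumption for illustration; any pair with the same entropy and divergence would do:

```python
import math

P = [0.25, 0.25, 0.25, 0.25]  # uniform: maximum uncertainty over 4 outcomes
Q = [0.40, 0.30, 0.20, 0.10]  # a skewed model of P

entropy = -sum(p * math.log2(p) for p in P)                   # H(P): intrinsic uncertainty of P
cross_entropy = -sum(p * math.log2(q) for p, q in zip(P, Q))  # H(P,Q): cost of using Q to encode P
kl = sum(p * math.log2(p / q) for p, q in zip(P, Q))          # KL(P||Q): extra bits wasted by Q

print(f"H(P)     = {entropy:.3f} bits")       # 2.000
print(f"KL(P||Q) = {kl:.3f} bits")            # 0.176
print(f"H(P,Q)   = {cross_entropy:.3f} bits") # 2.176 = H(P) + KL(P||Q)
```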
Analogy — Information as Surprise: Information is the “surprise value” of an event. A coin flip (entropy = 1 bit) is mildly surprising. A fair 6-sided die (entropy ≈ 2.58 bits) is more surprising. A heavily loaded die (low entropy) is predictable and boring. LLM training minimizes cross-entropy — teaching the model to be as unsurprised by real text as possible.
Perplexity
Perplexity is the standard metric for evaluating language models. It measures how “surprised” a model is by a sequence of text — lower perplexity means the model assigns higher probabilities to the tokens it actually sees.
Lower perplexity = better model. Successive model generations have roughly halved perplexity on standard benchmarks.
Analogy — Average Choices: Perplexity of 10 means the model is, on average, as confused as if it had to pick uniformly from 10 equally likely options at each step. A perplexity of 1 would mean perfect prediction. Real models achieve perplexities in the 5–30 range depending on the dataset and model size.
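Concretely, perplexity is the exponential of the average negative log-likelihood: PPL = exp(-(1/N) Σ log P(token_i | context)). A sketch with made-up probabilities:

```python
import math

def perplexity(token_probs):
    """PPL = exp(-(1/N) * sum(log P(token_i | context)))."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# Uniform 1/10 probability at every step: perplexity of exactly 10,
# i.e. "as confused as picking from 10 equally likely options".
print(perplexity([0.1] * 20))  # 10.0 (up to float rounding)

# Perfect prediction at every step: perplexity of 1.
print(perplexity([1.0] * 20))  # 1.0
```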
Softmax In Depth
The softmax function is the bridge between raw model outputs (logits) and a valid probability distribution. It takes a vector of arbitrary real numbers and transforms them into positive values that sum to 1, preserving relative ordering while amplifying differences.
softmax(z_i) = exp(z_i / T) / Σ_j exp(z_j / T)
where T is the temperature and z is the logit vector
In a real model, the vocabulary has 50,000+ tokens. The vast majority receive near-zero probability after softmax. Only a handful of tokens at the top of the distribution matter for generation.
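A minimal temperature-scaled softmax in Python. Subtracting the maximum logit before exponentiating is a standard numerical-stability trick; it leaves the result unchanged because softmax is shift-invariant. The logit values are illustrative:

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)                          # subtract the max to avoid overflow
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [5.2, 4.8, 4.1]  # made-up logits for three tokens
for t in (0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax(logits, t)])
# Lower T sharpens the distribution; higher T flattens it.
```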
Analogy — Volume Knob: Temperature is like a volume knob on a radio. At T=0.1, you turn the dial to “max contrast” — the top token dominates and output is nearly deterministic. At T=2.0, the signal spreads out across many tokens and the output becomes creative (or noisy). T=1.0 is the model's native voice.
Sampling Strategies
Once softmax gives us a probability distribution, we need to decide how to pick from it. Different sampling strategies trade off between quality, diversity, and creativity — each with distinct characteristics suited to different use cases.
Greedy decoding always picks the most likely token — deterministic but repetitive. Top-k and top-p filter the distribution before sampling, balancing quality and diversity. Min-p is a newer strategy that scales the cutoff relative to the top token, adapting naturally to confident vs uncertain distributions.
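A sketch of three of these strategies over a toy distribution (token names and probabilities are made up; min-p is omitted for brevity):

```python
import random

dist = {"the": 0.412, "cat": 0.276, "sat": 0.137, "on": 0.083,
        "mat": 0.046, "floor": 0.025, "and": 0.014, "ran": 0.007}

def greedy(dist):
    # Always the argmax: deterministic, prone to repetition.
    return max(dist, key=dist.get)

def top_k(dist, k, rng=random):
    # Keep the k most likely tokens; choices() renormalizes the weights.
    items = sorted(dist.items(), key=lambda kv: kv[1], reverse=True)[:k]
    tokens, weights = zip(*items)
    return rng.choices(tokens, weights=weights)[0]

def top_p(dist, p, rng=random):
    # Keep the smallest prefix of tokens whose cumulative mass reaches p.
    items = sorted(dist.items(), key=lambda kv: kv[1], reverse=True)
    kept, cum = [], 0.0
    for token, prob in items:
        kept.append((token, prob))
        cum += prob
        if cum >= p:
            break
    tokens, weights = zip(*kept)
    return rng.choices(tokens, weights=weights)[0]

print(greedy(dist))        # "the"
print(top_k(dist, k=3))    # one of: the, cat, sat
print(top_p(dist, p=0.9))  # one of: the, cat, sat, on
```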
Temperature & Penalties
Temperature and penalties are the knobs that shape the probability distribution before sampling. They operate on the raw logit vector — adjusting, suppressing, and redistributing probability mass to control the character of generated text.
Temperature: divides all logits before softmax. Low = sharp, high = flat.
Repetition penalty: divides the logits of previously seen tokens.
Frequency penalty: subtracts penalty × occurrence count from logits.
| Token | Raw Logit | Occurrences | Modified Logit (penalty = 0) | P(token) |
|---|---|---|---|---|
| the | 5.2 | 3 | 5.20 | 41.2% |
| cat | 4.8 | 2 | 4.80 | 27.6% |
| sat | 4.1 | 0 | 4.10 | 13.7% |
| on | 3.6 | 1 | 3.60 | 8.3% |
| mat | 3.0 | 0 | 3.00 | 4.6% |
| floor | 2.4 | 0 | 2.40 | 2.5% |
| and | 1.8 | 1 | 1.80 | 1.4% |
| ran | 1.2 | 0 | 1.20 | 0.8% |
Analogy — Mixing Board: Think of these parameters as knobs on a mixing board. Temperature is the master volume — it affects every channel equally. Repetition penalty is a compressor that specifically attenuates channels that have already been loud. Frequency penalty is a gate that progressively silences repeat offenders.
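Putting the knobs together on the table's raw logits; a sketch assuming a frequency penalty of 0.5 (the penalty value is an arbitrary choice for illustration):

```python
import math

raw = {"the": 5.2, "cat": 4.8, "sat": 4.1, "on": 3.6,
       "mat": 3.0, "floor": 2.4, "and": 1.8, "ran": 1.2}
counts = {"the": 3, "cat": 2, "sat": 0, "on": 1,
          "mat": 0, "floor": 0, "and": 1, "ran": 0}

def adjust(raw, counts, temperature=1.0, freq_penalty=0.5):
    # Frequency penalty: subtract penalty * occurrence count, then apply temperature.
    return {t: (z - freq_penalty * counts[t]) / temperature for t, z in raw.items()}

def softmax(logits):
    m = max(logits.values())
    exps = {t: math.exp(z - m) for t, z in logits.items()}
    total = sum(exps.values())
    return {t: e / total for t, e in exps.items()}

probs = softmax(adjust(raw, counts))
for token, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(f"{token:>6}: {p:.1%}")
# "the" (seen 3 times) now ranks below the unpenalized "sat".
```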
Beam Search vs Sampling
Beam search and sampling represent two fundamentally different approaches to text generation. Beam search systematically explores multiple paths to find the most probable sequence, while sampling introduces randomness for creative, diverse outputs.
| Property | Beam Search | Sampling |
|---|---|---|
| Deterministic? | Yes — same input, same output | No — each run differs |
| Quality | High average quality, can be generic | Variable — sometimes brilliant, sometimes poor |
| Diversity | Low — tends to converge | High — explores the distribution |
| Speed | Slower (parallel beams) | Faster (single path) |
| Best for | Translation, summarization | Creative writing, chatbots, brainstorming |
Most modern chatbots (ChatGPT, Claude) use sampling rather than beam search. The reasoning: for open-ended conversation, diversity and naturalness matter more than finding the single most probable sequence. Beam search is still preferred for tasks like machine translation where fidelity is paramount.
Calibration & Hallucination
A model's softmax probability is not a measure of factual truth — it's a measure of statistical pattern match. High confidence and high accuracy are not the same thing, and this gap is at the heart of the hallucination problem in modern LLMs.
OVERCONFIDENCE GAP
Gap between confidence and actual accuracy widens at high confidence levels.
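The gap can be made concrete with a toy reliability check; the (confidence, correct?) pairs below are fabricated for illustration:

```python
# Each pair: (model confidence in its answer, whether the answer was correct).
predictions = [
    (0.95, True), (0.97, False), (0.99, True), (0.96, False),
    (0.70, True), (0.65, True), (0.72, False), (0.55, False), (0.51, True),
]

def bucket_gap(predictions, lo, hi):
    """Average confidence minus actual accuracy for predictions in [lo, hi)."""
    in_bucket = [(c, ok) for c, ok in predictions if lo <= c < hi]
    if not in_bucket:
        return 0.0
    avg_conf = sum(c for c, _ in in_bucket) / len(in_bucket)
    accuracy = sum(ok for _, ok in in_bucket) / len(in_bucket)
    return avg_conf - accuracy

print(bucket_gap(predictions, 0.9, 1.0))  # ~0.47: ~97% confident, only 50% correct
print(bucket_gap(predictions, 0.5, 0.9))  # ~0.03: reasonably calibrated here
```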
SOFTMAX ≠ TRUTH
Softmax probabilities measure pattern match likelihood against training data, not factual accuracy. A model can be 99% confident about something completely false.
TRAINING BIAS
Models learn from internet text where common misconceptions appear frequently. Popular myths get high probability because they appear often, not because they're true.
MITIGATION
RAG (retrieval-augmented generation), chain-of-thought prompting, and calibration fine-tuning can reduce but not eliminate the overconfidence problem.
Analogy — Confident Student: Imagine a student who always raises their hand first and speaks with total conviction — but is wrong 30% of the time. That's an LLM with poor calibration. The confidence in their voice (softmax probability) doesn't correlate with actual knowledge. The most dangerous errors are the ones delivered with the highest confidence.