Probability Foundations
Before we can understand how language models choose the next token, we need the language of probability. Every prediction an LLM makes is a probability distribution over its vocabulary — tens of thousands of possible next tokens, each with a number between 0 and 1 that says how likely it is.
A Probability Mass Function (PMF) assigns a probability to each possible outcome of a discrete random variable. For a fair six-sided die, each face has P = 1/6. For an LLM, the “die” has 50,000+ faces — one per vocabulary token.
PMF PROPERTIES
- 0 ≤ P(x) ≤ 1 for all x
- Σ P(x) = 1
- P(A ∪ B) = P(A) + P(B) for disjoint A, B
EXPECTED VALUE
E[X] = Σ x · P(x)
The “center of mass” of the distribution. For a fair die it's 3.5 — no single face, but the long-run average.
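The die example can be checked in a few lines. A minimal sketch in Python (variable names are illustrative):

```python
# Fair six-sided die from the text: each face has P = 1/6.
die_pmf = {face: 1 / 6 for face in range(1, 7)}

# PMF properties: every probability lies in [0, 1] and they sum to 1.
assert all(0.0 <= p <= 1.0 for p in die_pmf.values())
assert abs(sum(die_pmf.values()) - 1.0) < 1e-9

# Expected value: E[X] = sum of x * P(x), the "center of mass".
expected = sum(x * p for x, p in die_pmf.items())
print(expected)  # 3.5, even though no face shows 3.5
```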
Analogy — Weighted Dice: Think of an LLM's vocabulary as a massive weighted die. Each face is a token, and the weights are the model's learned probabilities. “Rolling” this die is sampling — and how we read the result (greedy, top-k, top-p) determines the text we get.
Conditional Probability & Bayes
Language models are fundamentally conditional probability machines. The probability of the next token depends on everything that came before it. Bayes' theorem gives us the mathematical framework for updating beliefs given new evidence — the same logic LLMs use at every step.
Even with a 95% sensitive test, a 1% base rate means most positives are false alarms.
P(A|B) = P(B|A) · P(A) / P(B)
Analogy — Medical Test Accuracy: A rare disease (1% prevalence) with a 95% accurate test still produces mostly false positives. The same intuition applies to LLMs: a token might have high conditional probability in context, but that doesn't mean it's factually correct. Context (prior) matters enormously.
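The arithmetic behind the analogy is worth seeing once. A sketch, assuming the "95% accurate" test means 95% sensitivity and 95% specificity (that split is an assumption; the text gives only a single accuracy figure):

```python
prevalence = 0.01    # P(disease), the 1% base rate
sensitivity = 0.95   # P(positive | disease)
specificity = 0.95   # P(negative | no disease), assumed equal to sensitivity

# Law of total probability: P(positive) over both sick and healthy people.
p_positive = (sensitivity * prevalence
              + (1 - specificity) * (1 - prevalence))

# Bayes' theorem: P(disease | positive) = P(positive | disease) * P(disease) / P(positive)
p_disease_given_positive = sensitivity * prevalence / p_positive
print(f"{p_disease_given_positive:.1%}")  # ~16%: most positives are false alarms
```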
The Chain Rule for Sequences
A language model computes the probability of a sentence by breaking it into a chain of conditional probabilities — each token's likelihood given everything before it. This is the chain rule of probability applied to sequences.
P(“The cat sat on the mat”) = P(The) × P(cat|The) × P(sat|The cat) × P(on|The cat sat) × P(the|The cat sat on) × P(mat|The cat sat on the)
Joint probabilities of sentences are astronomically small — that's why models work in log-space. Adding log-probabilities is numerically stable: log P(sentence) = Σ log P(token_i | context). This is also why perplexity (Section 05) uses the geometric mean rather than the raw product.
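A quick sketch of why log-space matters, using made-up per-token probabilities:

```python
import math

# 600 made-up conditional probabilities, standing in for a long passage.
token_probs = [0.2, 0.1, 0.05, 0.3, 0.4, 0.25] * 100

# Raw product: underflows to exactly 0.0 in double precision.
product = 1.0
for p in token_probs:
    product *= p
print(product)  # 0.0

# Sum of logs: a perfectly representable number (about -1041).
log_prob = sum(math.log(p) for p in token_probs)
print(log_prob)
```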
Information Theory
Information theory gives us the tools to measure uncertainty, compare distributions, and quantify how well a model has learned the true data distribution. These metrics — entropy, cross-entropy, and KL divergence — are the foundation of LLM training objectives.
Entropy H(P): intrinsic uncertainty of P
Cross-entropy H(P,Q): cost of using Q to encode P
KL divergence KL(P‖Q): extra bits wasted by Q vs P
H(P,Q) = H(P) + KL(P‖Q) = 2.000 + 0.176 = 2.176
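One pair of distributions that reproduces the figures above: P uniform over four outcomes (so H(P) = 2 bits) and Q = [0.4, 0.3, 0.2, 0.1]. This particular P and Q are an assumption for illustration; any pair with the same entropy and divergence would do:

```python
import math

P = [0.25, 0.25, 0.25, 0.25]  # uniform: maximum uncertainty over 4 outcomes
Q = [0.40, 0.30, 0.20, 0.10]  # a skewed model of P

entropy = -sum(p * math.log2(p) for p in P)                   # H(P): intrinsic uncertainty of P
cross_entropy = -sum(p * math.log2(q) for p, q in zip(P, Q))  # H(P,Q): cost of using Q to encode P
kl = sum(p * math.log2(p / q) for p, q in zip(P, Q))          # KL(P||Q): extra bits wasted by Q

print(f"H(P)     = {entropy:.3f} bits")       # 2.000
print(f"KL(P||Q) = {kl:.3f} bits")            # 0.176
print(f"H(P,Q)   = {cross_entropy:.3f} bits") # 2.176 = H(P) + KL(P||Q)
```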
Analogy — Information as Surprise: Information is the “surprise value” of an event. A coin flip (entropy = 1 bit) is mildly surprising. A fair 6-sided die (entropy ≈ 2.58 bits) is more surprising. A heavily loaded die (low entropy) is predictable and boring. LLM training minimizes cross-entropy — teaching the model to be as unsurprised by real text as possible.
Perplexity
Perplexity is the standard metric for evaluating language models. It measures how “surprised” a model is by a sequence of text — lower perplexity means the model assigns higher probabilities to the tokens it actually sees.
Lower perplexity = better model. Successive model generations have roughly halved perplexity on standard benchmarks.
Analogy — Average Choices: Perplexity of 10 means the model is, on average, as confused as if it had to pick uniformly from 10 equally likely options at each step. A perplexity of 1 would mean perfect prediction. Real models achieve perplexities in the 5–30 range depending on the dataset and model size.
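Concretely, perplexity is the exponential of the average negative log-likelihood: PPL = exp(-(1/N) Σ log P(token_i | context)). A sketch with made-up probabilities:

```python
import math

def perplexity(token_probs):
    """PPL = exp(-(1/N) * sum(log P(token_i | context)))."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# Uniform 1/10 probability at every step: perplexity of exactly 10,
# i.e. "as confused as picking from 10 equally likely options".
print(perplexity([0.1] * 20))  # 10.0 (up to float rounding)

# Perfect prediction at every step: perplexity of 1.
print(perplexity([1.0] * 20))  # 1.0
```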
Softmax In Depth
The softmax function is the bridge between raw model outputs (logits) and a valid probability distribution. It takes a vector of arbitrary real numbers and transforms them into positive values that sum to 1, preserving relative ordering while amplifying differences.
softmax(z_i) = exp(z_i / T) / Σ_j exp(z_j / T)
where T is the temperature and z is the logit vector
In a real model, the vocabulary has 50,000+ tokens. The vast majority receive near-zero probability after softmax. Only a handful of tokens at the top of the distribution matter for generation.
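A minimal temperature-scaled softmax in Python. Subtracting the maximum logit before exponentiating is a standard numerical-stability trick; it leaves the result unchanged because softmax is shift-invariant. The logit values are illustrative:

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)                          # subtract the max to avoid overflow
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [5.2, 4.8, 4.1]  # made-up logits for three tokens
for t in (0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax(logits, t)])
# Lower T sharpens the distribution; higher T flattens it.
```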
Analogy — Volume Knob: Temperature is like a volume knob on a radio. At T=0.1, you turn the dial to “max contrast” — the top token dominates and output is nearly deterministic. At T=2.0, the signal spreads out across many tokens and the output becomes creative (or noisy). T=1.0 is the model's native voice.
Sampling Strategies
Once softmax gives us a probability distribution, we need to decide how to pick from it. Different sampling strategies trade off between quality, diversity, and creativity — each with distinct characteristics suited to different use cases.
Greedy decoding always picks the most likely token — deterministic but repetitive. Top-k and top-p filter the distribution before sampling, balancing quality and diversity. Min-p is a newer strategy that scales the cutoff relative to the top token, adapting naturally to confident vs uncertain distributions.
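A sketch of three of these strategies over a toy distribution (token names and probabilities are made up; min-p is omitted for brevity):

```python
import random

dist = {"the": 0.412, "cat": 0.276, "sat": 0.137, "on": 0.083,
        "mat": 0.046, "floor": 0.025, "and": 0.014, "ran": 0.007}

def greedy(dist):
    # Always the argmax: deterministic, prone to repetition.
    return max(dist, key=dist.get)

def top_k(dist, k, rng=random):
    # Keep the k most likely tokens; choices() renormalizes the weights.
    items = sorted(dist.items(), key=lambda kv: kv[1], reverse=True)[:k]
    tokens, weights = zip(*items)
    return rng.choices(tokens, weights=weights)[0]

def top_p(dist, p, rng=random):
    # Keep the smallest prefix of tokens whose cumulative mass reaches p.
    items = sorted(dist.items(), key=lambda kv: kv[1], reverse=True)
    kept, cum = [], 0.0
    for token, prob in items:
        kept.append((token, prob))
        cum += prob
        if cum >= p:
            break
    tokens, weights = zip(*kept)
    return rng.choices(tokens, weights=weights)[0]

print(greedy(dist))        # "the"
print(top_k(dist, k=3))    # one of: the, cat, sat
print(top_p(dist, p=0.9))  # one of: the, cat, sat, on
```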
Temperature & Penalties
Temperature and penalties are the knobs that shape the probability distribution before sampling. They operate on the raw logit vector — adjusting, suppressing, and redistributing probability mass to control the character of generated text.
Temperature: divides all logits before softmax. Low = sharp, high = flat.
Repetition penalty: divides the logits of previously seen tokens.
Frequency penalty: subtracts penalty × occurrence count from logits.
| Token | Raw Logit | Occurrences | Modified Logit (penalty = 0) | P(token) |
|---|---|---|---|---|
| the | 5.2 | 3 | 5.20 | 41.2% |
| cat | 4.8 | 2 | 4.80 | 27.6% |
| sat | 4.1 | 0 | 4.10 | 13.7% |
| on | 3.6 | 1 | 3.60 | 8.3% |
| mat | 3.0 | 0 | 3.00 | 4.6% |
| floor | 2.4 | 0 | 2.40 | 2.5% |
| and | 1.8 | 1 | 1.80 | 1.4% |
| ran | 1.2 | 0 | 1.20 | 0.8% |
Analogy — Mixing Board: Think of these parameters as knobs on a mixing board. Temperature is the master volume — it affects every channel equally. Repetition penalty is a compressor that specifically attenuates channels that have already been loud. Frequency penalty is a gate that progressively silences repeat offenders.
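Putting the knobs together on the table's raw logits; a sketch assuming a frequency penalty of 0.5 (the penalty value is an arbitrary choice for illustration):

```python
import math

raw = {"the": 5.2, "cat": 4.8, "sat": 4.1, "on": 3.6,
       "mat": 3.0, "floor": 2.4, "and": 1.8, "ran": 1.2}
counts = {"the": 3, "cat": 2, "sat": 0, "on": 1,
          "mat": 0, "floor": 0, "and": 1, "ran": 0}

def adjust(raw, counts, temperature=1.0, freq_penalty=0.5):
    # Frequency penalty: subtract penalty * occurrence count, then apply temperature.
    return {t: (z - freq_penalty * counts[t]) / temperature for t, z in raw.items()}

def softmax(logits):
    m = max(logits.values())
    exps = {t: math.exp(z - m) for t, z in logits.items()}
    total = sum(exps.values())
    return {t: e / total for t, e in exps.items()}

probs = softmax(adjust(raw, counts))
for token, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(f"{token:>6}: {p:.1%}")
# "the" (seen 3 times) now ranks below the unpenalized "sat".
```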
Beam Search vs Sampling
Beam search and sampling represent two fundamentally different approaches to text generation. Beam search systematically explores multiple paths to find the most probable sequence, while sampling introduces randomness for creative, diverse outputs.
| Property | Beam Search | Sampling |
|---|---|---|
| Deterministic? | Yes — same input, same output | No — each run differs |
| Quality | High average quality, can be generic | Variable — sometimes brilliant, sometimes poor |
| Diversity | Low — tends to converge | High — explores the distribution |
| Speed | Slower (parallel beams) | Faster (single path) |
| Best for | Translation, summarization | Creative writing, chatbots, brainstorming |
Most modern chatbots (ChatGPT, Claude) use sampling rather than beam search. The reasoning: for open-ended conversation, diversity and naturalness matter more than finding the single most probable sequence. Beam search is still preferred for tasks like machine translation where fidelity is paramount.
Calibration & Hallucination
A model's softmax probability is not a measure of factual truth — it's a measure of statistical pattern match. High confidence and high accuracy are not the same thing, and this gap is at the heart of the hallucination problem in modern LLMs.
OVERCONFIDENCE GAP
Gap between confidence and actual accuracy widens at high confidence levels.
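The gap can be made concrete with a toy reliability check; the (confidence, correct?) pairs below are fabricated for illustration:

```python
# Each pair: (model confidence in its answer, whether the answer was correct).
predictions = [
    (0.95, True), (0.97, False), (0.99, True), (0.96, False),
    (0.70, True), (0.65, True), (0.72, False), (0.55, False), (0.51, True),
]

def bucket_gap(predictions, lo, hi):
    """Average confidence minus actual accuracy for predictions in [lo, hi)."""
    in_bucket = [(c, ok) for c, ok in predictions if lo <= c < hi]
    if not in_bucket:
        return 0.0
    avg_conf = sum(c for c, _ in in_bucket) / len(in_bucket)
    accuracy = sum(ok for _, ok in in_bucket) / len(in_bucket)
    return avg_conf - accuracy

print(bucket_gap(predictions, 0.9, 1.0))  # ~0.47: ~97% confident, only 50% correct
print(bucket_gap(predictions, 0.5, 0.9))  # ~0.03: reasonably calibrated here
```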
SOFTMAX ≠ TRUTH
Softmax probabilities measure pattern match likelihood against training data, not factual accuracy. A model can be 99% confident about something completely false.
TRAINING BIAS
Models learn from internet text where common misconceptions appear frequently. Popular myths get high probability because they appear often, not because they're true.
MITIGATION
RAG (retrieval-augmented generation), chain-of-thought prompting, and calibration fine-tuning can reduce but not eliminate the overconfidence problem.
Analogy — Confident Student: Imagine a student who always raises their hand first and speaks with total conviction — but is wrong 30% of the time. That's an LLM with poor calibration. The confidence in their voice (softmax probability) doesn't correlate with actual knowledge. The most dangerous errors are the ones delivered with the highest confidence.