UNDER THE HOOD

Loss & Gradient Descent

How neural networks measure their mistakes and learn from them. From loss functions and optimization algorithms to backpropagation and convergence — the math that makes deep learning work.

01

What Is a Loss Function?

Before a neural network can learn, it needs a way to measure how wrong it is. The loss function assigns a single number to every prediction — the lower, the better. Training is the process of minimizing this number.

Analogy — GPS Navigation

Think of a loss function like GPS recalculating your distance to the destination. You don't need to know the exact route — you just need a number that tells you how far off you are. Every turn you take either increases or decreases that distance. The model's "turns" are weight updates, and the loss is the remaining distance.

DESTINATION = CORRECT ANSWER
Loss → 0
Example — Predicted vs Actual
Cat — pred: 0.7 | actual: 1.0
Dog — pred: 0.2 | actual: 0.0
Bird — pred: 0.1 | actual: 0.0
Mean Squared Error: 0.0467

The loss function is the only signal a neural network has about what "correct" means. Without it, there is no learning — just random computation. Every architectural choice, every training trick, ultimately serves one goal: driving this number down.
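The example above can be reproduced in a few lines. This is a minimal sketch of the mean-squared-error calculation on those exact values; the class names are just labels from the example:

```python
# Mean squared error for the three-class example above.
# "Cat" is the correct class; the predictions are the model's scores.
predicted = [0.7, 0.2, 0.1]   # cat, dog, bird
actual    = [1.0, 0.0, 0.0]   # one-hot: the image is a cat

mse = sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted)) / len(actual)
print(round(mse, 4))  # 0.0467
```

The squared differences are 0.09, 0.04, and 0.01; their mean is 0.14 / 3 ≈ 0.0467, matching the value shown.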

02

Common Loss Functions

Different tasks need different loss functions. Each one encodes a different philosophy about what "wrong" means and how severely to penalize mistakes. Think of them as different grading rubrics — some reward partial credit, others are all-or-nothing.

Mean Squared Error
Formula

L = (1/n) Σ (yᵢ - ŷᵢ)²

When to use

Regression tasks — predicting continuous values like prices, temperatures, or distances.

Penalty behavior

Penalizes large errors quadratically. An error of 2 costs 4× as much as an error of 1.

Error Penalty Curve (figure): the penalty rises with the square of the error magnitude.

Cross-entropy is the workhorse of modern LLMs. At each position in a sequence, the model outputs a probability distribution over the entire vocabulary, and cross-entropy measures how far that distribution is from the one-hot "correct next token." This single loss drives the entire training process.
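For a one-hot target, cross-entropy reduces to the negative log of the probability the model assigned to the correct token. A minimal sketch, with a toy three-word vocabulary and made-up probabilities (not from any real model):

```python
import math

# Cross-entropy against a one-hot target reduces to the negative log
# probability the model assigned to the correct next token.
probs = {"mat": 0.6, "dog": 0.25, "the": 0.15}  # model's next-token distribution
correct = "mat"

loss = -math.log(probs[correct])
print(round(loss, 4))  # 0.5108
```

Note how only the probability of the correct token matters: assigning it probability 1.0 gives loss 0, while probability near 0 sends the loss toward infinity.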

03

The Loss Landscape

Imagine the loss function as a mountain range. Each point in this landscape represents a different combination of model weights, and the altitude is the loss value. Training is like hiking through fog, trying to find the lowest valley.

Loss Landscape Visualization
(figure: a surface with a local minimum, a saddle point, and the global minimum)

Local Minimum

A valley that looks lowest from nearby but isn't the deepest overall. Optimizers can get trapped here.

Saddle Point

A point that slopes down in some directions and up in others — like a mountain pass. Flat gradients stall learning.

Global Minimum

The absolute lowest point in the landscape. In practice, finding it exactly is often impossible — good enough is good enough.

Modern neural networks have millions or billions of parameters, creating a loss landscape with that many dimensions. Surprisingly, research suggests that in very high dimensions, local minima are rare — most critical points are saddle points. The real challenge is navigating these vast, flat regions efficiently.

04

Gradient Descent

Gradient descent is how neural networks learn. At every step, the model calculates the slope of the loss function at its current position, then takes a step in the direction that reduces loss. It's like walking downhill blindfolded — all you can feel is the slope under your feet.

Gradient Descent Visualization
The Update Rule
θ = θ - α · ∇L
θ — model parameters (weights)
α — learning rate (step size)
∇L — gradient of the loss (direction of steepest ascent)

The negative sign is critical: we move opposite to the gradient because the gradient points uphill. We want to go downhill — toward lower loss.

In practice, computing the exact gradient over the entire dataset is too expensive. Stochastic Gradient Descent (SGD) estimates the gradient from a small random batch of data. This introduces noise, but that noise actually helps — it can bounce the optimizer out of shallow local minima.
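The update rule and the minibatch idea fit in a short sketch. This toy example fits a single slope parameter by least squares; the data, batch size, and learning rate are illustrative choices, not a recipe:

```python
import random

# SGD on a 1-parameter least-squares fit: minimize the mean of (theta*x - y)^2.
# Data is generated with true slope 3; the gradient of one sample is 2*(theta*x - y)*x.
random.seed(0)
data = [(x, 3.0 * x) for x in [random.uniform(-1, 1) for _ in range(200)]]

theta, alpha = 0.0, 0.1
for step in range(500):
    batch = random.sample(data, 8)                        # small random batch
    grad = sum(2 * (theta * x - y) * x for x, y in batch) / len(batch)
    theta -= alpha * grad                                 # theta = theta - alpha * grad

print(round(theta, 3))  # converges to the true slope, 3.0
```

Each batch gives only a noisy estimate of the full-dataset gradient, yet the parameter still converges — the noise averages out over many steps.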

05

Learning Rate Deep Dive

The learning rate is arguably the single most important hyperparameter. It controls how big each step is during gradient descent. Imagine walking downhill — if your steps are too small, you'll take all day. Too large, and you'll overshoot the bottom and tumble.

Learning Rate — Too Small
(figure: loss over training steps)
α = 0.0001

The model barely moves. Training takes forever, and it may never reach the minimum within a reasonable number of epochs.

Modern training often uses learning rate schedules — starting with a warmup phase (small → large), then gradually decaying via cosine annealing or step decay. This gives the best of both worlds: fast early progress and fine-grained convergence at the end.
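A warmup-then-cosine schedule can be sketched as a plain function of the step index. The specific numbers (100 warmup steps, 1000 total, peak 3e-4) are illustrative placeholders, not recommendations:

```python
import math

def lr_schedule(step, warmup_steps=100, total_steps=1000, peak_lr=3e-4):
    """Linear warmup to peak_lr, then cosine decay toward zero."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps             # small -> large
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))  # cosine decay

print(lr_schedule(0))     # tiny value at the start of warmup
print(lr_schedule(99))    # the peak: 3e-4
print(lr_schedule(1000))  # essentially zero at the end
```

Plotting this function over the full run gives the characteristic ramp-up followed by a long, smooth decay.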

06

Optimization Algorithms

Gradient descent has many variants, each with a different strategy for navigating the loss landscape. Think of them as different driving styles — all heading to the same destination, but some handle curves and hills better than others.

Optimizer — SGD
Key Idea

Go downhill, nothing else.

Vanilla stochastic gradient descent — follows the raw gradient with a fixed step size. Simple but can be slow on flat regions and oscillate in narrow valleys.

Adam (Adaptive Moment Estimation) is the default optimizer for virtually all modern LLM training, reportedly including models like GPT-4 and Claude. Its combination of momentum and per-parameter learning rates makes it robust across a wide range of architectures and datasets, though the variant AdamW (with decoupled weight decay) is now generally preferred.
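Adam's two moving parts — momentum and per-parameter step scaling — can be shown for a single parameter. This follows the standard published formulation; the learning rate and test problem are illustrative, and real training loops track one (m, v) pair per parameter:

```python
# One Adam update, standard formulation:
# first moment (momentum) + second moment (per-parameter step scaling).
def adam_step(theta, grad, m, v, t, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad          # running mean of gradients
    v = b2 * v + (1 - b2) * grad ** 2     # running mean of squared gradients
    m_hat = m / (1 - b1 ** t)             # bias correction (t starts at 1)
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (v_hat ** 0.5 + eps)
    return theta, m, v

# Minimize L(theta) = theta^2, whose gradient is 2*theta.
theta, m, v = 5.0, 0.0, 0.0
for t in range(1, 3001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t)

print(theta)  # close to 0, the minimum
```

Because the step is divided by the running root-mean-square of recent gradients, parameters with consistently large gradients take proportionally smaller steps — the "adaptive" part of the name.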

07

Backpropagation in Detail

Backpropagation is how gradients are efficiently computed through an entire neural network. The forward pass computes the prediction; the backward pass distributes blame for the error back to every weight — like an assembly line tracing a defective product back to the responsible station.

Phase — Forward Pass
Input → Hidden 1 → Hidden 2 → Output — data flows forward toward the prediction
The Chain Rule

∂L/∂w₁ = ∂L/∂ŷ · ∂ŷ/∂h₂ · ∂h₂/∂h₁ · ∂h₁/∂w₁

To find how much a weight in an early layer affects the final loss, multiply the local derivatives along the path from output to that weight. Each layer contributes one factor in this chain of derivatives.

Why It's Efficient

Without backpropagation, computing the gradient of each weight would require a separate forward pass with a tiny perturbation. For a model with 1 billion parameters, that would mean 1 billion forward passes per training step.

Backpropagation computes all gradients in a single backward pass by reusing intermediate results. The cost is roughly 2× a single forward pass — regardless of parameter count.
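The chain-rule formula can be checked numerically on an all-scalar "network". This sketch uses made-up weights and a linear toy model (no activations), just to show the multiplication of local derivatives along the path:

```python
# Tiny all-scalar network: h1 = w1*x, h2 = w2*h1, yhat = w3*h2, L = (yhat - y)^2.
x, y = 2.0, 1.0
w1, w2, w3 = 0.5, -0.3, 0.8

def loss(w1_val):
    h1 = w1_val * x; h2 = w2 * h1; yhat = w3 * h2
    return (yhat - y) ** 2

# Forward pass, keeping intermediates.
h1 = w1 * x; h2 = w2 * h1; yhat = w3 * h2

# Backward pass: dL/dw1 = dL/dyhat * dyhat/dh2 * dh2/dh1 * dh1/dw1
grad_w1 = 2 * (yhat - y) * w3 * w2 * x

# Numerical check: perturb w1 slightly and measure the change in loss.
eps = 1e-6
numeric = (loss(w1 + eps) - loss(w1 - eps)) / (2 * eps)
print(abs(grad_w1 - numeric) < 1e-6)  # True: the chain rule matches
```

The numerical check is exactly the expensive alternative described above — one extra evaluation per perturbed weight — which is why backpropagation's single backward pass is such a win at scale.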

Backpropagation was described by Rumelhart, Hinton, and Williams in 1986, but its importance wasn't fully appreciated until decades later when deep networks and GPUs made large-scale gradient computation practical. It remains the single most important algorithm in deep learning.

08

Vanishing & Exploding Gradients

In deep networks, gradients must travel through many layers during backpropagation. If they shrink at each layer, early layers learn nothing (vanishing). If they grow, training becomes unstable (exploding). These are the fundamental challenges of deep learning.

Vanishing Gradients
Output
1.000
Layer 5
0.450
Layer 4
0.200
Layer 3
0.080
Layer 2
0.020
Layer 1
0.003

Like a whisper game: the gradient signal fades as it passes through each layer, until early layers receive essentially zero learning signal.

Solutions

Residual Connections

Skip connections let gradients flow directly through the network, bypassing problematic layers. Used in every modern transformer.

Gradient Clipping

Cap the maximum gradient norm to prevent explosion. If gradients exceed a threshold, scale them down proportionally.

Batch Normalization

Normalize layer inputs to have zero mean and unit variance, stabilizing the gradient magnitude across layers.

Careful Initialization

Xavier/Glorot and He initialization set initial weights to keep signal variance stable across layers.

Residual connections (introduced in ResNet, 2015) are arguably the single most important architectural innovation for training deep networks. By adding the input of a layer directly to its output, gradients have a highway that bypasses nonlinear transformations. Every modern LLM uses them.
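Of the fixes above, gradient clipping is the simplest to sketch in code. This scales the whole gradient vector down when its norm exceeds a threshold; the threshold of 1.0 is an illustrative choice:

```python
import math

def clip_by_norm(grads, max_norm=1.0):
    """Scale gradients down proportionally if their overall norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > max_norm:
        return [g * max_norm / norm for g in grads]
    return grads

exploding = [30.0, -40.0]       # norm 50: far over the threshold
print(clip_by_norm(exploding))  # [0.6, -0.8] — rescaled to norm 1
print(clip_by_norm([0.1, 0.1])) # unchanged: already under the threshold
```

Clipping by the global norm (rather than per element) preserves the gradient's direction — only the step size is capped.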

09

Regularization

A model that memorizes the training data perfectly but fails on new data is useless. Regularization techniques prevent this overfitting — like studying key concepts for an exam instead of memorizing every word in the textbook. You want understanding, not rote recall.

Training vs Validation Loss — No Regularization
(figure: loss vs. epochs, with separate training and validation curves)

The model memorizes the training data — perfect training accuracy but poor generalization. Training loss drops to zero while validation loss climbs.

L1 (Lasso)

Adds sum of absolute weights. Drives unimportant weights to exactly zero, producing sparse models.

L2 (Ridge)

Adds sum of squared weights. Shrinks all weights uniformly, discouraging any single weight from dominating.

Dropout

Randomly zeroes neuron outputs during training. Each forward pass trains a different sub-network.

Modern LLMs use relatively little explicit regularization. Instead, the sheer volume of training data acts as an implicit regularizer — with trillions of tokens, overfitting individual examples is nearly impossible. Weight decay (L2) is still used, but techniques like dropout are often unnecessary at scale.
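The L2 penalty mentioned above is just an extra term added to the loss, which in gradient descent becomes a uniform shrinking of every weight. A minimal sketch with made-up weights and coefficients:

```python
# L2 regularization: add lam * sum(w^2) to the loss. Its gradient contribution
# is 2*lam*w, so each descent step also shrinks every weight slightly
# toward zero ("weight decay").
weights = [2.0, -1.5, 0.01]
lam, alpha = 0.1, 0.01

def l2_penalty(ws):
    return lam * sum(w * w for w in ws)

# One gradient-descent step on the penalty term alone: w -= alpha * 2*lam*w.
decayed = [w - alpha * 2 * lam * w for w in weights]

print(l2_penalty(weights) > l2_penalty(decayed))  # True: every weight shrinks
```

Note the contrast with L1: here every weight is multiplied by the same factor, so large weights shrink by more in absolute terms but none is driven exactly to zero.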

10

Training Convergence

Much of training a neural network is knowing when to stop. Stop too early and the model is underfit. Stop too late and you've wasted compute (or overfit). Loss curves, early stopping, and learning rate schedules help you find the sweet spot — like knowing when to stop practicing before a performance.

Live Loss Curve
(animated: loss vs. epoch over a 50-epoch run, with warmup and early-stop markers)

Early Stopping

Monitor validation loss and stop training when it starts to increase for several consecutive epochs (patience). This prevents overfitting without needing to guess the right number of epochs.

Learning Rate Warmup

Start with a very small learning rate and linearly increase it over the first few hundred to few thousand steps. This prevents early instability when the model weights are still random.

Cosine Annealing

After warmup, decay the learning rate following a cosine curve toward zero. Provides fast early learning and fine-grained convergence near the end of training.
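Early stopping with patience, as described above, is a small bookkeeping loop. The validation losses here are synthetic, chosen so the curve bottoms out and then climbs:

```python
# Early stopping with patience: track the best validation loss and stop
# after it fails to improve for `patience` consecutive epochs.
val_losses = [2.1, 1.6, 1.3, 1.15, 1.1, 1.12, 1.18, 1.25, 1.4]

def early_stop_epoch(losses, patience=3):
    best, best_epoch, bad = float("inf"), 0, 0
    for epoch, loss in enumerate(losses):
        if loss < best:
            best, best_epoch, bad = loss, epoch, 0   # new best: reset patience
        else:
            bad += 1
            if bad >= patience:
                return best_epoch, epoch             # (best epoch, stop epoch)
    return best_epoch, len(losses) - 1

print(early_stop_epoch(val_losses))  # (4, 7): best at epoch 4, stopped at 7
```

In practice you would also checkpoint the weights at the best epoch, since by the time patience runs out the model has already drifted past its sweet spot.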


Large LLM training runs (like GPT-4's) can cost tens of millions of dollars in compute. There are no second chances — if the loss curve behaves unexpectedly partway through a months-long run, teams must diagnose and decide in real time whether to adjust hyperparameters, roll back to an earlier checkpoint, or continue. Monitoring convergence is both a science and a high-stakes operational skill.