
Training Pipeline

From raw internet text to aligned AI assistant — an interactive journey through every stage of modern LLM training: data curation, distributed pre-training, instruction tuning, RLHF, evaluation, and deployment.

By Hammad Abbasi · 25 min read
01

The Data Pipeline

Before any model sees a single gradient, terabytes of raw text must be collected, cleaned, deduplicated, scored, and tokenized. The data pipeline is the foundation everything else is built on.

Pipeline Flow

Web Crawl: 250B+ pages

Scrape billions of web pages from Common Crawl and combine them with books, scientific papers, and code repositories.

Common Crawl: 250B+ pages. The largest public web crawl.
The Pile: 825 GiB. A curated, diverse dataset from EleutherAI.
RedPajama: 1.2T tokens. An open reproduction of the LLaMA training data.

Analogy — Student Studying for Exams: Pre-training is like studying every textbook in the library — you absorb a vast amount of general knowledge. Fine-tuning is like focused exam prep — you drill on specific types of questions using curated study guides. You can't do exam prep without the general knowledge base, and general knowledge alone won't get you a perfect score on a specific test. Modern LLMs do both.

Data quality is the single biggest lever in LLM performance. Models trained on smaller but cleaner datasets consistently outperform those trained on larger but noisier ones. The Chinchilla scaling laws showed that most large models were undertrained: for a fixed compute budget, they had too many parameters and saw too few tokens.
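To make the pipeline concrete, here is a minimal sketch of two of its stages, quality filtering and exact deduplication. The thresholds and function names are illustrative assumptions; production pipelines use learned quality classifiers and MinHash/LSH to catch near-duplicates as well as exact ones.

```python
import hashlib

def normalize(text: str) -> str:
    # Lowercase and collapse whitespace before hashing
    return " ".join(text.lower().split())

def passes_quality_filter(text: str) -> bool:
    # Toy heuristics standing in for a learned quality classifier
    words = text.split()
    if len(words) < 50:                      # too short to be useful
        return False
    if len(set(words)) / len(words) < 0.3:   # highly repetitive text
        return False
    return True

def clean_corpus(docs):
    seen = set()
    for doc in docs:
        if not passes_quality_filter(doc):
            continue
        digest = hashlib.sha256(normalize(doc).encode()).hexdigest()
        if digest in seen:                   # exact duplicate, drop it
            continue
        seen.add(digest)
        yield doc
```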

02

Pre-training Objective

The core learning signal during pre-training. The model learns language by predicting missing or future tokens from context — no labels, no supervision, just raw text.

Next-Token Prediction

The model sees tokens left-to-right and predicts the next token at each position. Future tokens are masked — the model cannot peek ahead.

Example: Context [The] → Predict "cat"

Causal LM (next-token prediction) is the dominant pre-training objective for modern LLMs like GPT-4, LLaMA, and Mistral. Masked LM (BERT-style) excels at understanding tasks but cannot generate text autoregressively.
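A minimal sketch of the next-token objective, assuming a decoder-only model that maps token ids to logits (the causal mask lives inside the model): shift the sequence by one position and apply cross-entropy at every position.

```python
import torch.nn.functional as F

def next_token_loss(model, tokens):
    # tokens: [batch, seq] integer ids, e.g. [The, cat, sat, on]
    inputs = tokens[:, :-1]        # [The, cat, sat]
    targets = tokens[:, 1:]        # [cat, sat, on]  (shifted by one)
    logits = model(inputs)         # assumed shape: [batch, seq-1, vocab]
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # flatten all positions
        targets.reshape(-1),
    )
```

One forward pass therefore yields a training signal at every position, which is part of why the objective scales so well with raw text.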

03

Distributed Training

Training a 70B+ parameter model on a single GPU is impossible. Distributed training splits computation across hundreds or thousands of GPUs using complementary parallelism strategies.

Multi-GPU Layout (Data Parallelism)

Each GPU gets a full copy of the model but processes different data batches. Gradients are synchronized across GPUs after each step.

GPU 0: Full Model + Batch 1
GPU 1: Full Model + Batch 2
GPU 2: Full Model + Batch 3
GPU 3: Full Model + Batch 4
→ All-reduce gradients after each step

2,048 A100 GPUs (LLaMA 2 70B) · 21 days of training · ~$2M estimated compute cost

In practice, large training runs combine all strategies: ZeRO-3 for memory efficiency, tensor parallelism within nodes (fast NVLink), pipeline parallelism across nodes, and data parallelism for throughput. This is called 3D parallelism.
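A hedged sketch of the data-parallel piece (the all-reduce in the layout above), assuming torch.distributed has already been initialized and every rank holds an identical replica. In practice you would use DistributedDataParallel or ZeRO, which overlap this communication with the backward pass; tensor and pipeline parallelism additionally shard the model itself.

```python
import torch.distributed as dist

def data_parallel_step(model, optimizer, batch, loss_fn):
    # Each rank computes gradients on its own micro-batch...
    loss = loss_fn(model(batch["inputs"]), batch["targets"])
    loss.backward()
    # ...then gradients are summed across ranks and averaged, so every
    # replica applies exactly the same weight update.
    world = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```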

04

Mixed Precision & Optimization

Training in full FP32 precision is wasteful. Mixed-precision training uses lower-precision formats for most operations while keeping a master copy in FP32 for stability.

Bit Layout Comparison
FP32 (32 bits): range ±3.4×10³⁸. 1 sign bit, 8 exponent bits, 23 mantissa bits. Full precision: highest accuracy, highest memory.
FP16 (16 bits): range ±65,504. 1 sign bit, 5 exponent bits, 10 mantissa bits. Half precision: fast but limited range, prone to overflow.
BF16 (16 bits): range ±3.4×10³⁸. 1 sign bit, 8 exponent bits, 7 mantissa bits. Brain float: same range as FP32 at half the memory.

Memory Calculator (7B-parameter model, weights only)
FP32: 28.0 GB
FP16: 14.0 GB
BF16: 14.0 GB
Training requires 4–8× the model-weight memory for optimizer states, gradients, and activations.
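The calculator is plain arithmetic, parameters times bytes per parameter; a 7B-parameter model is assumed for the figures above:

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    # 8 bits per byte, 1e9 bytes per (decimal) GB
    return n_params * bits_per_param / 8 / 1e9

for fmt, bits in [("FP32", 32), ("FP16", 16), ("BF16", 16)]:
    print(f"{fmt}: {weight_memory_gb(7e9, bits):.1f} GB")  # 28.0 / 14.0 / 14.0
```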

BF16 has become the default training precision because it matches FP32's exponent range (no overflow issues) while using half the memory. Unlike FP16, it usually does not even need loss scaling, which makes 16-bit training stable with negligible accuracy loss for most architectures.
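A minimal sketch of what a BF16 mixed-precision step looks like in PyTorch: parameters and optimizer state stay in FP32, while the forward pass runs under autocast in bfloat16. The surrounding training-loop plumbing is assumed.

```python
import torch

def bf16_step(model, optimizer, batch, loss_fn):
    # Matmuls and activations run in bfloat16; parameters remain FP32
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = loss_fn(model(batch["inputs"]), batch["targets"])
    loss.backward()      # no GradScaler needed, unlike FP16
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```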

05

Learning Rate Schedules

The learning rate schedule controls how aggressively the model updates its weights during training. Getting this wrong can mean the difference between a working model and a diverged one.

Warmup + Cosine Decay
The learning rate ramps linearly from 0 to a peak of 3e-4 during warmup, then follows a cosine decay toward zero over roughly 100K steps.
No Warmup
Large initial gradients cause training instability — loss spikes and potential divergence.
LR Too High
Model overshoots minima, gradients explode, training diverges irreversibly.
LR Too Low
Training converges slowly or gets trapped in poor local minima. Wasted compute.

Analogy — Marathon Training Schedule: Learning rate schedules mirror how marathon runners train. You start with easy warm-up runs (warmup phase), build to peak intensity mid-cycle (peak learning rate), then taper down before race day to let your body consolidate gains (cosine decay). Going too hard too early causes injury (divergence); never pushing hard enough means you plateau.

Most modern LLMs use a warmup of 1–5% of total steps followed by cosine decay to near-zero. The peak learning rate is one of the most sensitive hyperparameters — typically 1–5×10⁻⁴ for large models.
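A sketch of the schedule itself, using the 3e-4 peak shown above; the warmup length and minimum learning rate here are illustrative defaults, not prescribed values.

```python
import math

def lr_at_step(step, total_steps, peak_lr=3e-4, warmup_steps=2000, min_lr=3e-5):
    if step < warmup_steps:
        # Linear warmup from 0 to the peak learning rate
        return peak_lr * step / max(1, warmup_steps)
    # Cosine decay from peak_lr down to min_lr over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```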

06

Instruction Tuning (SFT)

Pre-trained models can predict the next token, but they don't know how to follow instructions. Supervised fine-tuning (SFT) teaches the model to respond helpfully using curated instruction-response pairs.

Before / After Comparison
Prompt

Explain quantum computing in simple terms.

Base Model Response

Quantum computing quantum mechanics quantum bits superposition entanglement quantum gates quantum algorithms quantum supremacy quantum error correction quantum decoherence quantum parallelism is a type of computation that harnesses quantum mechanical phenomena...

SFT Dataset Format
instruction"Explain quantum computing in simple terms."
input""
output"Quantum computing uses quantum bits (qubits)..."

Quality over quantity: research from LIMA (2023) showed that fine-tuning on just 1,000 carefully curated examples can match or exceed models trained on 100K+ noisy instruction pairs. The key is diverse, high-quality demonstrations.
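A sketch of how one record in this format typically becomes a training example, assuming a Hugging Face-style tokenizer; the prompt template and the -100 ignore index are common conventions, not a fixed standard. The key idea is that loss is computed only on the response tokens.

```python
def build_sft_example(tokenizer, instruction, input_text, output_text):
    prompt = f"### Instruction:\n{instruction}\n\n"
    if input_text:
        prompt += f"### Input:\n{input_text}\n\n"
    prompt += "### Response:\n"

    prompt_ids = tokenizer.encode(prompt)
    response_ids = tokenizer.encode(output_text, add_special_tokens=False)
    response_ids.append(tokenizer.eos_token_id)

    input_ids = prompt_ids + response_ids
    labels = [-100] * len(prompt_ids) + response_ids  # mask prompt tokens from the loss
    return input_ids, labels
```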

07

RLHF

Reinforcement Learning from Human Feedback aligns model behavior with human values and preferences. It bridges the gap between “can generate text” and “generates text humans actually want.”

RLHF Pipeline

Collect Human Preferences

Human annotators compare pairs of model responses to the same prompt and choose which one is better. These preference pairs form the training signal for the reward model.

Prompt → Response A vs Response B → Human picks the winner
LLM → Response → Reward Model → Score → PPO Update → Better LLM

Analogy — Sports Coaching: RLHF works like an athlete with a coach. The athlete (LLM) performs (generates responses), then the coach (reward model) scores each performance: “That answer was helpful and honest — 8/10” or “That was evasive and misleading — 2/10.” Through many rounds of perform → score → adjust, the athlete learns to consistently deliver what the coach values. The human annotators train the coach; the coach then scales to millions of evaluations.

RLHF Training Loop
Generate: the LLM produces responses
Rank: humans compare outputs
Score: the reward model evaluates
Update: PPO optimizes the policy

Repeat for thousands of iterations until the model aligns with human preferences

RLHF was the key innovation that made ChatGPT feel aligned compared to base GPT-3. Without it, models generate plausible-sounding but unhelpful, harmful, or untruthful responses. The reward model acts as a proxy for human judgment at scale.
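The "coach" itself is trained with a simple pairwise objective on the collected preferences. A minimal sketch, assuming reward_model is any module that scores a batch of (prompt, response) pairs as scalars:

```python
import torch.nn.functional as F

def reward_model_loss(reward_model, prompts, chosen, rejected):
    # Bradley-Terry style loss: push the score of the human-preferred
    # response above the score of the rejected one.
    r_chosen = reward_model(prompts, chosen)      # shape: [batch]
    r_rejected = reward_model(prompts, rejected)  # shape: [batch]
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

PPO then uses this learned scorer, together with a KL penalty toward the SFT model, to update the policy without drifting into gibberish that merely games the reward.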

08

DPO & Alternatives

RLHF works, but it is complex and expensive. Direct Preference Optimization (DPO) and related alternatives simplify alignment: DPO removes the need for a separate reward model and an RL loop, while AI-feedback methods reduce the reliance on human annotators.

Pipeline Comparison
RLHF pipeline: collect preferences → train reward model → PPO optimization

Pros: well-studied, flexible, strong alignment signal

Cons: complex pipeline, unstable, expensive (three models in memory)

Separate reward model: required

DPO has become the preferred alignment method for many open-source models (Zephyr, Mixtral-Instruct) because it reduces training complexity from three models to one, cuts compute costs significantly, and produces comparable alignment quality.
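DPO's core is a single loss on the same preference pairs, computed against a frozen copy of the SFT model. A minimal sketch; the inputs are assumed to be the summed log-probabilities of each full response under the policy and the reference model:

```python
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # Increase the policy's preference margin for the chosen response
    # relative to the frozen reference; beta controls how far the policy
    # is allowed to drift from the reference model.
    policy_margin = policy_logp_chosen - policy_logp_rejected
    ref_margin = ref_logp_chosen - ref_logp_rejected
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
```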

09

Evaluation & Benchmarks

How do you know if a model is actually good? Benchmarks provide standardized tests, but no single metric captures the full picture. Understanding what each benchmark measures — and its limitations — is critical.

Benchmark Comparison
MMLU (/100): GPT-4 86.4 · Claude 3 86.8 · LLaMA 2 70B 68.9 · Mistral 7B 62.5
HumanEval (/100): GPT-4 67 · Claude 3 84.9 · LLaMA 2 70B 29.9 · Mistral 7B 30.5
GSM8K (/100): GPT-4 92 · Claude 3 95 · LLaMA 2 70B 56.8 · Mistral 7B 52.2
MT-Bench (/10): GPT-4 8.99 · Claude 3 8.6 · LLaMA 2 70B 6.86 · Mistral 7B 7.6
MMLU
57 subjects from STEM to humanities. Tests breadth of world knowledge and reasoning.
HumanEval
164 Python programming problems. Tests code generation ability with unit tests.
GSM8K
8.5K grade school math problems. Tests multi-step mathematical reasoning.
MT-Bench
Multi-turn conversation quality scored by GPT-4 as judge. Tests instruction following and reasoning.
Perplexity

Perplexity measures how “surprised” a model is by text. Lower is better — a perplexity of 1 means the model perfectly predicts every token.

PPL = exp(−(1/N) × Σ log P(token_i | context))

Perplexity is useful for comparing models on the same dataset but doesn't directly measure usefulness or alignment.
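A sketch of the computation, given per-token log-probabilities from any model (natural log assumed, matching the formula above):

```python
import math

def perplexity(token_log_probs):
    # Average negative log-likelihood per token, exponentiated
    n = len(token_log_probs)
    return math.exp(-sum(token_log_probs) / n)

# A model that assigns probability 0.25 to every token has perplexity 4.0
print(perplexity([math.log(0.25)] * 100))
```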

No single benchmark tells the whole story. MMLU tests knowledge but not reasoning depth. HumanEval tests code but only Python. Models increasingly overfit to public benchmarks — contamination is a growing concern in the evaluation community.

10

Quantization & Deployment

A trained model is useless if you can't serve it. Quantization reduces model precision to shrink memory footprint and increase throughput, enabling deployment on smaller and cheaper hardware.

Quality vs Size
INT8 (8 bits per parameter): ~99.5% quality retention. Near-lossless; the standard production precision.

Memory Calculator (7B-parameter model, weights only)
FP16: 14.0 GB · INT8: 7.0 GB · INT4: 3.5 GB · INT2: 1.8 GB
vLLM
High-throughput serving with PagedAttention and continuous batching.
TGI
Hugging Face’s Text Generation Inference. Production-ready with token streaming.
llama.cpp
CPU/GPU inference in C++. Runs GGUF models on consumer hardware.

Analogy — JPEG Quality Slider: Quantization is like the quality slider when you save a JPEG. At 100% quality (FP32), the image is pristine but the file is huge. At 80% (INT8), it's visually identical but half the size. At 50% (INT4), you notice slight artifacts in complex areas, but it's a quarter the size. For most uses, 80% is the sweet spot — and models behave the same way.

For most production deployments, INT8 quantization is the sweet spot — near-zero quality loss with 2× memory savings. INT4 (AWQ or GPTQ) is ideal when you need to fit larger models on smaller GPUs or serve on consumer hardware.
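A minimal sketch of the core idea behind weight-only quantization: store low-bit integers plus a scale factor. This per-tensor absmax scheme is the simplest possible version; AWQ and GPTQ add per-group scales and activation-aware calibration, but the memory arithmetic is the same.

```python
import torch

def quantize_int8(weights: torch.Tensor):
    # Symmetric per-tensor quantization: map the largest magnitude to 127
    scale = weights.abs().max() / 127.0
    q = torch.clamp(torch.round(weights / scale), -128, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale

w = torch.randn(4096, 4096)               # one weight matrix
q, scale = quantize_int8(w)
err = (w - dequantize_int8(q, scale)).abs().mean()
print(f"mean reconstruction error: {err.item():.5f}")  # small relative to the weights
```

The quantized matrix uses one byte per parameter plus a single scale, which is where the 2× memory saving over FP16 comes from.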
