
Training Pipeline

From raw internet text to aligned AI assistant — an interactive journey through every stage of modern LLM training: data curation, distributed pre-training, instruction tuning, RLHF, evaluation, and deployment.

By Hammad Abbasi · 25 min read
01

The Data Pipeline

Before any model sees a single gradient, terabytes of raw text must be collected, cleaned, deduplicated, scored, and tokenized. The data pipeline is the foundation everything else is built on.

Pipeline Flow

Web Crawl: 250B+ pages

Scrape billions of web pages from Common Crawl and combine them with books, scientific papers, and code repositories.

Common Crawl: 250B+ pages. The largest public web crawl.
The Pile: 825 GiB. A curated, diverse dataset from EleutherAI.
RedPajama: 1.2T tokens. An open reproduction of the LLaMA training data.

Analogy — Student Studying for Exams: Pre-training is like studying every textbook in the library — you absorb a vast amount of general knowledge. Fine-tuning is like focused exam prep — you drill on specific types of questions using curated study guides. You can't do exam prep without the general knowledge base, and general knowledge alone won't get you a perfect score on a specific test. Modern LLMs do both.

Data quality is the single biggest lever in LLM performance. Models trained on smaller but cleaner datasets consistently outperform those trained on larger but noisier ones. The Chinchilla scaling laws showed that most large models were undertrained: for a fixed compute budget, they had too many parameters and saw too few tokens.
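To make the pipeline concrete, here is a minimal sketch of two of its stages, quality filtering and exact deduplication. The thresholds and function names are illustrative assumptions; production pipelines use learned quality classifiers and MinHash/LSH to catch near-duplicates as well as exact ones.

```python
import hashlib

def normalize(text: str) -> str:
    # Lowercase and collapse whitespace before hashing
    return " ".join(text.lower().split())

def passes_quality_filter(text: str) -> bool:
    # Toy heuristics standing in for a learned quality classifier
    words = text.split()
    if len(words) < 50:                      # too short to be useful
        return False
    if len(set(words)) / len(words) < 0.3:   # highly repetitive text
        return False
    return True

def clean_corpus(docs):
    seen = set()
    for doc in docs:
        if not passes_quality_filter(doc):
            continue
        digest = hashlib.sha256(normalize(doc).encode()).hexdigest()
        if digest in seen:                   # exact duplicate, drop it
            continue
        seen.add(digest)
        yield doc
```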

02

Pre-training Objective

The core learning signal during pre-training. The model learns language by predicting missing or future tokens from context — no labels, no supervision, just raw text.

Next-Token Prediction

The model sees tokens left-to-right and predicts the next token at each position. Future tokens are masked — the model cannot peek ahead.

Example: Context [The] → Predict "cat"

Causal LM (next-token prediction) is the dominant pre-training objective for modern LLMs like GPT-4, LLaMA, and Mistral. Masked LM (BERT-style) excels at understanding tasks but cannot generate text autoregressively.
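A minimal sketch of the next-token objective, assuming a decoder-only model that maps token ids to logits (the causal mask lives inside the model): shift the sequence by one position and apply cross-entropy at every position.

```python
import torch.nn.functional as F

def next_token_loss(model, tokens):
    # tokens: [batch, seq] integer ids, e.g. [The, cat, sat, on]
    inputs = tokens[:, :-1]        # [The, cat, sat]
    targets = tokens[:, 1:]        # [cat, sat, on]  (shifted by one)
    logits = model(inputs)         # assumed shape: [batch, seq-1, vocab]
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # flatten all positions
        targets.reshape(-1),
    )
```

One forward pass therefore yields a training signal at every position, which is part of why the objective scales so well with raw text.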

03

Distributed Training

Training a 70B+ parameter model on a single GPU is impossible. Distributed training splits computation across hundreds or thousands of GPUs using complementary parallelism strategies.

Multi-GPU Layout (Data Parallelism)

Each GPU gets a full copy of the model but processes different data batches. Gradients are synchronized across GPUs after each step.

GPU 0: Full Model + Batch 1
GPU 1: Full Model + Batch 2
GPU 2: Full Model + Batch 3
GPU 3: Full Model + Batch 4
→ All-reduce gradients after each step

2,048 A100 GPUs (LLaMA 2 70B) · 21 days of training · ~$2M estimated compute cost

In practice, large training runs combine all strategies: ZeRO-3 for memory efficiency, tensor parallelism within nodes (fast NVLink), pipeline parallelism across nodes, and data parallelism for throughput. This is called 3D parallelism.
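A hedged sketch of the data-parallel piece (the all-reduce in the layout above), assuming torch.distributed has already been initialized and every rank holds an identical replica. In practice you would use DistributedDataParallel or ZeRO, which overlap this communication with the backward pass; tensor and pipeline parallelism additionally shard the model itself.

```python
import torch.distributed as dist

def data_parallel_step(model, optimizer, batch, loss_fn):
    # Each rank computes gradients on its own micro-batch...
    loss = loss_fn(model(batch["inputs"]), batch["targets"])
    loss.backward()
    # ...then gradients are summed across ranks and averaged, so every
    # replica applies exactly the same weight update.
    world = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```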

04

Mixed Precision & Optimization

Training in full FP32 precision is wasteful. Mixed-precision training uses lower-precision formats for most operations while keeping a master copy in FP32 for stability.

Bit Layout Comparison
FP32 (32 bits): range ±3.4×10³⁸. 1 sign bit, 8 exponent bits, 23 mantissa bits. Full precision: highest accuracy, highest memory.
FP16 (16 bits): range ±65,504. 1 sign bit, 5 exponent bits, 10 mantissa bits. Half precision: fast but limited range, prone to overflow.
BF16 (16 bits): range ±3.4×10³⁸. 1 sign bit, 8 exponent bits, 7 mantissa bits. Brain float: same range as FP32 at half the memory.

Memory Calculator (7B-parameter model, weights only)
FP32: 28.0 GB
FP16: 14.0 GB
BF16: 14.0 GB
Training requires 4–8× the model-weight memory for optimizer states, gradients, and activations.
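The calculator is plain arithmetic, parameters times bytes per parameter; a 7B-parameter model is assumed for the figures above:

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    # 8 bits per byte, 1e9 bytes per (decimal) GB
    return n_params * bits_per_param / 8 / 1e9

for fmt, bits in [("FP32", 32), ("FP16", 16), ("BF16", 16)]:
    print(f"{fmt}: {weight_memory_gb(7e9, bits):.1f} GB")  # 28.0 / 14.0 / 14.0
```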

BF16 has become the default training precision because it matches FP32's exponent range (no overflow issues) while using half the memory. Unlike FP16, it usually does not even need loss scaling, which makes 16-bit training stable with negligible accuracy loss for most architectures.
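A minimal sketch of what a BF16 mixed-precision step looks like in PyTorch: parameters and optimizer state stay in FP32, while the forward pass runs under autocast in bfloat16. The surrounding training-loop plumbing is assumed.

```python
import torch

def bf16_step(model, optimizer, batch, loss_fn):
    # Matmuls and activations run in bfloat16; parameters remain FP32
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = loss_fn(model(batch["inputs"]), batch["targets"])
    loss.backward()      # no GradScaler needed, unlike FP16
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```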

05

Learning Rate Schedules

The learning rate schedule controls how aggressively the model updates its weights during training. Getting this wrong can mean the difference between a working model and a diverged one.

Warmup + Cosine Decay
The learning rate ramps linearly from 0 to a peak of 3e-4 during warmup, then follows a cosine decay toward zero over roughly 100K steps.
No Warmup
Large initial gradients cause training instability — loss spikes and potential divergence.
LR Too High
Model overshoots minima, gradients explode, training diverges irreversibly.
LR Too Low
Training converges slowly or gets trapped in poor local minima. Wasted compute.

Analogy — Marathon Training Schedule: Learning rate schedules mirror how marathon runners train. You start with easy warm-up runs (warmup phase), build to peak intensity mid-cycle (peak learning rate), then taper down before race day to let your body consolidate gains (cosine decay). Going too hard too early causes injury (divergence); never pushing hard enough means you plateau.

Most modern LLMs use a warmup of 1–5% of total steps followed by cosine decay to near-zero. The peak learning rate is one of the most sensitive hyperparameters — typically 1–5×10⁻⁴ for large models.
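A sketch of the schedule itself, using the 3e-4 peak shown above; the warmup length and minimum learning rate here are illustrative defaults, not prescribed values.

```python
import math

def lr_at_step(step, total_steps, peak_lr=3e-4, warmup_steps=2000, min_lr=3e-5):
    if step < warmup_steps:
        # Linear warmup from 0 to the peak learning rate
        return peak_lr * step / max(1, warmup_steps)
    # Cosine decay from peak_lr down to min_lr over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```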

06

Instruction Tuning (SFT)

Pre-trained models can predict the next token, but they don't know how to follow instructions. Supervised fine-tuning (SFT) teaches the model to respond helpfully using curated instruction-response pairs.

Before / After Comparison
Prompt

Explain quantum computing in simple terms.

Base Model Response

Quantum computing quantum mechanics quantum bits superposition entanglement quantum gates quantum algorithms quantum supremacy quantum error correction quantum decoherence quantum parallelism is a type of computation that harnesses quantum mechanical phenomena...

SFT Dataset Format
instruction"Explain quantum computing in simple terms."
input""
output"Quantum computing uses quantum bits (qubits)..."

Quality over quantity: research from LIMA (2023) showed that fine-tuning on just 1,000 carefully curated examples can match or exceed models trained on 100K+ noisy instruction pairs. The key is diverse, high-quality demonstrations.
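A sketch of how one record in this format typically becomes a training example, assuming a Hugging Face-style tokenizer; the prompt template and the -100 ignore index are common conventions, not a fixed standard. The key idea is that loss is computed only on the response tokens.

```python
def build_sft_example(tokenizer, instruction, input_text, output_text):
    prompt = f"### Instruction:\n{instruction}\n\n"
    if input_text:
        prompt += f"### Input:\n{input_text}\n\n"
    prompt += "### Response:\n"

    prompt_ids = tokenizer.encode(prompt)
    response_ids = tokenizer.encode(output_text, add_special_tokens=False)
    response_ids.append(tokenizer.eos_token_id)

    input_ids = prompt_ids + response_ids
    labels = [-100] * len(prompt_ids) + response_ids  # mask prompt tokens from the loss
    return input_ids, labels
```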

07

RLHF

Reinforcement Learning from Human Feedback aligns model behavior with human values and preferences. It bridges the gap between “can generate text” and “generates text humans actually want.”

RLHF Pipeline

Collect Human Preferences

Human annotators compare pairs of model responses to the same prompt and choose which one is better. These preference pairs form the training signal for the reward model.

Prompt → Response A vs Response B → Human picks the winner
LLM → Response → Reward Model → Score → PPO Update → Better LLM

Analogy — Sports Coaching: RLHF works like an athlete with a coach. The athlete (LLM) performs (generates responses), then the coach (reward model) scores each performance: “That answer was helpful and honest — 8/10” or “That was evasive and misleading — 2/10.” Through many rounds of perform → score → adjust, the athlete learns to consistently deliver what the coach values. The human annotators train the coach; the coach then scales to millions of evaluations.

RLHF Training Loop
Generate: the LLM produces responses
Rank: humans compare outputs
Score: the reward model evaluates
Update: PPO optimizes the policy

Repeat for thousands of iterations until the model aligns with human preferences

RLHF was the key innovation that made ChatGPT feel aligned compared to base GPT-3. Without it, models generate plausible-sounding but unhelpful, harmful, or untruthful responses. The reward model acts as a proxy for human judgment at scale.
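The "coach" itself is trained with a simple pairwise objective on the collected preferences. A minimal sketch, assuming reward_model is any module that scores a batch of (prompt, response) pairs as scalars:

```python
import torch.nn.functional as F

def reward_model_loss(reward_model, prompts, chosen, rejected):
    # Bradley-Terry style loss: push the score of the human-preferred
    # response above the score of the rejected one.
    r_chosen = reward_model(prompts, chosen)      # shape: [batch]
    r_rejected = reward_model(prompts, rejected)  # shape: [batch]
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

PPO then uses this learned scorer, together with a KL penalty toward the SFT model, to update the policy without drifting into gibberish that merely games the reward.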

08

DPO & Alternatives

RLHF works, but it is complex and expensive. Direct Preference Optimization (DPO) and related alternatives simplify alignment: DPO removes the need for a separate reward model and an RL loop, while AI-feedback methods reduce the reliance on human annotators.

Pipeline Comparison
RLHF pipeline: collect preferences → train reward model → PPO optimization

Pros: well-studied, flexible, strong alignment signal

Cons: complex pipeline, unstable, expensive (three models in memory)

Separate reward model: required

DPO has become the preferred alignment method for many open-source models (Zephyr, Mixtral-Instruct) because it reduces training complexity from three models to one, cuts compute costs significantly, and produces comparable alignment quality.
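DPO's core is a single loss on the same preference pairs, computed against a frozen copy of the SFT model. A minimal sketch; the inputs are assumed to be the summed log-probabilities of each full response under the policy and the reference model:

```python
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # Increase the policy's preference margin for the chosen response
    # relative to the frozen reference; beta controls how far the policy
    # is allowed to drift from the reference model.
    policy_margin = policy_logp_chosen - policy_logp_rejected
    ref_margin = ref_logp_chosen - ref_logp_rejected
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
```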

09

Evaluation & Benchmarks

How do you know if a model is actually good? Benchmarks provide standardized tests, but no single metric captures the full picture. Understanding what each benchmark measures — and its limitations — is critical.

Benchmark Comparison
MMLU (/100): GPT-4 86.4 · Claude 3 86.8 · LLaMA 2 70B 68.9 · Mistral 7B 62.5
HumanEval (/100): GPT-4 67 · Claude 3 84.9 · LLaMA 2 70B 29.9 · Mistral 7B 30.5
GSM8K (/100): GPT-4 92 · Claude 3 95 · LLaMA 2 70B 56.8 · Mistral 7B 52.2
MT-Bench (/10): GPT-4 8.99 · Claude 3 8.6 · LLaMA 2 70B 6.86 · Mistral 7B 7.6
MMLU
57 subjects from STEM to humanities. Tests breadth of world knowledge and reasoning.
HumanEval
164 Python programming problems. Tests code generation ability with unit tests.
GSM8K
8.5K grade school math problems. Tests multi-step mathematical reasoning.
MT-Bench
Multi-turn conversation quality scored by GPT-4 as judge. Tests instruction following and reasoning.
Perplexity

Perplexity measures how “surprised” a model is by text. Lower is better — a perplexity of 1 means the model perfectly predicts every token.

PPL = exp(−(1/N) × Σ log P(token_i | context))

Perplexity is useful for comparing models on the same dataset but doesn't directly measure usefulness or alignment.
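A sketch of the computation, given per-token log-probabilities from any model (natural log assumed, matching the formula above):

```python
import math

def perplexity(token_log_probs):
    # Average negative log-likelihood per token, exponentiated
    n = len(token_log_probs)
    return math.exp(-sum(token_log_probs) / n)

# A model that assigns probability 0.25 to every token has perplexity 4.0
print(perplexity([math.log(0.25)] * 100))
```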

No single benchmark tells the whole story. MMLU tests knowledge but not reasoning depth. HumanEval tests code but only Python. Models increasingly overfit to public benchmarks — contamination is a growing concern in the evaluation community.

10

Quantization & Deployment

A trained model is useless if you can't serve it. Quantization reduces model precision to shrink memory footprint and increase throughput, enabling deployment on smaller and cheaper hardware.

Quality vs Size
INT8 (8 bits per parameter): ~99.5% quality retention. Near-lossless; the standard production precision.

Memory Calculator (7B-parameter model, weights only)
FP16: 14.0 GB · INT8: 7.0 GB · INT4: 3.5 GB · INT2: 1.8 GB
vLLM
High-throughput serving with PagedAttention and continuous batching.
TGI
Hugging Face’s Text Generation Inference. Production-ready with token streaming.
llama.cpp
CPU/GPU inference in C++. Runs GGUF models on consumer hardware.

Analogy — JPEG Quality Slider: Quantization is like the quality slider when you save a JPEG. At 100% quality (FP32), the image is pristine but the file is huge. At 80% (INT8), it's visually identical but half the size. At 50% (INT4), you notice slight artifacts in complex areas, but it's a quarter the size. For most uses, 80% is the sweet spot — and models behave the same way.

For most production deployments, INT8 quantization is the sweet spot — near-zero quality loss with 2× memory savings. INT4 (AWQ or GPTQ) is ideal when you need to fit larger models on smaller GPUs or serve on consumer hardware.
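A minimal sketch of the core idea behind weight-only quantization: store low-bit integers plus a scale factor. This per-tensor absmax scheme is the simplest possible version; AWQ and GPTQ add per-group scales and activation-aware calibration, but the memory arithmetic is the same.

```python
import torch

def quantize_int8(weights: torch.Tensor):
    # Symmetric per-tensor quantization: map the largest magnitude to 127
    scale = weights.abs().max() / 127.0
    q = torch.clamp(torch.round(weights / scale), -128, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale

w = torch.randn(4096, 4096)               # one weight matrix
q, scale = quantize_int8(w)
err = (w - dequantize_int8(q, scale)).abs().mean()
print(f"mean reconstruction error: {err.item():.5f}")  # small relative to the weights
```

The quantized matrix uses one byte per parameter plus a single scale, which is where the 2× memory saving over FP16 comes from.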
