The Data Pipeline
Before any model sees a single gradient, terabytes of raw text must be collected, cleaned, deduplicated, scored, and tokenized. The data pipeline is the foundation everything else is built on.
Web Crawl
250B+ pages: scrape billions of web pages from Common Crawl, plus books, scientific papers, and code repositories.
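The cleaning and deduplication steps can be sketched in a few lines. Here is a minimal illustration (hash-based exact dedup plus a crude length filter; the function names are hypothetical, and real pipelines use fuzzy dedup such as MinHash and learned quality classifiers):

```python
import hashlib

def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivially different copies hash alike.
    return " ".join(text.lower().split())

def dedup_and_filter(docs, min_words=5):
    """Exact dedup via content hashing, plus a crude length-based quality gate."""
    seen, kept = set(), []
    for doc in docs:
        norm = normalize(doc)
        if len(norm.split()) < min_words:   # drop near-empty pages
            continue
        digest = hashlib.sha256(norm.encode()).hexdigest()
        if digest in seen:                  # drop exact duplicates
            continue
        seen.add(digest)
        kept.append(doc)
    return kept
```

Production pipelines chain many more stages (language ID, toxicity filters, perplexity-based scoring), but they follow this same filter-and-dedup shape.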
Analogy — Student Studying for Exams: Pre-training is like studying every textbook in the library — you absorb a vast amount of general knowledge. Fine-tuning is like focused exam prep — you drill on specific types of questions using curated study guides. You can't do exam prep without the general knowledge base, and general knowledge alone won't get you a perfect score on a specific test. Modern LLMs do both.
Data quality is the single biggest lever in LLM performance. Models trained on smaller but cleaner datasets consistently outperform those trained on larger but noisier ones. The Chinchilla scaling laws showed that most models of the time were significantly undertrained on data: for compute-optimal training, token count should grow roughly in proportion to parameter count.
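A commonly cited approximation of the Chinchilla result is about 20 training tokens per parameter. As a quick back-of-the-envelope helper (the function name is illustrative):

```python
def chinchilla_optimal_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    # Rough Chinchilla heuristic: compute-optimal token count scales roughly
    # linearly with parameter count, at about 20 tokens per parameter.
    return n_params * tokens_per_param

# A 70B-parameter model would want roughly 1.4 trillion training tokens.
```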
Pre-training Objective
The core learning signal during pre-training. The model learns language by predicting missing or future tokens from context — no labels, no supervision, just raw text.
The model sees tokens left-to-right and predicts the next token at each position. Future tokens are masked — the model cannot peek ahead.
Causal LM (next-token prediction) is the dominant pre-training objective for modern LLMs like GPT-4, LLaMA, and Mistral. Masked LM (BERT-style) excels at understanding tasks but cannot generate text autoregressively.
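The next-token objective reduces to averaging negative log-likelihoods over shifted positions. A toy sketch, with a fixed probability table standing in for a real model's softmax output (the table and function name are illustrative):

```python
import math

def causal_lm_loss(token_probs, tokens):
    """Average next-token negative log-likelihood.

    token_probs: dict mapping a context token -> {next_token: probability},
    a stand-in for a real model's softmax output.
    """
    nll = 0.0
    steps = 0
    for prev, nxt in zip(tokens, tokens[1:]):   # predict each token from its left context
        p = token_probs[prev].get(nxt, 1e-9)    # probability assigned to the true next token
        nll += -math.log(p)
        steps += 1
    return nll / steps
```

A real implementation conditions on the full prefix (not just the previous token) and enforces the "no peeking ahead" rule with a causal attention mask, but the loss is exactly this shifted cross-entropy.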
Distributed Training
Training a 70B+ parameter model on a single GPU is impossible. Distributed training splits computation across hundreds or thousands of GPUs using complementary parallelism strategies.
Each GPU gets a full copy of the model but processes different data batches. Gradients are synchronized across GPUs after each step.
In practice, large training runs combine all strategies: ZeRO-3 for memory efficiency, tensor parallelism within nodes (fast NVLink), pipeline parallelism across nodes, and data parallelism for throughput. This is called 3D parallelism.
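The data-parallel leg of this setup is easy to verify in miniature: sharding a batch across workers and averaging their gradients (the all-reduce step) reproduces the full-batch gradient exactly. A sketch using a scalar least-squares model (function names are illustrative):

```python
def grad(w, xs, ys):
    # Gradient of mean squared error 0.5*(w*x - y)**2 with respect to w, over a batch.
    return sum((w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def data_parallel_grad(w, xs, ys, n_workers):
    # Shard the batch across workers, compute local gradients, then "all-reduce"
    # (here: a plain average), as synchronized data parallelism does each step.
    shard = len(xs) // n_workers
    local = [grad(w, xs[i * shard:(i + 1) * shard], ys[i * shard:(i + 1) * shard])
             for i in range(n_workers)]
    return sum(local) / n_workers
```

Because the mean of shard means (with equal shards) equals the full-batch mean, synchronous data parallelism is mathematically equivalent to large-batch training on one device.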
Mixed Precision & Optimization
Training in full FP32 precision is wasteful. Mixed-precision training uses lower-precision formats for most operations while keeping a master copy in FP32 for stability.
FP32 (full precision): highest accuracy, highest memory
FP16 (half precision): fast but limited exponent range, prone to overflow
BF16 (brain float): same exponent range as FP32, half the memory
BF16 has become the default training precision because it matches FP32's exponent range (no overflow issues) while using half the memory. Unlike FP16, which needs loss scaling to keep small gradients from underflowing, BF16 is typically stable at 16 bits with no accuracy loss for most architectures.
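To see why the exponent range matters, here is a small pure-Python demonstration (the helpers to_fp16 and to_bf16 are illustrative, not a real library API): round-tripping a large value through FP16 overflows, while the same value survives BF16 truncation with only a small relative error.

```python
import struct

def to_fp16(x: float) -> float:
    # Round-trip through IEEE binary16; values beyond ~65504 overflow.
    try:
        return struct.unpack("e", struct.pack("e", x))[0]
    except OverflowError:
        return float("inf")   # what FP16 hardware produces on overflow

def to_bf16(x: float) -> float:
    # BF16 keeps FP32's sign and 8-bit exponent but only 7 mantissa bits:
    # simulate it by truncating the low 16 bits of the FP32 encoding.
    (bits,) = struct.unpack("I", struct.pack("f", x))
    (truncated,) = struct.unpack("f", struct.pack("I", bits & 0xFFFF0000))
    return truncated
```

BF16 trades precision (fewer mantissa bits) for range, which is the right trade for gradients and activations whose magnitudes vary wildly.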
Learning Rate Schedules
The learning rate schedule controls how aggressively the model updates its weights during training. Getting this wrong can mean the difference between a working model and a diverged one.
Analogy — Marathon Training Schedule: Learning rate schedules mirror how marathon runners train. You start with easy warm-up runs (warmup phase), build to peak intensity mid-cycle (peak learning rate), then taper down before race day to let your body consolidate gains (cosine decay). Going too hard too early causes injury (divergence); never pushing hard enough means you plateau.
Most modern LLMs use a warmup of 1–5% of total steps followed by cosine decay to near zero. The peak learning rate is one of the most sensitive hyperparameters: typically 1–5×10⁻⁴, with the largest models sitting at the lower end of that range.
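The warmup-plus-cosine shape is only a few lines of code. A sketch (function name and defaults are illustrative):

```python
import math

def lr_schedule(step, total_steps, peak_lr, warmup_frac=0.03, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps       # linear ramp-up
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # 1 -> 0 over the decay phase
    return min_lr + (peak_lr - min_lr) * cosine
```

The ramp-up avoids large noisy updates while optimizer statistics are still warming up; the slow cosine tail lets the model settle into a flatter minimum.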
Instruction Tuning (SFT)
Pre-trained models can predict the next token, but they don't know how to follow instructions. Supervised fine-tuning (SFT) teaches the model to respond helpfully using curated instruction-response pairs.
Prompt: Explain quantum computing in simple terms.
Base model (before SFT): Quantum computing quantum mechanics quantum bits superposition entanglement quantum gates quantum algorithms quantum supremacy quantum error correction quantum decoherence quantum parallelism is a type of computation that harnesses quantum mechanical phenomena...
Quality over quantity: research from LIMA (2023) showed that fine-tuning on just 1,000 carefully curated examples can match or exceed models trained on 100K+ noisy instruction pairs. The key is diverse, high-quality demonstrations.
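A common implementation detail: the SFT loss is computed only on response tokens, with prompt positions masked out, conventionally with the -100 ignore label used by many frameworks. A sketch of the label construction (function name is illustrative):

```python
IGNORE_INDEX = -100  # label value most frameworks skip when computing cross-entropy

def build_sft_labels(prompt_ids, response_ids):
    """Labels for supervised fine-tuning: learn to produce the response,
    not to regenerate the prompt."""
    return [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
```

Masking the prompt keeps the gradient focused on the behavior being taught (answering) rather than on copying the instruction text.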
RLHF
Reinforcement Learning from Human Feedback aligns model behavior with human values and preferences. It bridges the gap between “can generate text” and “generates text humans actually want.”
Collect Human Preferences
Human annotators compare pairs of model responses to the same prompt and choose which one is better. These preference pairs form the training signal for the reward model.
Analogy — Sports Coaching: RLHF works like an athlete with a coach. The athlete (LLM) performs (generates responses), then the coach (reward model) scores each performance: “That answer was helpful and honest — 8/10” or “That was evasive and misleading — 2/10.” Through many rounds of perform → score → adjust, the athlete learns to consistently deliver what the coach values. The human annotators train the coach; the coach then scales to millions of evaluations.
This generate-score-update loop repeats for thousands of iterations until the model aligns with human preferences.
RLHF was the key innovation that made ChatGPT feel aligned compared to base GPT-3. Without it, models generate plausible-sounding but unhelpful, harmful, or untruthful responses. The reward model acts as a proxy for human judgment at scale.
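The reward model itself is typically trained with a Bradley-Terry pairwise loss on those preference pairs: push the chosen response's score above the rejected one's. A minimal sketch:

```python
import math

def reward_model_loss(r_chosen: float, r_rejected: float) -> float:
    # Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected).
    # Minimized by widening the margin between chosen and rejected rewards.
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))
```

Only the margin matters, not the absolute scores, which is why reward values from different training runs are not directly comparable.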
DPO & Alternatives
RLHF works, but it is complex and expensive. Direct Preference Optimization (DPO) and related alternatives simplify alignment: DPO trains directly on preference pairs, removing the separate reward model and the RL loop entirely.
RLHF pros: well-studied, flexible, strong alignment signal.
RLHF cons: complex pipeline, training instability, expensive (three models in memory: policy, reference, and reward model).
DPO has become the preferred alignment method for many open-source models (Zephyr, Mixtral-Instruct) because it drops the reward model and the RL loop (only the policy plus a frozen reference model are needed), cuts compute costs significantly, and produces comparable alignment quality.
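The DPO objective for a single preference pair is compact enough to sketch directly: scale the policy-versus-reference log-probability margin by β and apply a logistic loss (function signature is illustrative; real implementations work on batched sequence log-probabilities):

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair: increase the policy's margin on the
    chosen response relative to a frozen reference model, scaled by beta."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

The reference-model terms act as an implicit KL anchor: the policy is rewarded for preferring the chosen response more than the reference does, not for drifting arbitrarily far from it.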
Evaluation & Benchmarks
How do you know if a model is actually good? Benchmarks provide standardized tests, but no single metric captures the full picture. Understanding what each benchmark measures — and its limitations — is critical.
Perplexity measures how “surprised” a model is by text. Lower is better — a perplexity of 1 means the model perfectly predicts every token.
Perplexity is useful for comparing models on the same dataset but doesn't directly measure usefulness or alignment.
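Concretely, perplexity is the exponential of the average negative log-likelihood over tokens:

```python
import math

def perplexity(token_probs):
    """Perplexity = exp(mean negative log-likelihood).
    token_probs: the probability the model assigned to each actual token."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)
```

Intuitively, a perplexity of k means the model is, on average, as uncertain as if it were choosing uniformly among k tokens at each step.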
No single benchmark tells the whole story. MMLU tests knowledge but not reasoning depth. HumanEval tests code but only Python. Models increasingly overfit to public benchmarks — contamination is a growing concern in the evaluation community.
Quantization & Deployment
A trained model is useless if you can't serve it. Quantization reduces model precision to shrink memory footprint and increase throughput, enabling deployment on smaller and cheaper hardware.
INT8: near-lossless; the standard production precision.
Analogy — JPEG Quality Slider: Quantization is like the quality slider when you save a JPEG. At 100% quality (FP32), the image is pristine but the file is huge. At 80% (INT8), it's visually identical but half the size. At 50% (INT4), you notice slight artifacts in complex areas, but it's a quarter the size. For most uses, 80% is the sweet spot — and models behave the same way.
For most production deployments, INT8 quantization is the sweet spot — near-zero quality loss with 2× memory savings. INT4 (AWQ or GPTQ) is ideal when you need to fit larger models on smaller GPUs or serve on consumer hardware.
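The core of symmetric per-tensor INT8 quantization fits in a few lines (methods like AWQ and GPTQ are per-channel or per-group and far more sophisticated; this is only the basic idea, with illustrative function names):

```python
def quantize_int8(weights):
    """Symmetric per-tensor INT8: map [-max_abs, max_abs] onto [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]   # integer codes
    return q, scale

def dequantize(q, scale):
    # Recover approximate float weights; error is bounded by half a scale step.
    return [x * scale for x in q]
```

The single scale factor is the whole trade-off: a larger max_abs stretches the grid and coarsens every other weight, which is why outlier handling dominates modern quantization research.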