Why Data Is the Foundation
Every AI model is a reflection of its training data. Architecture innovations like transformers and attention mechanisms are critical, but none of them matter if the data underneath is flawed. The quality, diversity, and curation of training data determine the upper bound of what a model can learn.
Analogy — Building a House: Data is the foundation. You can hire the best architects (researchers) and use the finest materials (GPU clusters), but if the foundation is cracked (noisy, biased data), the house will lean and eventually collapse. The best architecture in the world cannot fix bad data.
Data Collection at Scale
Modern LLMs are trained on trillions of tokens sourced from the open web, books, scientific papers, code repositories, and curated datasets. The composition of this training mix directly shapes the model's strengths and blind spots.
LLaMA Training Mix
Analogy — Gathering Ingredients: Training data is like gathering ingredients for a complex recipe. You need variety (web, books, code, science), freshness (not just data from 2010), and quality (no rotten ingredients). A cake made only of flour won't rise — and a model trained only on web scrapes won't reason well about math.
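The idea of a weighted training mix can be sketched in a few lines. The proportions below approximate the pre-training mix reported for the original LLaMA model (roughly two-thirds CommonCrawl, plus C4, code, Wikipedia, books, and scientific text); treat them as illustrative sampling weights, and the `sample_source` helper as a hypothetical sketch rather than an actual training loader:

```python
import random

# Approximate pre-training mix reported for LLaMA 1; proportions are
# illustrative sampling weights, not an exact reproduction of the recipe.
TRAINING_MIX = {
    "CommonCrawl": 0.67,
    "C4": 0.15,
    "GitHub": 0.045,
    "Wikipedia": 0.045,
    "Books": 0.045,
    "ArXiv": 0.025,
    "StackExchange": 0.02,
}

def sample_source(rng: random.Random) -> str:
    """Pick the data source for the next training document, weighted by the mix."""
    sources = list(TRAINING_MIX)
    weights = list(TRAINING_MIX.values())
    return rng.choices(sources, weights=weights, k=1)[0]

rng = random.Random(0)
counts = {s: 0 for s in TRAINING_MIX}
for _ in range(10_000):
    counts[sample_source(rng)] += 1
# Web crawl data dominates the draws, mirroring its weight in the mix.
```

Changing these weights is the main lever for shaping a model's strengths: more code data improves programming ability, more scientific text improves technical reasoning.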
Data Cleaning & Preprocessing
Raw web data is overwhelmingly noisy. Only a small fraction of crawled content is suitable for training. The cleaning pipeline is a multi-stage funnel that discards 80–90% of the raw input, leaving behind a concentrated, high-quality corpus.
Deduplication
Exact and near-duplicate removal using MinHash/LSH. Duplicate documents cause memorization and inflate specific viewpoints.
Typical removal rate: 30–40% of raw data.
Example: identical news articles syndicated across 500+ domains → kept once.
Analogy — Panning for Gold: Data cleaning is like panning for gold in a river. You scoop up massive amounts of sediment (raw web data), then shake, filter, and wash away the dirt. What remains — the 10% that survives — is the concentrated value that makes training worthwhile.
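The MinHash comparison behind near-duplicate detection can be sketched as follows. This is a toy illustration with illustrative helper names; production pipelines add LSH banding on top of these signatures so that similar documents can be found without comparing every pair:

```python
import hashlib

def shingles(text: str, n: int = 3) -> set[str]:
    """Word n-grams ('shingles') used to compare documents for overlap."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def minhash_signature(doc: str, num_hashes: int = 64) -> list[int]:
    """MinHash: for each of num_hashes hash functions, keep the minimum hash
    over the document's shingles. Similar docs yield similar signatures."""
    shs = shingles(doc)
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16) for s in shs)
        for seed in range(num_hashes)
    ]

def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """The fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = "the quick brown fox jumps over the lazy dog near the river bank"
b = "the quick brown fox jumps over the lazy dog near the old river bank"
c = "completely unrelated text about training large language models at scale"
```

Documents whose estimated similarity exceeds a threshold (commonly around 0.8) are collapsed to a single copy.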
Annotation Fundamentals
Annotation is the process of attaching human-generated labels to data. These labels are the ground truth that supervised models learn from. The type of annotation — classification, span tagging, rating, or comparison — determines what the model can be trained to do.
Example item for sentiment annotation: "This product exceeded my expectations!"
Analogy — Grading Papers: Annotation is like teachers grading student papers. Every teacher (annotator) has their own rubric and interpretation. One teacher might mark "mostly correct" while another says "needs work." To build reliable training data, you need consistent rubrics, calibration sessions, and multiple graders per sample.
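A common way to turn multiple graders per sample into a single training label is majority voting. A minimal sketch, with hypothetical annotator labels for the review above and two invented items:

```python
from collections import Counter

# Hypothetical labels from three annotators per item; the second and third
# review texts are invented for illustration.
annotations = {
    "This product exceeded my expectations!": ["positive", "positive", "positive"],
    "It arrived on time, I suppose.": ["neutral", "positive", "neutral"],
    "Broke after one use.": ["negative", "negative", "negative"],
}

def majority_label(labels: list[str]) -> str:
    """Resolve disagreement by majority vote; ties would need adjudication."""
    return Counter(labels).most_common(1)[0][0]

gold = {text: majority_label(labels) for text, labels in annotations.items()}
```

Items where annotators split evenly are exactly the ones worth sending to an adjudication round or a guideline revision.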
Annotation for RLHF
Reinforcement Learning from Human Feedback (RLHF) depends on human raters comparing model outputs side by side. These preference annotations train a reward model that guides the policy toward generating responses humans actually prefer.
Example prompt for side-by-side comparison: "Explain quantum computing to a 10-year-old."
Analogy — Wine Tasting Competition: RLHF annotation is like a blind wine tasting. Judges (raters) sample two wines (responses) without knowing which vineyard (model version) produced them, then rank their preference. Over thousands of tastings, the collective judgment forms a reliable signal for what "good" looks like.
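Preference comparisons typically train the reward model with a Bradley-Terry style objective, as in InstructGPT-style RLHF: the loss is low when the reward model scores the human-preferred response higher. A minimal sketch with made-up reward scores:

```python
import math

def preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry loss for one preference pair:
    -log P(chosen beats rejected), where P = sigmoid(r_chosen - r_rejected)."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

# Toy scores from a hypothetical reward model for two candidate answers
# to the comparison prompt above.
loss_agrees = preference_loss(r_chosen=2.0, r_rejected=-1.0)   # matches raters
loss_disagrees = preference_loss(r_chosen=-1.0, r_rejected=2.0)  # contradicts raters
```

Minimizing this loss over thousands of rated pairs pushes the reward model to reproduce the raters' collective preferences.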
Inter-Annotator Agreement
When multiple annotators label the same data, they won't always agree. Cohen's Kappa (κ) measures how much agreement exceeds what we'd expect by pure chance. Low κ signals ambiguous guidelines, subjective tasks, or insufficient annotator training.
Analogy — Gymnastics Judges: Inter-annotator agreement is like scoring a gymnastics routine. All judges see the same performance, but their scores differ based on experience, focus, and interpretation of the rules. A routine with wildly divergent scores indicates ambiguity — either in the performance or the scoring criteria.
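Cohen's Kappa for two annotators can be computed directly from its definition: observed agreement, corrected for the agreement expected by chance. A minimal sketch with hypothetical labels:

```python
def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
    p_e is the agreement expected if both annotators labeled at random
    according to their own label frequencies."""
    n = len(labels_a)
    categories = set(labels_a) | set(labels_b)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    p_e = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    return (p_o - p_e) / (1 - p_e)

# Two annotators labeling the same 8 items; they disagree on one.
annotator_1 = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg"]
annotator_2 = ["pos", "pos", "neg", "pos", "pos", "neg", "pos", "neg"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

Here raw agreement is 7/8 = 0.875, but chance agreement is 0.5, so kappa works out to 0.75, which is why kappa is a stricter and more honest measure than raw agreement.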
Active Learning
Annotating data is expensive. Active learning lets the model request labels for the samples it's most uncertain about — maximizing the information gained per annotation dollar spent. Instead of labeling randomly, you focus human effort where it matters most. An example unlabeled pool for sentiment classification:
The movie was okay I guess
Absolutely stunning performance!
I didn't hate it
Worst experience ever
Not bad, not great
Life-changing product
It works as described
Overpriced garbage
Surprisingly decent
Meh, whatever
Revolutionary technology
Could be better
Perfect in every way
A total waste of time
Pretty interesting actually
I'm not sure how I feel
Exceeded expectations
Below average quality
It's complicated
Simply the best
Analogy — The Smart Student: Active learning is like a student who, instead of re-reading the textbook cover to cover, goes straight to the practice problems they got wrong on the last quiz. By focusing study time on the hardest questions first, they improve faster with less total effort.
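Uncertainty sampling, the simplest active-learning strategy, requests labels for the items whose predicted probabilities are closest to chance. A sketch using a few sentences from the pool above, with made-up classifier confidences:

```python
import math

def binary_entropy(p_positive: float) -> float:
    """Entropy (in bits) of the model's positive-class probability;
    highest at p = 0.5, where the model is maximally uncertain."""
    p = min(max(p_positive, 1e-12), 1 - 1e-12)
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

# Hypothetical model confidences P(positive) for sentences from the pool.
pool = {
    "Simply the best": 0.98,
    "Worst experience ever": 0.03,
    "The movie was okay I guess": 0.55,
    "Not bad, not great": 0.48,
    "Life-changing product": 0.95,
}

def select_for_labeling(pool: dict[str, float], k: int = 2) -> list[str]:
    """Uncertainty sampling: send the k highest-entropy items to annotators."""
    return sorted(pool, key=lambda s: binary_entropy(pool[s]), reverse=True)[:k]
```

The confidently classified sentences ("Simply the best", "Worst experience ever") are skipped; the ambiguous ones go to human annotators, where each label moves the decision boundary the most.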
Synthetic Data & Augmentation
When real annotated data is scarce or expensive, LLMs can generate synthetic training examples. Self-Instruct, Evol-Instruct, back-translation, and constitutional self-play are core techniques. Synthetic data is powerful but requires careful quality control to avoid amplifying model artifacts.
An LLM generates new instruction-response pairs from a small seed set. The model creates both the question and the answer, which are then filtered for quality.
Seed instruction: "Explain photosynthesis briefly."
Generated variants:
"Describe the process by which plants convert sunlight to energy in 2 sentences."
"How do green plants make food from sunlight? Give a simple explanation."
"Summarize the chemical process of photosynthesis for a middle school student."
Analogy — AI Tutor Practice Exams: Synthetic data is like practice exams written by an AI tutor. They're useful for drilling concepts and expanding coverage, but they're not the real exam. If the tutor has misconceptions, those will show up in the practice problems. Always validate synthetic data against ground truth.
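Self-Instruct-style pipelines filter generated instructions for novelty before adding them to the pool; the original work uses a ROUGE-L threshold against existing instructions. The word-overlap stand-in below is a simplification of that filter, with illustrative function names:

```python
def word_overlap(a: str, b: str) -> float:
    """Crude Jaccard word overlap, standing in for the ROUGE-L similarity
    used by Self-Instruct to reject near-duplicate instructions."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def filter_generated(pool: list[str], candidates: list[str],
                     max_overlap: float = 0.7) -> list[str]:
    """Keep a generated instruction only if it is sufficiently novel
    relative to everything already accepted into the pool."""
    kept = list(pool)
    for cand in candidates:
        if all(word_overlap(cand, existing) < max_overlap for existing in kept):
            kept.append(cand)
    return kept

seed = ["Explain photosynthesis briefly."]
generated = [
    "Explain photosynthesis briefly.",               # exact duplicate: rejected
    "How do green plants make food from sunlight?",  # novel phrasing: kept
]
result = filter_generated(seed, generated)
```

Without a novelty filter of this kind, the generation loop collapses toward rephrasing its own outputs, which is exactly the model-artifact amplification the section warns about.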
Data Bias & Fairness
Training data encodes the biases of its sources. Representation imbalances, labeling inconsistencies, selection effects, and temporal skew can all cause models to produce unfair, inaccurate, or harmful outputs for specific groups or contexts.
Representation Bias
Some demographics, languages, or viewpoints are over- or under-represented in the training data.
Analogy — Biased Jury Selection: Data bias is like selecting a jury that doesn't represent the community. If your jury pool (training data) over-represents one group and excludes others, the verdict (model output) may be systematically unfair — even if each juror acts in good faith. Fair AI starts with representative data.
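One concrete check for representation bias is comparing each group's share of the dataset against a reference distribution. A sketch with hypothetical language shares; the function name and numbers are illustrative:

```python
from collections import Counter

def representation_gap(samples: list[str],
                       reference: dict[str, float]) -> dict[str, float]:
    """Difference between each group's share of the dataset and its share
    in a reference population; positive means over-represented."""
    n = len(samples)
    observed = Counter(samples)
    return {g: observed.get(g, 0) / n - p for g, p in reference.items()}

# Hypothetical: document languages in a 100-document crawl sample,
# compared against made-up reference shares of the target user base.
corpus_langs = ["en"] * 90 + ["es"] * 6 + ["hi"] * 4
reference = {"en": 0.20, "es": 0.10, "hi": 0.10}
gaps = representation_gap(corpus_langs, reference)
```

Auditing these gaps per language, dialect, or demographic group is a prerequisite for deciding whether to re-weight, up-sample, or collect more data.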
Data Governance & Documentation
Responsible AI requires knowing what's inside your training data — where it came from, who labeled it, what licenses govern it, and how it changes over time. Data governance transforms opaque datasets into accountable, auditable assets.
Datasheets for Datasets
A standardized documentation template (introduced by Gebru et al. in 2018 and published in Communications of the ACM in 2021) that answers: Who created this dataset? What data does it contain? How was it collected? What are its limitations?
Analogy — Nutrition Labels on Food: Data governance is the nutrition label of AI. Just as food packaging must disclose ingredients, allergens, and nutritional content, datasets and models need documentation that tells you what's inside. Without it, you're feeding your model mystery ingredients — and hoping nothing goes wrong.
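A datasheet can be kept machine-readable so that missing sections are caught automatically. A minimal sketch, loosely following the question categories of Datasheets for Datasets; the field names and every value below are illustrative, not part of the published template:

```python
# Hypothetical datasheet for an invented fine-tuning dataset.
DATASHEET = {
    "motivation": "Instruction-following fine-tuning for a support chatbot.",
    "composition": {"documents": 120_000, "languages": ["en", "es"]},
    "collection": "Opt-in customer transcripts, PII removed before storage.",
    "labeling": "3 annotators per sample; majority vote; kappa reported.",
    "licensing": "Internal use only; sources governed by customer ToS.",
    "known_limitations": ["English-heavy", "no voice transcripts"],
}

def missing_fields(sheet: dict) -> list[str]:
    """Flag any required datasheet section that is absent or empty."""
    required = ["motivation", "composition", "collection",
                "labeling", "licensing", "known_limitations"]
    return [f for f in required if not sheet.get(f)]
```

Running such a check in CI turns documentation from a one-time artifact into an enforced part of the data pipeline.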