UNDER THE HOOD

Tokenization

The critical first step in every LLM pipeline — converting raw text into the numeric sequences that neural networks can process. From BPE to byte-level fallbacks, this is the interactive guide to how modern tokenizers work and why their design matters.

20 min read · By Hammad Abbasi · Interactive
01

Why Tokenize?

Neural networks operate on numbers, not text. Every string must be converted into a sequence of numeric IDs before a model can process it. The way we split text into pieces — the tokenization strategy — has profound effects on what the model can learn and how efficiently it does so.

tokenization
Input text: "Hello world"
Character codes: H→72, e→101, l→108, l→108, o→111, (space)→32, w→119, o→111, r→114, l→108, d→100
Tokens: 11 · Strategy: Character Codes

Analogy — LEGO Bricks: Tokenization is like building with LEGO. Character-level is like using only 1×1 bricks — you can build anything, but it takes forever and the instructions are enormous. Word-level is like using pre-built walls — fast to assemble, but you can't build anything the kit didn't include. Subword tokenization is the sweet spot: a mix of small and large bricks that lets you build any structure efficiently.

The tokenizer is a fixed preprocessing step — it runs before the neural network sees anything. A bad tokenizer can cripple even the most powerful model. Getting it right is one of the first critical decisions in LLM design.
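The character-code strategy from the demo above can be sketched in a few lines: every character maps to its Unicode code point, so the "vocabulary" is tiny but every string becomes a long ID sequence.

```python
def char_tokenize(text: str) -> list[int]:
    # Character-level "tokenization": one numeric ID (the Unicode
    # code point) per character. No vocabulary to train, but sequences
    # are as long as the raw text itself.
    return [ord(c) for c in text]

ids = char_tokenize("Hello world")
print(ids)       # [72, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100]
print(len(ids))  # 11 tokens for 11 characters
```

Real tokenizers replace `ord` with a learned vocabulary lookup, but the contract is identical: string in, list of integer IDs out.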

02

Character vs Word vs Subword

Three fundamental approaches to splitting text into tokens. Each trades off vocabulary size, sequence length, and the ability to handle words never seen during training.

comparison
Property            | Character             | Word                 | Subword
Vocabulary Size     | ~256 (bytes/chars)    | 100K–500K+           | 32K–100K
Sequence Length     | Very long             | Short                | Moderate
OOV Handling        | None needed           | Poor (UNK token)     | Graceful (splits unseen words)
Semantic Meaning    | None per token        | Full per token       | Partial (morpheme-level)
Training Efficiency | Slow (long sequences) | Fast but brittle     | Balanced
Used By             | ByT5, some char-CNNs  | Early NLP (pre-2016) | GPT, BERT, LLaMA, Claude
"unhappiness" → subword
Tokens: un [348] · happiness [58399]
Sequence length: 2

Subword tokenization is the dominant approach in modern LLMs because it elegantly balances vocabulary size with sequence length. It can represent any word — even completely novel ones — by composing it from smaller, known pieces.

Animated — Same Text, Three Strategies
Input: “unbelievable” → character-level: u·n·b·e·l·i·e·v·a·b·l·e
Tokens: 12 · Maximum granularity, very long sequences
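The three strategies can be sketched side by side. This is a toy illustration, not any production tokenizer: the word and subword vocabularies are made up for the example, and the subword splitter uses simple greedy longest-match.

```python
def char_tokens(text: str) -> list[str]:
    # Character-level: maximum granularity, longest sequences.
    return list(text)

def word_tokens(text: str, vocab: set[str]) -> list[str]:
    # Word-level: anything outside the vocabulary becomes <UNK>.
    return [w if w in vocab else "<UNK>" for w in text.split()]

def subword_tokens(text: str, vocab: set[str]) -> list[str]:
    # Subword: greedily take the longest vocabulary piece at each position,
    # falling back to single characters when nothing matches.
    pieces, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                pieces.append(text[i:j])
                i = j
                break
        else:
            pieces.append(text[i])
            i += 1
    return pieces

print(char_tokens("unbelievable"))                              # 12 tokens
print(subword_tokens("unbelievable", {"un", "believ", "able"})) # ['un', 'believ', 'able']
```

The subword version represents the unseen word "unbelievable" with three known pieces, which is exactly the property that makes this family of tokenizers dominant.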
03

Byte-Pair Encoding (BPE)

BPE starts with individual characters and iteratively merges the most frequent adjacent pair into a new token. This simple, greedy algorithm builds a vocabulary that naturally captures common subwords and morphemes.

interactive bpe merge
Training corpus (word × frequency): low ×5 · lower ×2 · newest ×6 · widest ×3
Starting vocabulary (10 tokens): l, o, w, e, r, n, s, t, i, d

Analogy — Learning a Language: BPE learns to tokenize the same way a child learns to read. First you know individual letters (characters). Then you notice “th” always appears together, so you start recognizing it as a unit. Then “the” becomes a single chunk. Eventually common words become instant recognition, while rare words you still sound out letter by letter. BPE's merge rules are exactly this process, automated.

GPT-2, GPT-3, and GPT-4 all use BPE. The algorithm is deterministic — given the same training corpus, you always get the same merge rules. Those rules are saved and applied identically at inference time.
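The merge loop is short enough to write out in full. This is a minimal sketch of the classic training algorithm on the demo corpus above, with words stored as space-separated symbols; real implementations track merges more efficiently and handle pre-tokenization.

```python
from collections import Counter

def get_pair_counts(corpus: dict[str, int]) -> Counter:
    # Count every adjacent symbol pair, weighted by word frequency.
    counts = Counter()
    for word, freq in corpus.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts

def merge_pair(corpus: dict[str, int], pair: tuple[str, str]) -> dict[str, int]:
    # Replace every occurrence of the pair with the merged symbol.
    a, b = pair
    return {word.replace(f"{a} {b}", f"{a}{b}"): freq for word, freq in corpus.items()}

def bpe_train(corpus: dict[str, int], num_merges: int):
    merges = []
    for _ in range(num_merges):
        counts = get_pair_counts(corpus)
        if not counts:
            break
        best = max(counts, key=counts.get)  # greedy: most frequent pair wins
        merges.append(best)
        corpus = merge_pair(corpus, best)
    return corpus, merges

corpus = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
corpus, merges = bpe_train(corpus, 3)
print(merges)  # [('e', 's'), ('es', 't'), ('l', 'o')]
print(corpus)
```

After three merges, "es" and "est" have become single tokens, capturing the shared suffix of "newest" and "widest", which is precisely the morpheme-learning behavior described above.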

04

SentencePiece & WordPiece

BPE is not the only game in town. SentencePiece treats text as a raw character stream (no language-specific pre-tokenization; whitespace is just another symbol), and offers both BPE and Unigram algorithms. WordPiece, used by BERT, takes a likelihood-based approach.

BPE (SentencePiece variant)

Method

Bottom-up merges

How It Works

Start with characters, greedily merge the most frequent pair at each step. Same core idea as standard BPE, but applied directly on the raw text stream (whitespace treated as an ordinary symbol) without pre-tokenization.

Pre-Tokenization

Not needed — operates on the raw text stream

Used By

LLaMA · GPT-NeoX · Mistral · T5 (optionally)

Pros

Simple, deterministic, well-understood

Cons

Greedy — may miss globally optimal segmentations

SentencePiece's key innovation is treating the input as a raw byte stream with no language-specific pre-processing. This makes it truly language-agnostic — the same algorithm handles English, Chinese, Arabic, and code without any special rules.
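WordPiece's inference-time segmentation is worth seeing concretely: it uses greedy longest-match-first, with a "##" prefix marking pieces that continue a word. The sketch below follows that scheme with a made-up toy vocabulary; BERT's real vocabulary and training procedure (which selects merges by likelihood gain, not raw frequency) are more involved.

```python
def wordpiece(word: str, vocab: set[str]) -> list[str]:
    # Greedy longest-match-first segmentation, BERT-style.
    pieces, start = [], 0
    while start < len(word):
        end, cur = len(word), None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                cur = piece
                break
            end -= 1  # shrink the candidate until it matches
        if cur is None:
            return ["[UNK]"]  # no valid segmentation for this word
        pieces.append(cur)
        start = end
    return pieces

vocab = {"un", "##aff", "##able", "##happiness"}  # toy vocabulary, illustrative only
print(wordpiece("unhappiness", vocab))  # ['un', '##happiness']
```

Note the failure mode: if any stretch of the word has no matching piece, the whole word collapses to [UNK], which is why real WordPiece vocabularies always include every single character.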

05

Vocabulary Size Tradeoffs

How big should the vocabulary be? It's a core architectural decision — larger vocabularies mean shorter sequences but bigger embedding tables. There is no universal right answer; it depends on your compute budget, target languages, and context window goals.

vocabulary size explorer (snapshot at 32,000 tokens; slider range 8K–100K)
Sequence Length: ~1.0× (baseline)
Embedding Params: ~98M (at d=3072)
Rare Word Handling: Good — most words covered
Used By: LLaMA 1 & 2 (32K)
The sweet spot used by many open-source models. Enough tokens to cover common English words whole, while keeping the embedding table manageable.

embedding table size & relative sequence length
Vocab | Embedding Params (d=3072) | Relative Sequence Length
8K    | ~25M                      | ~1.8× longer
32K   | ~98M                      | ~1.0× (baseline)
50K   | ~154M                     | ~0.9×
100K  | ~307M                     | ~0.75×

Analogy — Morse Code Efficiency: Vocabulary size tradeoffs mirror the design of Morse code. The most common letter in English (“E”) gets the shortest code: a single dot. Rare letters like “Q” get longer codes. BPE does the same thing: common words become single tokens (short codes), while rare words are split into pieces (longer codes). The goal is the same — minimize total transmission length.

The embedding table is the first layer of the model — it converts token IDs to dense vectors. With a 100K vocabulary and 3072-dim embeddings, that's 307M parameters just for the lookup table. At GPT-4 scale (~1.8T params), this is a small fraction. At 7B scale, it's significant.
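The embedding-table figures above are just a multiplication, which makes the tradeoff easy to explore for any configuration:

```python
def embedding_params(vocab_size: int, d_model: int) -> int:
    # The embedding table stores one d_model-dimensional vector per token ID.
    return vocab_size * d_model

for v in (8_000, 32_000, 50_000, 100_000):
    print(f"{v:>7} tokens -> {embedding_params(v, 3072) / 1e6:.0f}M params")
```

Most decoder-only models also tie or mirror this table in the output projection, so doubling the vocabulary roughly doubles this cost twice over.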

06

The Tokenizer in Action

See how a real tokenizer breaks text into subword pieces. Each colored span represents one token — notice how common words stay whole, while rare or compound words are split into meaningful subparts.

tokenizer output
Input text (with token boundaries): "Hello, world!"
Token chips: Hello [9906] · , [11] · ␣world [1917] · ! [0]
Characters: 13 · Tokens: 4 · Compression: 3.3× chars/token

On average, one GPT-4 token is about 4 characters of English text. But this ratio varies wildly: common words like "the" are single tokens, while rare technical terms may be split into 3–5 pieces. Whitespace is typically attached to the following token (the "Ġ" prefix convention).
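The compression figure in the demo is simply characters divided by tokens. Using the token split shown above (the exact strings are taken from the demo, including the leading space on " world"):

```python
def compression(text: str, tokens: list[str]) -> float:
    # Average number of characters carried per token.
    return len(text) / len(tokens)

tokens = ["Hello", ",", " world", "!"]  # split from the demo above
text = "".join(tokens)
print(len(text), len(tokens))                 # 13 characters, 4 tokens
print(f"{compression(text, tokens):.2f}")     # 3.25 chars/token
```

Tracking this ratio on your own prompts is a quick way to estimate token costs before calling an API.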

07

Multilingual Challenges

The same semantic content costs a different number of tokens depending on the language. This "fertility rate" — tokens per word — means non-English users consume their context window faster and pay more per API call.

English tokenization
Original text

Machine learning is transforming the world

Tokens (7)
Machine learning is transforming the world
Tokens: 7 · Fertility: 1.17× tokens/word

English is the dominant training language — most words are single tokens.

fertility comparison
🇺🇸 English
1.17×
🇨🇳 Chinese
2.33×
🇸🇦 Arabic
2.5×
🇰🇷 Korean
1.83×

This is a fairness issue. A 128K context window holds ~100K words of English but only ~40K–50K words of Chinese or Arabic. API pricing per token also means non-English users effectively pay 2–3× more for the same semantic content. Training on more multilingual data and expanding the vocabulary can reduce but not eliminate this gap.
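The context-window arithmetic follows directly from the fertility figures above. A rough sketch (the per-language fertility values are the ones from the comparison, and real numbers vary by tokenizer and text domain):

```python
def effective_context_words(context_tokens: int, fertility: float) -> int:
    # fertility = average tokens per word for a given language/tokenizer pair.
    # Higher fertility means fewer words fit in the same token budget.
    return int(context_tokens / fertility)

fertility = {"English": 1.17, "Chinese": 2.33, "Arabic": 2.5, "Korean": 1.83}
for lang, f in fertility.items():
    words = effective_context_words(128_000, f)
    print(f"{lang:>8}: ~{words:,} words in a 128K window")
```

The same division explains the pricing gap: at a fixed price per token, a language with 2× fertility costs 2× as much per word of content.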

08

Special Tokens

Beyond content tokens, every model uses special control tokens that carry no semantic meaning but are critical for the model to understand structure — where turns begin and end, what role is speaking, and when to stop generating.

chat template with special tokens
A complete chat turn — special tokens in color, content in white
<|bos|><|system|>You are a helpful assistant.<|user|>What is tokenization?<|assistant|>Tokenization is the process of...<|eos|>

The model never "sees" the chat template as text — these special tokens are injected by the tokenizer based on the chat format. Different models use different templates (ChatML, Llama-style, etc.).

Special tokens are added to the vocabulary during training but never appear in normal text. They are the control plane of the model — analogous to HTTP headers vs. body. If you corrupt the special token structure, the model's behavior becomes unpredictable.
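Applying a chat template is, mechanically, just string assembly before tokenization. A minimal sketch in the style of the template shown above (the `<|...|>` token names mirror that example and are illustrative; every real model defines its own template, applied via its tokenizer):

```python
BOS, EOS = "<|bos|>", "<|eos|>"  # illustrative special-token names

def render_chat(messages: list[dict]) -> str:
    # Wrap each turn in a role marker, then bracket the whole
    # conversation with begin/end-of-sequence tokens.
    parts = [BOS]
    for m in messages:
        parts.append(f"<|{m['role']}|>{m['content']}")
    parts.append(EOS)
    return "".join(parts)

print(render_chat([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is tokenization?"},
]))
```

In practice the tokenizer maps each special marker to a single reserved ID rather than tokenizing it as text, which is what keeps the control plane separate from user content.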

09

Tokenization Artifacts

Tokenization introduces subtle failure modes. Because the model operates on tokens — not characters or words — certain patterns break in unintuitive ways. Understanding these artifacts is key to writing effective prompts and debugging unexpected model behavior.

artifact: arithmetic failures
Input: "123 + 456 = ?"
Tokens: 123 · + · 456 · = · ?

"123" is a single token, so the model doesn't see individual digits. It can't do digit-by-digit addition the way humans do. Multi-digit arithmetic is especially hard because the model has no access to place values within a merged number token.
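To make the contrast concrete, here is what digit-level access enables: the standard right-to-left addition with carries. A model that sees "123" as one opaque token ID has no direct view of these per-digit steps.

```python
def add_digitwise(a: str, b: str) -> str:
    # Add two non-negative numbers digit by digit, right to left,
    # tracking the carry -- this requires access to individual digits
    # and their place values, which a merged number token hides.
    result, carry = [], 0
    for da, db in zip(reversed(a.zfill(len(b))), reversed(b.zfill(len(a)))):
        s = int(da) + int(db) + carry
        result.append(str(s % 10))
        carry = s // 10
    if carry:
        result.append(str(carry))
    return "".join(reversed(result))

print(add_digitwise("123", "456"))  # '579'
```

This is also why prompting tricks like inserting spaces between digits ("1 2 3 + 4 5 6") can help: they force the tokenizer to emit one token per digit.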

Analogy — Jigsaw Puzzle: Tokenization artifacts are like a jigsaw puzzle cut in the wrong places. If you cut through the middle of a face, each piece shows half a nose — neither piece makes sense alone. When a tokenizer splits “strawberry” into “str” + “awberry”, the model can't easily count the r's because the letters are scattered across different pieces. The model has to learn to “see through” the cuts — and sometimes it can't.

Many "LLM failures" are actually tokenization failures. When a model struggles with counting characters, reversing strings, or basic arithmetic, the root cause is often that the tokenizer has merged the characters into opaque chunks that hide the internal structure.

10

Byte-Level Fallback

Modern tokenizers guarantee they can encode any input — even characters never seen in training — by falling back to raw UTF-8 byte tokens. This "byte-level fallback" is the safety net that ensures no input is ever rejected.

tiktoken (gpt-4)

Byte-Level BPE

Operates on UTF-8 bytes from the start. The base vocabulary includes all 256 possible byte values, then BPE merges are applied on top. Any byte sequence is representable.

GPT-4 · GPT-3.5 · Claude 3
sentencepiece

Byte Fallback Mode

Normally operates on Unicode codepoints. When a character isn't in the vocabulary, SentencePiece can decompose it into UTF-8 bytes as special <0xNN> tokens. Enabled via the --byte_fallback flag.

LLaMA · Mistral · T5
byte-level encoding
Input: 🤖🧠 · Method: Byte-level BPE
Byte tokens (8): <0xF0> <0x9F> <0xA4> <0x96> <0xF0> <0x9F> <0xA7> <0xA0>

Emojis are encoded as their raw UTF-8 bytes. Each emoji is 4 bytes, so 2 emojis = 8 byte tokens. The model can still process them because the byte tokens have trained embeddings.

Byte-level fallback is why modern tokenizers are "closed under UTF-8" — there is no possible input string that cannot be tokenized. This was not true of earlier word-level tokenizers that would produce an UNK (unknown) token for any word outside their vocabulary.
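The byte decomposition in the demo is easy to reproduce: encode to UTF-8 and render each byte in the `<0xNN>` notation SentencePiece uses for its fallback tokens.

```python
def byte_fallback_tokens(text: str) -> list[str]:
    # Encode to UTF-8 and render each byte as a <0xNN> token,
    # mirroring SentencePiece's byte-fallback representation.
    return [f"<0x{b:02X}>" for b in text.encode("utf-8")]

print(byte_fallback_tokens("🤖🧠"))
# ['<0xF0>', '<0x9F>', '<0xA4>', '<0x96>', '<0xF0>', '<0x9F>', '<0xA7>', '<0xA0>']
```

Because every possible byte value 0x00–0xFF has a token, any string in any script can be encoded this way, at the cost of longer sequences for text outside the learned vocabulary.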

Get In Touch