Why Tokenize?
Neural networks operate on numbers, not text. Every string must be converted into a sequence of numeric IDs before a model can process it. The way we split text into pieces — the tokenization strategy — has profound effects on what the model can learn and how efficiently it does so.
"Hello world"
Analogy — LEGO Bricks: Tokenization is like building with LEGO. Character-level is like using only 1×1 bricks — you can build anything, but it takes forever and the instructions are enormous. Word-level is like using pre-built walls — fast to assemble, but you can't build anything the kit didn't include. Subword tokenization is the sweet spot: a mix of small and large bricks that lets you build any structure efficiently.
The tokenizer is a fixed preprocessing step — it runs before the neural network sees anything. A bad tokenizer can cripple even the most powerful model. Getting it right is one of the first critical decisions in LLM design.
Character vs Word vs Subword
Three fundamental approaches to splitting text into tokens. Each trades off vocabulary size, sequence length, and the ability to handle words never seen during training.
| Property | Character | Word | Subword |
|---|---|---|---|
| Vocabulary Size | ~256 (bytes/chars) | 100K–500K+ | 32K–100K |
| Sequence Length | Very long | Short | Moderate |
| OOV Handling | None needed | Poor (UNK token) | Graceful (splits unseen words) |
| Semantic Meaning | None per token | Full per token | Partial morpheme meaning |
| Training Efficiency | Slow (long sequences) | Fast but brittle | Balanced |
| Used By | ByT5, some char-CNNs | Early NLP (pre-2016) | GPT, BERT, LLaMA, Claude |
Subword tokenization is the dominant approach in modern LLMs because it elegantly balances vocabulary size with sequence length. It can represent any word — even completely novel ones — by composing it from smaller, known pieces.
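The three strategies can be contrasted on a single word. The subword split below is invented for illustration and is not the output of any real tokenizer:

```python
# Illustrative comparison: the same word under three tokenization
# strategies. The subword split is a hypothetical example.

word = "unbelievable"

# Character-level: one token per character -> tiny vocab, long sequences.
char_tokens = list(word)

# Word-level: one token per word -> huge vocab, fails on unseen words.
word_tokens = [word]  # works only if the whole word is in the vocabulary

# Subword: compose rare words from common, known pieces.
subword_tokens = ["un", "believ", "able"]

print(len(char_tokens), len(word_tokens), len(subword_tokens))  # 12 1 3
```

The sequence-length tradeoff in the table shows up directly: 12 tokens vs. 1 vs. 3 for the same input.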
Byte-Pair Encoding (BPE)
BPE starts with individual characters and iteratively merges the most frequent adjacent pair into a new token. This simple, greedy algorithm builds a vocabulary that naturally captures common subwords and morphemes.
Analogy — Learning a Language: BPE learns to tokenize the same way a child learns to read. First you know individual letters (characters). Then you notice “th” always appears together, so you start recognizing it as a unit. Then “the” becomes a single chunk. Eventually common words become instant recognition, while rare words you still sound out letter by letter. BPE's merge rules are exactly this process, automated.
GPT-2, GPT-3, and GPT-4 all use BPE. The algorithm is deterministic — given the same training corpus, you always get the same merge rules. Those rules are saved and applied identically at inference time.
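The merge loop can be sketched in a few lines. This is a toy trainer that captures the greedy algorithm described above, not the production GPT-2 implementation (which also does byte-level pre-tokenization and weights words by frequency):

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Toy BPE trainer: corpus is a list of words; returns merge rules."""
    # Start with each word as a tuple of single characters.
    words = [tuple(w) for w in corpus]
    merges = []
    for _ in range(num_merges):
        # Count every adjacent token pair across the corpus.
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair
        merges.append(best)
        # Replace every occurrence of the best pair with the merged token.
        merged_words = []
        for w in words:
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                    out.append(w[i] + w[i + 1])
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            merged_words.append(tuple(out))
        words = merged_words
    return merges

print(train_bpe(["the", "then", "there", "this"], 3))
# → [('t', 'h'), ('th', 'e'), ('the', 'n')]
```

Note how the learned merges mirror the analogy above: "th" is merged first because it is the most frequent pair, then "the" becomes a single unit.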
SentencePiece & WordPiece
BPE is not the only game in town. SentencePiece treats text as a raw byte stream (no pre-tokenization needed), and offers both BPE and Unigram algorithms. WordPiece, used by BERT, takes a likelihood-based approach.
SentencePiece (BPE mode)
- Method: Bottom-up merges
- How It Works: Start with characters, greedily merge the most frequent pair at each step. Same core idea as standard BPE, but applied directly on raw text (byte-level) without pre-tokenization.
- Pre-Tokenization: Not needed — operates on raw byte stream
- Used By: LLaMA 1 & 2
- Pros: Simple, deterministic, well-understood
- Cons: Greedy — may miss globally optimal segmentations
SentencePiece's key innovation is treating the input as a raw byte stream with no language-specific pre-processing. This makes it truly language-agnostic — the same algorithm handles English, Chinese, Arabic, and code without any special rules.
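WordPiece's inference step is worth seeing concretely: it segments each word greedily, always taking the longest vocabulary piece that matches, with "##" marking continuation pieces. This sketch covers inference only, not the likelihood-based training mentioned above:

```python
def wordpiece_encode(word, vocab):
    """Greedy longest-match-first segmentation, WordPiece-style.
    Continuation pieces carry the '##' prefix."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest possible substring first, then shrink.
        while end > start:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub  # mid-word pieces are prefixed
            if sub in vocab:
                piece = sub
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matches: fall back to UNK
        tokens.append(piece)
        start = end
    return tokens

vocab = {"un", "##break", "##able", "break", "##a", "##b"}
print(wordpiece_encode("unbreakable", vocab))
# → ['un', '##break', '##able']
```

The "[UNK]" fallback is exactly the failure mode that byte-level methods eliminate, as discussed later in this section.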
Vocabulary Size Tradeoffs
How big should the vocabulary be? It's a core architectural decision — larger vocabularies mean shorter sequences but bigger embedding tables. There is no universal right answer; it depends on your compute budget, target languages, and context window goals.
Example — 32K vocabulary:
- Sequence length: ~1.0× (baseline)
- Embedding parameters: ~98M (at d=3072)
- Coverage: Good — most words covered
- Used by: LLaMA 1 & 2 (32K)
The sweet spot used by many open-source models. Enough tokens to cover common English words whole, while keeping the embedding table manageable.
Analogy — Morse Code Efficiency: Vocabulary size tradeoffs mirror the design of Morse code. The most common letter in English (“E”) gets the shortest code: a single dot. Rare letters like “Q” get longer codes. BPE does the same thing: common words become single tokens (short codes), while rare words are split into pieces (longer codes). The goal is the same — minimize total transmission length.
The embedding table is the first layer of the model — it converts token IDs to dense vectors. With a 100K vocabulary and 3072-dim embeddings, that's 307M parameters just for the lookup table. At GPT-4 scale (~1.8T params), this is a small fraction. At 7B scale, it's significant.
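The arithmetic behind those figures is straightforward. The d=3072 dimension and the 7B comparison follow the numbers in the text; note that some models tie input and output embeddings, which halves the cost:

```python
def embedding_params(vocab_size, d_model):
    # The input embedding table is a vocab_size x d_model matrix.
    return vocab_size * d_model

# Figures from the text: 100K vocabulary at d=3072.
print(f"{embedding_params(100_000, 3072):,}")  # 307,200,000

# Share of a 7B-parameter model (input embedding only, untied).
share = embedding_params(100_000, 3072) / 7e9
print(f"{share:.1%}")  # 4.4%
```

At 7B scale, more than 4% of all parameters would sit in the lookup table, which is why smaller open models tend toward 32K vocabularies.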
The Tokenizer in Action
See how a real tokenizer breaks text into subword pieces. Each colored span represents one token — notice how common words stay whole, while rare or compound words are split into meaningful subparts.
On average, one GPT-4 token is about 4 characters of English text. But this ratio varies wildly: common words like "the" are single tokens, while rare technical terms may be split into 3–5 pieces. Whitespace is typically attached to the following token (the "Ġ" prefix convention).
Multilingual Challenges
The same semantic content costs a different number of tokens depending on the language. This "fertility rate" — tokens per word — means non-English users consume their context window faster and pay more per API call.
Take the sentence "Machine learning is transforming the world". English is the dominant training language, so most of its words are single tokens.
This is a fairness issue. A 128K context window holds ~100K words of English but only ~40K–50K words of Chinese or Arabic. API pricing per token also means non-English users effectively pay 2–3× more for the same semantic content. Training on more multilingual data and expanding the vocabulary can reduce but not eliminate this gap.
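The context-window arithmetic can be made explicit. The fertility rates below are assumptions chosen to match the rough figures in the text; real values vary by tokenizer and corpus:

```python
def effective_words(context_tokens, tokens_per_word):
    """How many words of a language fit in a fixed context window,
    given that language's fertility rate (tokens per word)."""
    return int(context_tokens / tokens_per_word)

CONTEXT = 128_000

# Illustrative fertility rates (assumed, not measured).
print(effective_words(CONTEXT, 1.3))  # English: ~98K words
print(effective_words(CONTEXT, 2.8))  # higher-fertility language: ~45K words
```

The same 128K-token window holds roughly half as many words for a language the tokenizer handles poorly, and per-token API pricing scales the cost gap the same way.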
Special Tokens
Beyond content tokens, every model uses special control tokens that carry no semantic meaning but are critical for the model to understand structure — where turns begin and end, what role is speaking, and when to stop generating.
The model never "sees" the chat template as text — these special tokens are injected by the tokenizer based on the chat format. Different models use different templates (ChatML, Llama-style, etc.).
Special tokens are added to the vocabulary during training but never appear in normal text. They are the control plane of the model — analogous to HTTP headers vs. body. If you corrupt the special token structure, the model's behavior becomes unpredictable.
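A chat template is just a deterministic string-assembly step that wraps each message in control tokens. This sketch uses the ChatML-style markers (`<|im_start|>`, `<|im_end|>`); other models use different delimiters, and real tokenizers emit these as single special token IDs rather than text:

```python
# Sketch of a ChatML-style chat template applied before tokenization.

def apply_chat_template(messages):
    parts = []
    for m in messages:
        # Each turn is wrapped in start/end control markers with its role.
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    # Open an assistant turn to cue the model to respond.
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)

msgs = [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "Hi!"},
]
print(apply_chat_template(msgs))
```

Generation then stops when the model emits the `<|im_end|>` token, which is why corrupting the template breaks turn boundaries.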
Tokenization Artifacts
Tokenization introduces subtle failure modes. Because the model operates on tokens — not characters or words — certain patterns break in unintuitive ways. Understanding these artifacts is key to writing effective prompts and debugging unexpected model behavior.
Consider the prompt "123 + 456 = ?". "123" is often a single token, so the model doesn't see individual digits. It can't do digit-by-digit addition the way humans do. Multi-digit arithmetic is especially hard because the model has no access to place values within a merged number token.
Analogy — Jigsaw Puzzle: Tokenization artifacts are like a jigsaw puzzle cut in the wrong places. If you cut through the middle of a face, each piece shows half a nose — neither piece makes sense alone. When a tokenizer splits “strawberry” into “str” + “awberry”, the model can't easily count the r's because the letters are scattered across different pieces. The model has to learn to “see through” the cuts — and sometimes it can't.
Many "LLM failures" are actually tokenization failures. When a model struggles with counting characters, reversing strings, or basic arithmetic, the root cause is often that the tokenizer has merged the characters into opaque chunks that hide the internal structure.
Byte-Level Fallback
Modern tokenizers guarantee they can encode any input — even characters never seen in training — by falling back to raw UTF-8 byte tokens. This "byte-level fallback" is the safety net that ensures no input is ever rejected.
Byte-Level BPE
Operates on UTF-8 bytes from the start. The base vocabulary includes all 256 possible byte values, then BPE merges are applied on top. Any byte sequence is representable.
Byte Fallback Mode
Normally operates on Unicode codepoints. When a character isn't in the vocabulary, SentencePiece can decompose it into UTF-8 bytes as special <0xNN> tokens. Enabled via the --byte_fallback flag.
Emojis outside the vocabulary are encoded as their raw UTF-8 bytes. A typical emoji is 4 bytes, so 2 emojis = 8 byte tokens. The model can still process them because the byte tokens have trained embeddings.
Byte-level fallback is why modern tokenizers are "closed under UTF-8" — there is no possible input string that cannot be tokenized. This was not true of earlier word-level tokenizers that would produce an UNK (unknown) token for any word outside their vocabulary.
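The fallback decomposition itself is simple. A sketch of the SentencePiece-style behavior, where an out-of-vocabulary character becomes `<0xNN>` byte tokens:

```python
def byte_fallback(char, vocab):
    """Byte-fallback sketch: a character missing from the vocabulary
    is decomposed into <0xNN> tokens, one per UTF-8 byte."""
    if char in vocab:
        return [char]
    return [f"<0x{b:02X}>" for b in char.encode("utf-8")]

vocab = {"h", "e", "l", "o"}
print(byte_fallback("h", vocab))   # ['h']
print(byte_fallback("🎉", vocab))  # ['<0xF0>', '<0x9F>', '<0x8E>', '<0x89>']
```

Because every possible byte value 0x00–0xFF has a token, any UTF-8 string decomposes without loss, which is exactly the "closed under UTF-8" guarantee.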