Why Tokenize?
Neural networks operate on numbers, not text. Every string must be converted into a sequence of numeric IDs before a model can process it. The way we split text into pieces — the tokenization strategy — has profound effects on what the model can learn and how efficiently it does so.
"Hello world"
Analogy — LEGO Bricks: Tokenization is like building with LEGO. Character-level is like using only 1×1 bricks — you can build anything, but it takes forever and the instructions are enormous. Word-level is like using pre-built walls — fast to assemble, but you can't build anything the kit didn't include. Subword tokenization is the sweet spot: a mix of small and large bricks that lets you build any structure efficiently.
The tokenizer is a fixed preprocessing step — it runs before the neural network sees anything. A bad tokenizer can cripple even the most powerful model. Getting it right is one of the first critical decisions in LLM design.
Character vs Word vs Subword
Three fundamental approaches to splitting text into tokens. Each trades off vocabulary size, sequence length, and the ability to handle words never seen during training.
| Property | Character | Word | Subword |
|---|---|---|---|
| Vocabulary Size | ~256 (bytes/chars) | 100K–500K+ | 32K–100K |
| Sequence Length | Very long | Short | Moderate |
| OOV Handling | None needed | Poor (UNK token) | Graceful (splits unseen words) |
| Semantic Meaning | None per token | Full per token | Partial morpheme meaning |
| Training Efficiency | Slow (long sequences) | Fast but brittle | Balanced |
| Used By | ByT5, some char-CNNs | Early NLP (pre-2016) | GPT, BERT, LLaMA, Claude |
Subword tokenization is the dominant approach in modern LLMs because it elegantly balances vocabulary size with sequence length. It can represent any word — even completely novel ones — by composing it from smaller, known pieces.
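The three strategies can be contrasted on a single word. The subword split below is invented for illustration and is not the output of any real tokenizer:

```python
# Illustrative comparison: the same word under three tokenization
# strategies. The subword split is a hypothetical example.

word = "unbelievable"

# Character-level: one token per character -> tiny vocab, long sequences.
char_tokens = list(word)

# Word-level: one token per word -> huge vocab, fails on unseen words.
word_tokens = [word]  # works only if the whole word is in the vocabulary

# Subword: compose rare words from common, known pieces.
subword_tokens = ["un", "believ", "able"]

print(len(char_tokens), len(word_tokens), len(subword_tokens))  # 12 1 3
```

The sequence-length tradeoff in the table shows up directly: 12 tokens vs. 1 vs. 3 for the same input.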
Byte-Pair Encoding (BPE)
BPE starts with individual characters and iteratively merges the most frequent adjacent pair into a new token. This simple, greedy algorithm builds a vocabulary that naturally captures common subwords and morphemes.
Analogy — Learning a Language: BPE learns to tokenize the same way a child learns to read. First you know individual letters (characters). Then you notice “th” always appears together, so you start recognizing it as a unit. Then “the” becomes a single chunk. Eventually common words become instant recognition, while rare words you still sound out letter by letter. BPE's merge rules are exactly this process, automated.
GPT-2, GPT-3, and GPT-4 all use BPE. The algorithm is deterministic — given the same training corpus, you always get the same merge rules. Those rules are saved and applied identically at inference time.
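The merge loop can be sketched in a few lines. This is a toy trainer that captures the greedy algorithm described above, not the production GPT-2 implementation (which also does byte-level pre-tokenization and weights words by frequency):

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Toy BPE trainer: corpus is a list of words; returns merge rules."""
    # Start with each word as a tuple of single characters.
    words = [tuple(w) for w in corpus]
    merges = []
    for _ in range(num_merges):
        # Count every adjacent token pair across the corpus.
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair
        merges.append(best)
        # Replace every occurrence of the best pair with the merged token.
        merged_words = []
        for w in words:
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                    out.append(w[i] + w[i + 1])
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            merged_words.append(tuple(out))
        words = merged_words
    return merges

print(train_bpe(["the", "then", "there", "this"], 3))
# → [('t', 'h'), ('th', 'e'), ('the', 'n')]
```

Note how the learned merges mirror the analogy above: "th" is merged first because it is the most frequent pair, then "the" becomes a single unit.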
SentencePiece & WordPiece
BPE is not the only game in town. SentencePiece treats text as a raw byte stream (no pre-tokenization needed), and offers both BPE and Unigram algorithms. WordPiece, used by BERT, takes a likelihood-based approach.
SentencePiece (BPE mode)
- Method: Bottom-up merges
- How It Works: Start with characters, greedily merge the most frequent pair at each step. Same core idea as standard BPE, but applied directly on raw text (byte-level) without pre-tokenization.
- Pre-Tokenization: Not needed — operates on raw byte stream
- Used By: LLaMA 1 & 2
- Pros: Simple, deterministic, well-understood
- Cons: Greedy — may miss globally optimal segmentations
SentencePiece's key innovation is treating the input as a raw byte stream with no language-specific pre-processing. This makes it truly language-agnostic — the same algorithm handles English, Chinese, Arabic, and code without any special rules.
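WordPiece's inference step is worth seeing concretely: it segments each word greedily, always taking the longest vocabulary piece that matches, with "##" marking continuation pieces. This sketch covers inference only, not the likelihood-based training mentioned above:

```python
def wordpiece_encode(word, vocab):
    """Greedy longest-match-first segmentation, WordPiece-style.
    Continuation pieces carry the '##' prefix."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest possible substring first, then shrink.
        while end > start:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub  # mid-word pieces are prefixed
            if sub in vocab:
                piece = sub
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matches: fall back to UNK
        tokens.append(piece)
        start = end
    return tokens

vocab = {"un", "##break", "##able", "break", "##a", "##b"}
print(wordpiece_encode("unbreakable", vocab))
# → ['un', '##break', '##able']
```

The "[UNK]" fallback is exactly the failure mode that byte-level methods eliminate, as discussed later in this section.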
Vocabulary Size Tradeoffs
How big should the vocabulary be? It's a core architectural decision — larger vocabularies mean shorter sequences but bigger embedding tables. There is no universal right answer; it depends on your compute budget, target languages, and context window goals.
Example — 32K vocabulary:
- Sequence length: ~1.0× (baseline)
- Embedding parameters: ~98M (at d=3072)
- Coverage: Good — most words covered
- Used by: LLaMA 1 & 2 (32K)
The sweet spot used by many open-source models. Enough tokens to cover common English words whole, while keeping the embedding table manageable.
Analogy — Morse Code Efficiency: Vocabulary size tradeoffs mirror the design of Morse code. The most common letter in English (“E”) gets the shortest code: a single dot. Rare letters like “Q” get longer codes. BPE does the same thing: common words become single tokens (short codes), while rare words are split into pieces (longer codes). The goal is the same — minimize total transmission length.
The embedding table is the first layer of the model — it converts token IDs to dense vectors. With a 100K vocabulary and 3072-dim embeddings, that's 307M parameters just for the lookup table. At GPT-4 scale (~1.8T params), this is a small fraction. At 7B scale, it's significant.
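The arithmetic behind those figures is straightforward. The d=3072 dimension and the 7B comparison follow the numbers in the text; note that some models tie input and output embeddings, which halves the cost:

```python
def embedding_params(vocab_size, d_model):
    # The input embedding table is a vocab_size x d_model matrix.
    return vocab_size * d_model

# Figures from the text: 100K vocabulary at d=3072.
print(f"{embedding_params(100_000, 3072):,}")  # 307,200,000

# Share of a 7B-parameter model (input embedding only, untied).
share = embedding_params(100_000, 3072) / 7e9
print(f"{share:.1%}")  # 4.4%
```

At 7B scale, more than 4% of all parameters would sit in the lookup table, which is why smaller open models tend toward 32K vocabularies.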
The Tokenizer in Action
See how a real tokenizer breaks text into subword pieces. Each colored span represents one token — notice how common words stay whole, while rare or compound words are split into meaningful subparts.
On average, one GPT-4 token is about 4 characters of English text. But this ratio varies wildly: common words like "the" are single tokens, while rare technical terms may be split into 3–5 pieces. Whitespace is typically attached to the following token (the "Ġ" prefix convention).
Multilingual Challenges
The same semantic content costs a different number of tokens depending on the language. This "fertility rate" — tokens per word — means non-English users consume their context window faster and pay more per API call.
Take the sentence "Machine learning is transforming the world". English is the dominant training language, so most of its words are single tokens.
This is a fairness issue. A 128K context window holds ~100K words of English but only ~40K–50K words of Chinese or Arabic. API pricing per token also means non-English users effectively pay 2–3× more for the same semantic content. Training on more multilingual data and expanding the vocabulary can reduce but not eliminate this gap.
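The context-window arithmetic can be made explicit. The fertility rates below are assumptions chosen to match the rough figures in the text; real values vary by tokenizer and corpus:

```python
def effective_words(context_tokens, tokens_per_word):
    """How many words of a language fit in a fixed context window,
    given that language's fertility rate (tokens per word)."""
    return int(context_tokens / tokens_per_word)

CONTEXT = 128_000

# Illustrative fertility rates (assumed, not measured).
print(effective_words(CONTEXT, 1.3))  # English: ~98K words
print(effective_words(CONTEXT, 2.8))  # higher-fertility language: ~45K words
```

The same 128K-token window holds roughly half as many words for a language the tokenizer handles poorly, and per-token API pricing scales the cost gap the same way.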
Special Tokens
Beyond content tokens, every model uses special control tokens that carry no semantic meaning but are critical for the model to understand structure — where turns begin and end, what role is speaking, and when to stop generating.
The model never "sees" the chat template as text — these special tokens are injected by the tokenizer based on the chat format. Different models use different templates (ChatML, Llama-style, etc.).
Special tokens are added to the vocabulary during training but never appear in normal text. They are the control plane of the model — analogous to HTTP headers vs. body. If you corrupt the special token structure, the model's behavior becomes unpredictable.
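A chat template is just a deterministic string-assembly step that wraps each message in control tokens. This sketch uses the ChatML-style markers (`<|im_start|>`, `<|im_end|>`); other models use different delimiters, and real tokenizers emit these as single special token IDs rather than text:

```python
# Sketch of a ChatML-style chat template applied before tokenization.

def apply_chat_template(messages):
    parts = []
    for m in messages:
        # Each turn is wrapped in start/end control markers with its role.
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    # Open an assistant turn to cue the model to respond.
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)

msgs = [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "Hi!"},
]
print(apply_chat_template(msgs))
```

Generation then stops when the model emits the `<|im_end|>` token, which is why corrupting the template breaks turn boundaries.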
Tokenization Artifacts
Tokenization introduces subtle failure modes. Because the model operates on tokens — not characters or words — certain patterns break in unintuitive ways. Understanding these artifacts is key to writing effective prompts and debugging unexpected model behavior.
Consider the prompt "123 + 456 = ?". "123" is often a single token, so the model doesn't see individual digits. It can't do digit-by-digit addition the way humans do. Multi-digit arithmetic is especially hard because the model has no access to place values within a merged number token.
Analogy — Jigsaw Puzzle: Tokenization artifacts are like a jigsaw puzzle cut in the wrong places. If you cut through the middle of a face, each piece shows half a nose — neither piece makes sense alone. When a tokenizer splits “strawberry” into “str” + “awberry”, the model can't easily count the r's because the letters are scattered across different pieces. The model has to learn to “see through” the cuts — and sometimes it can't.
Many "LLM failures" are actually tokenization failures. When a model struggles with counting characters, reversing strings, or basic arithmetic, the root cause is often that the tokenizer has merged the characters into opaque chunks that hide the internal structure.
Byte-Level Fallback
Modern tokenizers guarantee they can encode any input — even characters never seen in training — by falling back to raw UTF-8 byte tokens. This "byte-level fallback" is the safety net that ensures no input is ever rejected.
Byte-Level BPE
Operates on UTF-8 bytes from the start. The base vocabulary includes all 256 possible byte values, then BPE merges are applied on top. Any byte sequence is representable.
Byte Fallback Mode
Normally operates on Unicode codepoints. When a character isn't in the vocabulary, SentencePiece can decompose it into UTF-8 bytes as special <0xNN> tokens. Enabled via the --byte_fallback flag.
Emojis outside the vocabulary are encoded as their raw UTF-8 bytes. A typical emoji is 4 bytes, so 2 emojis = 8 byte tokens. The model can still process them because the byte tokens have trained embeddings.
Byte-level fallback is why modern tokenizers are "closed under UTF-8" — there is no possible input string that cannot be tokenized. This was not true of earlier word-level tokenizers that would produce an UNK (unknown) token for any word outside their vocabulary.
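The fallback decomposition itself is simple. A sketch of the SentencePiece-style behavior, where an out-of-vocabulary character becomes `<0xNN>` byte tokens:

```python
def byte_fallback(char, vocab):
    """Byte-fallback sketch: a character missing from the vocabulary
    is decomposed into <0xNN> tokens, one per UTF-8 byte."""
    if char in vocab:
        return [char]
    return [f"<0x{b:02X}>" for b in char.encode("utf-8")]

vocab = {"h", "e", "l", "o"}
print(byte_fallback("h", vocab))   # ['h']
print(byte_fallback("🎉", vocab))  # ['<0xF0>', '<0x9F>', '<0x8E>', '<0x89>']
```

Because every possible byte value 0x00–0xFF has a token, any UTF-8 string decomposes without loss, which is exactly the "closed under UTF-8" guarantee.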