Why Not Characters? Why Not Words?
The obvious tokenization choices are characters or whitespace-split words. Both fail in practice:
- Character-level: A 1,000-character document becomes 1,000 tokens. Transformer self-attention costs O(N²) in sequence length N, so compute and memory grow quadratically as sequences get longer. The model must also learn from scratch how characters combine into words.
- Word-level: English has ~170,000 words. Add proper nouns, code identifiers, URLs, typos, and non-English text and you need a vocabulary of millions. Unknown tokens ("[UNK]") destroy performance on rare words. Morphologically rich languages (Finnish, Turkish) fare worst, because every inflection of a word counts as a different "word."
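The two failure modes can be made concrete with a small sketch. The sample sentence and the tiny `word_vocab` are hypothetical, chosen only to show sequence length on one side and an out-of-vocabulary word on the other.

```python
# Hypothetical sample text and word vocabulary, for illustration only.
text = "The tokenizer handles untokenizable words"
word_vocab = {"the", "tokenizer", "handles", "words"}

# Character-level: one token per character, so sequences get long.
char_tokens = list(text)

# Word-level: anything outside the vocabulary collapses to [UNK].
word_tokens = [w if w.lower() in word_vocab else "[UNK]" for w in text.split()]

# Self-attention cost scales with the square of the sequence length.
attention_ratio = len(char_tokens) ** 2 / len(word_tokens) ** 2

print(len(char_tokens), "character tokens vs", len(word_tokens), "word tokens")
print("word-level tokens:", word_tokens)
print(f"relative attention cost (char vs word): {attention_ratio:.0f}x")
```

The character sequence is several times longer than the word sequence, and the quadratic attention term amplifies that gap, while the word-level version loses "untokenizable" entirely.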
Subword tokenization is the solution: break text into chunks that are smaller than words but larger than characters, using frequency statistics from a training corpus to decide the boundaries.
Byte Pair Encoding: The Algorithm
BPE was originally a data compression algorithm. Applied to NLP by Sennrich et al. (2016), it builds a vocabulary bottom-up by merging the most frequent adjacent pairs of symbols.
Step 1: Start with a character-level vocabulary. Add a special end-of-word marker (here: ·) to the end of each word so the model can tell where word boundaries are.
Corpus: "low lower newest widest"
Initial split (character + boundary marker):
l o w ·
l o w e r ·
n e w e s t ·
w i d e s t ·
Character vocabulary: {l, o, w, ·, e, r, n, s, t, i, d}
Step 2: Count all adjacent pairs across the corpus. Merge the most frequent pair into a new token.
Pair frequencies:
(e, s): 2 (in "newest·", "widest·")
(s, t): 2 (in "newest·", "widest·")
(e, w): 1
...
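Most frequent: (e, s) with count 2, tied with (s, t) and several other pairs; ties can be broken arbitrarily, and this walkthrough picks (e, s) → merge into "es"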
After merge 1:
l o w · l o w e r ·
n e w es t · w i d es t ·
Step 3: Repeat until vocabulary reaches target size. After the second merge, (es, t) → "est":
After merge 2:
l o w · l o w e r ·
n e w est · w i d est ·
After many merges, common words like "the", "is", "low" become single tokens; rarer words are split into subwords. The final vocabulary is the character set plus every merged pair. GPT-4 uses a vocabulary of ~100,000 tokens via tiktoken's cl100k_base encoding.
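The merge loop can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production tokenizer: each word counts once (no word frequencies), and ties are broken by picking the lexicographically smallest pair, a rule chosen here only because it reproduces the merge order of the walkthrough. The function names are hypothetical.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def train_bpe(corpus, num_merges, eow="·"):
    """Learn up to `num_merges` merges from a whitespace-separated corpus."""
    words = [list(w) + [eow] for w in corpus.split()]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols in words:
            pairs.update(zip(symbols, symbols[1:]))
        if not pairs:
            break
        # Highest count wins; ties go to the lexicographically smallest pair
        # (an assumption made so the output matches the worked example).
        best = min(pairs, key=lambda p: (-pairs[p], p))
        merges.append(best)
        words = [merge_pair(symbols, best) for symbols in words]
    return merges, words

def encode_word(word, merges, eow="·"):
    """Segment a new word by replaying the learned merges in order."""
    symbols = list(word) + [eow]
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols

merges, words = train_bpe("low lower newest widest", num_merges=2)
print(merges)
print(words)
print(encode_word("lowest", merges))
```

Replaying the merges on an unseen word such as "lowest" shows how a rare word falls back to smaller pieces while still reusing the learned "est" unit.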
The Whitespace Gotcha That Breaks Prompts
This is the most common misunderstanding with tiktoken specifically. The tokenizer encodes leading whitespace as part of the following token, not as a separate space token.
import tiktoken
enc = tiktoken.get_encoding("cl100k_base")
enc.encode("hello")    # → [15339]       (1 token)
enc.encode(" hello")   # → [24748]       (1 token, a DIFFERENT token)
enc.encode("  hello")  # → [220, 24748]  (2 tokens: a lone space, then " hello")
enc.encode("Hello")    # → [9906]        (1 token, different again)
The practical implication: if you split text on whitespace, strip it, or re-join the pieces before sending it to an API, the model may see different tokens from the ones the raw text would produce. That can hurt output quality, especially for continuation tasks, where a trailing or missing space changes which token comes next. Pass raw text and let the tokenizer handle it.
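The reason a leading space sticks to the following word is the pre-tokenization step that runs before any BPE merges. The sketch below uses a simplified, ASCII-only pattern in the style of the GPT-2 split regex; it is an assumption for illustration and not tiktoken's actual pattern, which uses Unicode-aware character classes and additional rules.

```python
import re

# Simplified stand-in for a GPT-2-style split pattern (assumption: ASCII only).
# An optional single space is attached to the word, number, or punctuation
# run that follows it; leftover whitespace becomes its own piece.
SPLIT = re.compile(r" ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+(?!\S)|\s+")

def pre_tokenize(text: str) -> list[str]:
    """Split text into the chunks that BPE merges would then operate on."""
    return SPLIT.findall(text)

for sample in ["hello", " hello", "  hello", "hello world"]:
    print(repr(sample), "->", pre_tokenize(sample))

# Splitting on whitespace first throws the attached spaces away.
print("hello world".split())
```

With two leading spaces, the first space is left over as a separate chunk and the second travels with the word, which mirrors the two-token result in the tiktoken example.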
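Indentation costs tokens too, though how many depends on the encoding. cl100k_base has dedicated tokens for runs of spaces, so an indented line usually spends about one token on its leading whitespace whatever its depth; encodings without such tokens can spend one token per space or per small group of spaces. Over a 50-line function that is still roughly 50 tokens of pure whitespace, and more with a less whitespace-aware encoding. For code-heavy prompts, measure with the tokenizer first, then consider compressing indentation or using a more compact style if the count actually drops.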
Token Costs Across Languages
Most tokenizer vocabularies are learned from corpora dominated by English text. Non-Latin scripts therefore tokenize less efficiently: the same content requires more tokens, and thus more cost and more of the context window.
| Language | Sample Phrase | Tokens (cl100k) | Characters | Tokens/char |
|---|---|---|---|---|
| English | "The weather is nice today" | 6 | 26 | 0.23 |
| French | "Le temps est beau aujourd'hui" | 8 | 29 | 0.28 |
| Hindi (Devanagari) | "आज मौसम अच्छा है" | 14 | 17 | 0.82 |
| Japanese | "今日は天気がいいです" | 12 | 9 | 1.33 |
| Chinese (Simplified) | "今天天气很好" | 8 | 6 | 1.33 |
| Arabic | "الطقس جميل اليوم" | 11 | 17 | 0.65 |
Japanese and Chinese encode each character as 1–2 tokens, while English text averages about 1 token per 4 characters. Applications serving multilingual users should budget ~2–3× more tokens for non-Latin scripts.
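Part of the gap comes from the starting point rather than the merges. Assuming a byte-level BPE, where the base units are UTF-8 bytes, a script whose characters need two or three bytes each begins with a longer sequence before any merge is applied. The sketch below only measures UTF-8 bytes per character for the sample phrases; it does not count tokens.

```python
# Sample phrases taken from the table above.
samples = {
    "English": "The weather is nice today",
    "Hindi": "आज मौसम अच्छा है",
    "Japanese": "今日は天気がいいです",
    "Chinese": "今天天气很好",
    "Arabic": "الطقس جميل اليوم",
}

def bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes per Unicode code point."""
    return len(text.encode("utf-8")) / len(text)

for language, phrase in samples.items():
    n_bytes = len(phrase.encode("utf-8"))
    print(f"{language:10s} chars={len(phrase):3d} bytes={n_bytes:3d} "
          f"bytes/char={bytes_per_char(phrase):.2f}")
```

English sits at one byte per character, while the CJK samples need three, so the tokenizer has to learn many more merges to reach the same tokens-per-character ratio.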
Counting Tokens Before API Calls
import tiktoken

def count_tokens(messages: list[dict], model: str = "gpt-4o") -> int:
    """Estimate the prompt token count for a list of chat messages."""
    enc = tiktoken.encoding_for_model(model)
    tokens_per_message = 3  # every message has role/content framing overhead
    tokens_per_name = 1     # extra token when a message carries a "name" field
    total = 0
    for msg in messages:
        total += tokens_per_message
        for key, value in msg.items():
            total += len(enc.encode(value))
            if key == "name":
                total += tokens_per_name
    total += 3  # reply priming overhead
    return total

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain transformers in 100 words."},
]
print(count_tokens(messages))  # → ~26 tokens before the model responds

SentencePiece vs tiktoken vs WordPiece
| Library | Algorithm | Used By | Key Difference |
|---|---|---|---|
| tiktoken | BPE | OpenAI models | Byte-level BPE; every UTF-8 byte is in the base vocabulary, so [UNK] never occurs |
| SentencePiece | BPE or Unigram | Llama 2, T5, Gemini | Language-agnostic; operates on raw text without pre-tokenization |
| WordPiece | BPE variant | BERT, DistilBERT | Maximizes language model likelihood at each merge step instead of raw frequency |
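The difference between the BPE and WordPiece merge criteria shows up even on a toy corpus. The sketch below assumes the commonly described WordPiece score, count(ab) / (count(a) × count(b)), which is not stated in the table and is used here as an approximation; it also omits WordPiece's `##` continuation markers. The function names are hypothetical.

```python
from collections import Counter

corpus = "low lower newest widest"
words = [list(w) + ["·"] for w in corpus.split()]

pair_counts, symbol_counts = Counter(), Counter()
for symbols in words:
    symbol_counts.update(symbols)
    pair_counts.update(zip(symbols, symbols[1:]))

def bpe_choice(pairs):
    """BPE: merge the pair with the highest raw count."""
    return min(pairs, key=lambda p: (-pairs[p], p))

def wordpiece_choice(pairs, symbols):
    """WordPiece-style: normalize the pair count by its parts' counts."""
    def score(p):
        return pairs[p] / (symbols[p[0]] * symbols[p[1]])
    return min(pairs, key=lambda p: (-score(p), p))

print("BPE picks:      ", bpe_choice(pair_counts))
print("WordPiece picks:", wordpiece_choice(pair_counts, symbol_counts))
```

BPE favors a pair that is simply frequent, while the normalized score favors a pair whose parts rarely appear apart, here (i, d), even though it occurs only once.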
The Reversal Curse: When Token Direction Matters
Models struggle to reverse strings, not because of memory limits, but because they operate on tokens rather than characters. A word such as "apple" may be stored as subword pieces, for example [ap|ple], and the model never sees the individual letters inside each piece. It has also never learned "elppa" as a token, so reversed text is a stream of individually unfamiliar subwords. This is why asking an LLM to "spell this word backwards" triggers errors that feel nonsensical but are mechanistically predictable.
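A short sketch shows why the task is harder at the token level than it looks. The segmentation `["ap", "ple"]` is the hypothetical split from the paragraph above, not the output of a real tokenizer.

```python
word = "apple"
tokens = ["ap", "ple"]  # hypothetical segmentation, for illustration only

# What the user is asking for: reversal at the character level.
char_reversed = word[::-1]

# Reversing the order of the tokens alone does not give that answer.
token_reversed = "".join(reversed(tokens))

# The correct result needs the spelling inside every token as well.
spelled_reversed = "".join(t[::-1] for t in reversed(tokens))

print(char_reversed, token_reversed, spelled_reversed)
```

To answer correctly the model has to recall the letters hidden inside each token and then emit them in an order that matches no token it has seen, which is two steps it was never directly trained on.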
For English prose: 1 token ≈ 4 characters ≈ 0.75 words. A 4,096-token context window fits roughly 3,000 words — a long article. For code or non-English text, assume 2× the token count.
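These rules of thumb can be wrapped in a rough estimator for cases where running a real tokenizer is not practical. The constants come straight from the paragraph above; the function names and the `multiplier` parameter are hypothetical, and the result is only a planning estimate, not a substitute for an exact count.

```python
import math

CHARS_PER_TOKEN = 4      # rule of thumb for English prose
WORDS_PER_TOKEN = 0.75   # rule of thumb for English prose

def estimate_tokens(text: str, multiplier: float = 1.0) -> int:
    """Rough token estimate; use multiplier=2.0 for code or non-English text."""
    return math.ceil(len(text) / CHARS_PER_TOKEN * multiplier)

def words_that_fit(context_tokens: int) -> int:
    """Approximate number of English words that fit in a context window."""
    return int(context_tokens * WORDS_PER_TOKEN)

print(estimate_tokens("a" * 400))        # English prose estimate
print(estimate_tokens("a" * 400, 2.0))   # code or non-English estimate
print(words_that_fit(4096))
```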