Why Not Characters? Why Not Words?

The obvious tokenization choices are characters or whitespace-split words. Both fail in practice:

  • Character-level: A 1,000-character document becomes 1,000 tokens. Self-attention cost grows as O(N²) in sequence length, so compute and memory blow up long before the context holds much text. The model must also learn from scratch which character sequences form words.
  • Word-level: English has ~170,000 words. Add proper nouns, code identifiers, URLs, typos, and non-English text and you need a vocabulary of millions. Anything outside the vocabulary collapses to an unknown token ("[UNK]"), which destroys performance on rare words. Morphologically rich languages (Finnish, Turkish) fare worst, because every inflection of a word counts as a different "word."
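
To make the trade-off concrete, here is a rough sketch comparing sequence length and the number of attention pairs for the two naive schemes. The sample sentence and the helper name `attention_pairs` are illustrative assumptions, and N² is only the scaling term, not a real cost model.

```python
def attention_pairs(n_tokens: int) -> int:
    """Query-key pairs scored by full self-attention: N squared."""
    return n_tokens * n_tokens

# Illustrative sample sentence (not from any benchmark).
text = "Subword tokenizers balance vocabulary size against sequence length."

char_tokens = list(text)    # character-level: one token per character
word_tokens = text.split()  # word-level: one token per whitespace-separated word

for name, tokens in [("char", char_tokens), ("word", word_tokens)]:
    n = len(tokens)
    print(f"{name}: {n} tokens, {attention_pairs(n)} attention pairs")
```

The character-level sequence is several times longer than the word-level one, and the quadratic term magnifies that gap.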

Subword tokenization is the solution: break text into chunks that are smaller than words but larger than characters, using frequency statistics from a training corpus to decide the boundaries.

Byte Pair Encoding: The Algorithm

BPE was originally a data compression algorithm. Applied to NLP by Sennrich et al. (2016), it builds a vocabulary bottom-up by merging the most frequent adjacent pairs of symbols.

Step 1: Start with a character-level vocabulary. Add a special end-of-word marker (here: ·) to the end of each word so the model can tell where word boundaries are.

Corpus: "low lower newest widest"

Initial split (character + boundary marker):
l o w ·
l o w e r ·
n e w e s t ·
w i d e s t ·

Character vocabulary: {l, o, w, ·, e, r, n, s, t, i, d}

Step 2: Count all adjacent pairs across the corpus. Merge the most frequent pair into a new token.

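Pair frequencies:
  (l, o): 2  (in "low·", "lower·")
  (o, w): 2  (in "low·", "lower·")
  (w, e): 2  (in "lower·", "newest·")
  (e, s): 2  (in "newest·", "widest·")
  (s, t): 2  (in "newest·", "widest·")
  (t, ·): 2  (in "newest·", "widest·")
  (e, w): 1
  ...

Six pairs tie at count 2, so which one is merged first depends on the implementation's tie-breaking rule. This walkthrough picks (e, s) → merge into "es"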

After merge 1:
l o w ·        l o w e r ·
n e w es t ·   w i d es t ·

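Step 3: Repeat until the vocabulary reaches the target size, recounting pairs after every merge. The new pair (es, t) is among those tied at count 2, and this walkthrough merges it next, (es, t) → "est":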

After merge 2:
l o w ·        l o w e r ·
n e w est ·    w i d est ·

After many merges, common words like "the", "is", "low" become single tokens; rarer words are split into subwords. The final vocabulary is the character set plus every merged pair. GPT-4 uses a vocabulary of ~100,000 tokens via tiktoken's cl100k_base encoding.
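
The training loop above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by any particular library. Several pairs tie at count 2 in this toy corpus, so the sketch assumes a tie-break rule (highest count first, then the lexicographically smallest pair) chosen only because it reproduces the merges shown in the walkthrough. The function names are hypothetical.

```python
from collections import Counter

END = "·"  # end-of-word marker, as in the walkthrough

def pair_counts(words: Counter) -> Counter:
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    counts = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts

def merge_pair(symbols: tuple, pair: tuple) -> tuple:
    """Replace every occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)

def train_bpe(corpus: str, num_merges: int):
    """Return the learned merges and the corpus words after applying them."""
    words = Counter(tuple(w) + (END,) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        counts = pair_counts(words)
        if not counts:
            break
        # Assumed tie-break: highest count, then lexicographically smallest pair.
        best = min(counts, key=lambda p: (-counts[p], p))
        merges.append(best)
        updated = Counter()
        for symbols, freq in words.items():
            updated[merge_pair(symbols, best)] += freq
        words = updated
    return merges, words

merges, words = train_bpe("low lower newest widest", num_merges=2)
print(merges)  # [('e', 's'), ('es', 't')]
for symbols in words:
    print(" ".join(symbols))
```

A different tie-break rule (for example, first pair encountered) would merge (l, o) first and produce a different but equally valid merge table.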

The Whitespace Gotcha That Breaks Prompts

This is the most common misunderstanding with tiktoken specifically. The tokenizer encodes leading whitespace as part of the following token, not as a separate space token.

import tiktoken
enc = tiktoken.get_encoding("cl100k_base")

enc.encode("hello")       # → [15339]        (1 token)
enc.encode(" hello")      # → [24748]        (1 token — DIFFERENT token!)
enc.encode("  hello")     # → [220, 24748]   (2 tokens)
enc.encode("Hello")       # → [9906]         (1 token — DIFFERENT again)

The practical implication: if you split text on whitespace before passing it to an API, you may break the token boundaries in ways that hurt model performance, especially for continuation tasks. Always pass raw text and let the tokenizer handle it.
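
The behavior comes from the pre-tokenization step that runs before BPE merges. The sketch below uses a simplified, ASCII-only regular expression in the style of the GPT-2 pre-tokenization pattern. It is an assumption for illustration and is not tiktoken's actual pattern, but it shows why a leading space attaches to the following word and why pre-splitting on whitespace changes what the tokenizer sees.

```python
import re

# Simplified stand-in for a GPT-2-style pre-tokenization pattern (assumption:
# ASCII only, no contraction handling). Not the pattern tiktoken ships.
PRETOKEN = re.compile(r" ?[A-Za-z]+| ?\d+| ?[^\sA-Za-z\d]+|\s+(?!\S)|\s+")

def pretokenize(text: str) -> list[str]:
    """Split text into the chunks that BPE merges would then operate on."""
    return PRETOKEN.findall(text)

print(pretokenize("hello"))    # ['hello']
print(pretokenize(" hello"))   # [' hello']
print(pretokenize("  hello"))  # [' ', ' hello']

# Raw text keeps the space glued to the next word...
raw = pretokenize("hello world")
# ...while splitting on whitespace first throws that space away.
pieces = [p for word in "hello world".split() for p in pretokenize(word)]
print(raw, pieces)
```

Here `raw` contains " world" with its leading space, while `pieces` contains "world" without it, which a real tokenizer would map to a different token ID.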

⚠️ Code Indentation Is Expensive

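Indentation costs tokens on every line, and how many depends on the encoding. cl100k_base has dedicated tokens for runs of spaces, so an indented line typically pays about one extra token even at several levels of nesting; older encodings such as r50k_base merge whitespace far less aggressively and pay more as nesting deepens. Over 50 indented lines that is on the order of 50 tokens or more spent on whitespace. For code-heavy prompts, measure with the encoding your model uses, and consider compressing indentation or using a more compact style.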
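
One way to compress indentation is to re-indent with a single space per level, which Python still accepts. The helper below is a hypothetical sketch: it assumes the source uses spaces only at a fixed width, and it will disturb alignment inside multi-line strings and continuation lines. It reports characters removed, not tokens, since the token saving depends on the encoding.

```python
def compress_indent(source: str, old_width: int = 4, new_unit: str = " ") -> str:
    """Re-indent code using `new_unit` per level instead of `old_width` spaces."""
    lines = []
    for line in source.splitlines():
        body = line.lstrip(" ")
        levels = (len(line) - len(body)) // old_width  # whole indentation levels
        lines.append(new_unit * levels + body)
    return "\n".join(lines)

sample = "\n".join([
    "def total_positive(xs):",
    "    total = 0",
    "    for x in xs:",
    "        if x > 0:",
    "            total += x",
    "    return total",
])

compact = compress_indent(sample)
print(compact)
print(len(sample) - len(compact), "characters of indentation removed")
```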

Token Costs Across Languages

Tokenizers are trained primarily on English text. Non-Latin scripts tokenize less efficiently — the same information density requires more tokens, and thus more cost and context window usage.

Language | Sample Phrase | Tokens (cl100k) | Characters | Tokens/char
English | "The weather is nice today" | 6 | 25 | 0.24
French | "Le temps est beau aujourd'hui" | 8 | 29 | 0.28
Hindi (Devanagari) | "आज मौसम अच्छा है" | 14 | 16 | 0.88
Japanese | "今日は天気がいいです" | 12 | 10 | 1.20
Chinese (Simplified) | "今天天气很好" | 8 | 6 | 1.33
Arabic | "الطقس جميل اليوم" | 11 | 16 | 0.69

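Japanese and Chinese spend roughly 1–2 tokens per character, while English text averages about 1 token per 4 characters. Each CJK character carries more meaning than a Latin letter, so the per-character gap overstates the cost of equivalent content. Even so, applications serving multilingual users should budget ~2–3× more tokens for non-Latin scripts.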
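
Part of the gap is visible before any merges happen. Assuming a byte-level BPE such as tiktoken's, text starts out as UTF-8 bytes, and scripts outside ASCII need two or three bytes per character. Merges learned mostly from English then compress those bytes less. The sketch below only measures UTF-8 size; it does not compute real token counts, and the helper name is hypothetical.

```python
samples = {
    "English": "The weather is nice today",
    "Hindi": "आज मौसम अच्छा है",
    "Japanese": "今日は天気がいいです",
    "Chinese": "今天天气很好",
    "Arabic": "الطقس جميل اليوم",
}

def utf8_bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes needed per Unicode code point."""
    return len(text.encode("utf-8")) / len(text)

for language, phrase in samples.items():
    n_bytes = len(phrase.encode("utf-8"))
    print(f"{language}: {n_bytes} bytes, {utf8_bytes_per_char(phrase):.2f} bytes/char")
```

The byte count is the worst-case token count for a byte-level tokenizer, reached when no merges apply at all.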

Counting Tokens Before API Calls

import tiktoken

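def count_tokens(messages: list[dict], model: str = "gpt-4o") -> int:
    """Estimate the prompt tokens for a list of chat messages."""
    try:
        enc = tiktoken.encoding_for_model(model)
    except KeyError:
        # Unknown model name: fall back to a general-purpose encoding.
        enc = tiktoken.get_encoding("cl100k_base")
    # Framing overheads follow OpenAI's published counting recipe for chat
    # models; treat them as approximate, since they can differ between models.
    tokens_per_message = 3  # every message has role/content framing overhead
    tokens_per_name = 1
    total = 0
    for msg in messages:
        total += tokens_per_message
        for key, value in msg.items():
            total += len(enc.encode(value))  # role, content, and name are all encoded
            if key == "name":
                total += tokens_per_name
    total += 3  # reply priming overhead
    return total

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user",   "content": "Explain transformers in 100 words."}
]
print(count_tokens(messages))  # prompt tokens, counted before the model responds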

SentencePiece vs tiktoken vs WordPiece

Library | Algorithm | Used By | Key Difference
tiktoken | BPE | OpenAI models | Byte-level BPE; every UTF-8 byte is in the base vocabulary so no [UNK] ever
SentencePiece | BPE or Unigram | Llama, T5, Gemini | Language-agnostic; operates on raw text without pre-tokenization
WordPiece | BPE variant | BERT, DistilBERT | Maximizes language model likelihood at each merge step instead of raw frequency
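
The "no [UNK]" property of a byte-level base vocabulary can be seen with a toy comparison. The word vocabulary and function names below are hypothetical, and the byte-level function shows only the starting point before any BPE merges, not a real tokenizer's output.

```python
# Hypothetical toy word-level vocabulary with an explicit unknown token.
word_vocab = {"the": 0, "weather": 1, "is": 2, "nice": 3, "[UNK]": 4}

def word_level_ids(text: str) -> list[int]:
    """Look each word up; anything unseen collapses to [UNK]."""
    return [word_vocab.get(w, word_vocab["[UNK]"]) for w in text.lower().split()]

def byte_level_ids(text: str) -> list[int]:
    """Base symbols of a byte-level tokenizer: UTF-8 byte values 0..255."""
    return list(text.encode("utf-8"))

text = "the weather is schön"
print(word_level_ids(text))     # [0, 1, 2, 4]  ("schön" is lost as [UNK])
print(byte_level_ids("schön"))  # [115, 99, 104, 195, 182, 110]

# Byte IDs always round-trip back to the original text.
assert bytes(byte_level_ids(text)).decode("utf-8") == text
```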

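Reversing Strings: When Token Boundaries Matter

Models struggle to reverse strings, and the cause is tokenization rather than memory. (This is a different problem from the "reversal curse" reported in the literature, where a model trained on "A is B" fails to infer "B is A".) A model sees token IDs, not letters: if "apple" tokenizes as [ap|ple], nothing in the input spells out the characters inside each token. The reversed string "elppa" is not a token the model has learned, so it arrives as a stream of rare, unfamiliar subwords. This is why asking an LLM to "spell this word backwards" triggers errors that feel nonsensical but are mechanistically predictable.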
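
A tiny sketch makes the mismatch visible. It takes the [ap|ple] split from the paragraph above as a hypothetical tokenization and compares reversing at the token level with reversing at the character level.

```python
tokens = ["ap", "ple"]  # hypothetical split of "apple" used in the text

word = "".join(tokens)
char_reversed = word[::-1]                  # what the user asked for
token_reversed = "".join(reversed(tokens))  # what shuffling whole tokens gives

print(char_reversed)   # elppa
print(token_reversed)  # pleap
```

Producing "elppa" requires knowing the spelling inside each token, which the token IDs themselves do not expose.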

✅ Rule of Thumb

For English prose: 1 token ≈ 4 characters ≈ 0.75 words. A 4,096-token context window fits roughly 3,000 words, about the length of a long article. For code or non-English text, assume roughly 2× the token count, and more for non-Latin scripts.
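
The rule of thumb translates directly into a quick estimator. This is a back-of-the-envelope sketch with hypothetical helper names; the constants are the approximations stated above, and a real tokenizer should be used whenever the exact count matters.

```python
import math

CHARS_PER_TOKEN = 4     # rule-of-thumb value for English prose
WORDS_PER_TOKEN = 0.75  # rule-of-thumb value for English prose

def estimate_tokens(text: str, multiplier: float = 1.0) -> int:
    """Rough token estimate; pass multiplier=2.0 for code or non-English text."""
    return math.ceil(len(text) / CHARS_PER_TOKEN * multiplier)

def words_that_fit(context_tokens: int) -> int:
    """Approximate number of English words that fit in a context window."""
    return int(context_tokens * WORDS_PER_TOKEN)

print(words_that_fit(4096))                                   # 3072
print(estimate_tokens("The weather is nice today"))           # 7
print(estimate_tokens("def f(x): return x", multiplier=2.0))  # 9
```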