Part 3 · Inside Large Language Models

Chapter 11Tokenization: How Text Becomes Tokens

⏱ 10 min read·✏️ 6 exercises·🖼 1 figure·Inside Large Language Models

In Chapter 10 we said a model first splits text into *tokens*, then waved our hands and moved on. It is time to lift the lid on that very first step, because it is far more consequential than it appears. Tokenization quietly determines how much you pay, how much text a model can handle at once, and why models sometimes fail at tasks that look trivially easy to a human. This is a longer chapter than its humble subject suggests, and deliberately so: understanding tokenization will save you money, prevent confusion, and demystify a whole category of strange model behavior. We take it slowly, with plenty of examples, and assume no prior knowledge.

The Step Before Everything Else

Recall the journey from Chapter 10: text becomes tokens, tokens become embeddings, embeddings flow through the transformer, and the model predicts the next token. Every other chapter has focused on what happens after the text is tokenized. This chapter is about that opening move — how a string of human characters becomes the units a model actually works with.

A token is the basic unit of text a model reads and writes. It is not quite a word and not quite a letter; it sits somewhere in between, and understanding why it sits there is the key to this whole chapter. To appreciate the clever solution that real models use, it helps to first see the two obvious approaches and watch them both fail.

Why Not Just Use Words?

The most natural idea is to make each word a token. It feels right — humans think in words, so why not models? But it breaks down quickly under the messiness of real language.

The trouble with one token per word

First, the vocabulary explodes. English alone has hundreds of thousands of words, and that is before you count names, places, brands, and technical terms. Add other languages and the number balloons into the millions. A model would need a separate entry for every one.

Second, and worse, language is endlessly inventive. People coin new words, make typos, mash words together, and write in countless languages. What happens when the model meets a word that was not in its vocabulary — a brand-new product name, a misspelling, a piece of slang? With a pure word vocabulary, the model simply has no token for it, and is stuck. It cannot represent what it has never catalogued.

The trouble with one token per character

So perhaps we should go the other way and make each character a token. Now the vocabulary is tiny — just the letters, digits, and punctuation — and we can represent any text at all, since everything is made of characters. Problem solved? Not quite.

The catch is length. A single page of text becomes thousands of character-tokens, and remember from Chapter 10 that attention compares every token with every other token. Longer sequences mean dramatically more work. Worse, the model would have to spend enormous effort relearning, from scratch, that the characters c-a-t together mean a small furry animal — a fact that a word-level approach gets for free. Characters are flexible but painfully inefficient.

The Goldilocks Solution: Subword Tokens

Real models use a solution that sits neatly between the two extremes: subword tokenization. Tokens are pieces of words — bigger than characters, often smaller than whole words. Common words become a single token, while rare or unusual words are broken into a few meaningful pieces.

Consider the word "unbelievable." A subword tokenizer might split it into un + believ + able. Notice how sensible those pieces are: un is a common prefix, able a common suffix, and the model can reuse them across thousands of other words (unhappy, readable, and so on). A word the model has never seen before can still be represented by stitching together familiar pieces. This is the best of both worlds: efficiency for common words, and the ability to handle anything at all by falling back on smaller chunks.

Figure 11.1 — The sentence "Tokenization is unbelievably useful" split into subword tokens, with common words kept whole and rarer words broken into reusable pieces.

How a Tokenizer Is Built

Where do these subword pieces come from? They are not chosen by hand; they are learned from data, using a clever and surprisingly simple procedure. The most common one is called byte-pair encoding, or BPE, and its idea can be understood without any mathematics.

BPE starts by treating every text as a sequence of individual characters. Then it repeats one move over and over: find the pair of adjacent tokens that occurs most frequently across all the training text, and merge that pair into a single new token. Do this thousands of times, and the most common letter combinations gradually fuse into larger and larger chunks.

Imagine a tiny corpus full of the words "low," "lower," and "lowest." Early on, the pair l and o appears constantly, so it merges into lo. Next, lo and w are often adjacent, so they merge into low. After enough merges, low is a single token, while the endings er and est become their own tokens too. The tokenizer has discovered, purely from frequency, that low, er, and est are useful building blocks. Run this over a vast amount of real text and you end up with a vocabulary of the most useful chunks in the language — common words whole, rarer words splittable into familiar parts.

Seeing Tokenization in Action

Enough theory — let us watch it work. The exact library differs by model, but they all expose the same two operations: turn text into tokens, and count how many there are.

python

text = "Tokenization is unbelievably useful!"

tokens = tokenizer.tokenize(text)
print(tokens)
# Something like:
# ['Token', 'ization', ' is', ' un', 'believ', 'ably', ' useful', '!']

print(len(tokens))    # the number of tokens -- not the number of words!

Two things are worth noticing. The common word "is" stays whole, while the rarer "unbelievably" shatters into pieces. And the token count is higher than the word count — four words became roughly eight tokens. That gap between words and tokens is about to become very practical.

Tokens, Costs, and Context Limits

Here is where tokenization stops being academic and starts affecting your wallet and your designs. Two of the most important practical facts about working with language models are measured in tokens, not words.

First, you are billed per token. When you call a hosted model, the price is based on how many tokens go in (your prompt) and how many come out (the response). A useful rule of thumb for English is that one token is roughly three-quarters of a word, or about four characters. So a 750-word document is in the neighborhood of 1,000 tokens.

python

words = 750
estimated_tokens = int(words / 0.75)   # ~1 token per 0.75 English words
print(estimated_tokens)                # roughly 1000 tokens

Second, the context window is measured in tokens. The maximum amount of text a model can consider at once — which we explore fully in the next chapter — is a token count, not a word count or character count. When you are deciding whether your documents will fit, tokens are the unit that matters.

Why Tokenization Explains Strange Behavior

Some of the most baffling things language models do make perfect sense once you remember that they see tokens, not letters. This section alone explains a surprising amount of model weirdness.

Counting letters is hard. Ask a model how many times the letter "r" appears in "strawberry" and it may stumble. Why? It never sees the individual letters — it sees a couple of tokens. Counting characters inside a token is like being asked about ingredients you were only shown the finished dish of.
Spelling and reversing words is hard. For the same reason, tasks that operate on individual characters — spelling a word backward, for instance — are unnaturally difficult, because the model has to reason about letters it does not directly perceive.
Spaces and capitalization matter. The token for "cat" at the start of a sentence and the token for " cat" (with a leading space) mid-sentence can be different tokens entirely. Small formatting differences can change how text is tokenized.
Numbers and math can be awkward. Long numbers often split into odd token pieces, which is one reason models can fumble arithmetic on large numbers even when the logic is simple.
Other languages can cost more. Tokenizers are usually trained mostly on English, so text in other languages — and especially in non-Latin scripts — often breaks into more tokens per word, making it more expensive and consuming more of the context window.

Special Tokens

Not every token represents ordinary text. Models also use special tokens — invisible markers that signal structure. There are tokens that mark the beginning and end of a piece of text, and, crucially for chat models, tokens that mark the boundaries between the system instructions, the user's messages, and the assistant's replies.

Remember the "list of dictionaries" message format from Chapter 4, with its role and content fields? Behind the scenes, those roles are converted into special tokens that tell the model where one speaker stops and another begins. The tidy structure you write in code is translated into a stream of tokens — ordinary ones for the words, special ones for the structure — and that combined stream is what the model actually reads.

Summary

Tokenization is the first step in a model's pipeline, turning raw text into the units it works with. Pure word tokens explode the vocabulary and cannot handle new words; pure character tokens are flexible but hopelessly inefficient. Subword tokenization, learned from data by a frequency-merging procedure like byte-pair encoding, strikes the balance: common words stay whole, rare words split into reusable pieces, and anything at all can be represented. Because models see tokens rather than words or letters, tokens are the unit of both cost and context limits, and tokenization explains a whole family of odd behaviors — from miscounting letters to mishandling other languages. Special tokens, finally, encode the structure of a conversation.

Now that we know how text becomes tokens, Chapter 12 returns to attention to examine how many tokens a model can hold in mind at once — the context window — and why that limit shapes everything you will build.

Practice

Exercises

1Using any tokenizer tool (most model providers offer one online), tokenize three sentences of your choice and record the token count for each. Compare it to the word count and note the ratio you observe.
2Find a word that splits into three or more tokens, and one that stays a single token. Explain, using the idea of frequency and subword pieces, why each was tokenized the way it was.
3Estimate the token count of a 500-word email using the rule of thumb in the chapter, then verify your estimate with a real tokenizer. How close was your estimate?
4Ask a language model to count how many times a specific letter appears in a long word, and to spell a word backward. Describe what happens, and explain it in terms of tokens rather than letters.
5Take one sentence in English and the same sentence translated into another language (especially one with a non-Latin script if you can). Tokenize both and compare the token counts. What does the difference imply for cost?
6Explain to a friend why a model billed 'per token' might charge differently for two messages that have the same number of words. Use at least two reasons from this chapter.

View detailed solutions for all chapters →