Chapter 12 of 25

Why "strawberry" Has Three R's (and the Model Can't Count Them)

Created May 28, 2026 Updated Jun 7, 2026

The "how many r's in strawberry" question keeps embarrassing LLMs for one reason: the model is not fed letters. It is fed tokens. It can pick up character-level facts from training data, but letters are never the units it actually receives as input.

A token is the unit of vocabulary the model operates on — typically 50,000 to 250,000 entries, learned by an algorithm like BPE (Byte-Pair Encoding) on a giant text corpus. BPE starts from bytes and greedily merges the most frequent adjacent pairs into new tokens. Frequent strings end up as single tokens. Rare strings get split into several. The model's input is a sequence of token IDs; its output is a probability distribution over those same IDs.
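The merge loop described above can be sketched in a few lines. This is a toy illustration, not any production tokenizer's training code: the function names (train_bpe, apply_merge, encode) and the tiny corpus are made up for the example, and real implementations add pre-tokenization rules, special tokens, and far larger corpora.

```python
from collections import Counter

def apply_merge(symbols, pair):
    """Fuse every adjacent occurrence of `pair` in a tuple of byte strings."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def train_bpe(corpus, num_merges):
    """Toy byte-level BPE: learn up to `num_merges` merges from `corpus`."""
    # Every whitespace-separated word starts as a tuple of single bytes.
    words = Counter(
        tuple(bytes([b]) for b in word.encode("utf-8"))
        for word in corpus.split()
    )
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # greedy: most frequent pair
        merges.append(best)
        # Rewrite the corpus with the chosen pair fused into one symbol.
        merged_words = Counter()
        for symbols, freq in words.items():
            merged_words[apply_merge(symbols, best)] += freq
        words = merged_words
    return merges

def encode(word, merges):
    """Split `word` into tokens by replaying the learned merges in order."""
    symbols = tuple(bytes([b]) for b in word.encode("utf-8"))
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return [s.decode("utf-8", errors="replace") for s in symbols]

# Hypothetical corpus: "the" is frequent, "xylo" is rare.
merges = train_bpe("the the the the then xylo", num_merges=2)
print(encode("the", merges))   # ['the']
print(encode("xylo", merges))  # ['x', 'y', 'l', 'o']
```

With only two merges, the frequent word collapses into a single token while the rare one stays split into individual bytes, which is the same asymmetry a real vocabulary shows at scale.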

"strawberry" under the GPT-4 tokenizer (cl100k_base) splits into ["str", "aw", "berry"] — three tokens, none of which is a letter. (Mid-sentence, carrying its leading space, strawberry is actually a single token — which only sharpens the point.) Inside the model there is no r, no s — just integers indexing into a 100,000-row embedding table.

Asked to count the r's, the model has to do something like this: recover which characters each token "contains," count occurrences across that recovered string, and output a number. That chain depends on whether the training data happened to contain enough letter-level supervision (spelling, phonics, character-counting puzzles) to make the token-to-letters mapping reliable. Often it doesn't. The model then produces a plausible-sounding number rather than a correct one. The cause is not weak arithmetic: letters were never first-class positions in the model's input, and the token, not the character, is the unit it can index.
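Written as ordinary code, that chain looks trivial, which is exactly the point: a program gets an explicit ID-to-string table, while the model has to have learned the spelling of each ID from data. In the sketch below, the three-way split is the one quoted earlier; the integer IDs and the names TOY_VOCAB and count_letter are hypothetical, chosen only for illustration.

```python
# Hypothetical token IDs; only the split into "str", "aw", "berry"
# comes from the text. Real cl100k_base IDs are different.
TOY_VOCAB = {9001: "str", 9002: "aw", 9003: "berry"}

# What the model actually receives: integers, not letters.
token_ids = [9001, 9002, 9003]

def count_letter(ids, letter, vocab=TOY_VOCAB):
    # Step 1: recover the characters each token stands for.
    text = "".join(vocab[i] for i in ids)
    # Step 2: count occurrences across the recovered string.
    return text.count(letter)

# The answer is spread unevenly across tokens.
per_token = [TOY_VOCAB[i].count("r") for i in token_ids]

print(per_token)                     # [1, 0, 2]
print(count_letter(token_ids, "r"))  # 3
```

Step 1 is a dictionary lookup here. Inside the model there is no such lookup, only whatever association between each ID and its spelling was picked up during training.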

The same pattern shows up everywhere tokenization warps the surface form:

Non-English text costs more tokens per word. Earlier, smaller GPT tokenizers like cl100k_base split a typical Russian word into 2–4 tokens versus 1–2 for English. Newer ones (o200k_base, Llama 3's 128K BPE, Gemini's ~256K SentencePiece) narrow the gap but don't close it. The budget multiplier on non-English text is real.
Numbers are tokenized inconsistently. "12345" may be one token or several; "12346" may split differently. Arithmetic then has to be learned over a representation that doesn't naturally preserve digit positions — the model can pick up algorithmic patterns, but the encoding works against it.
Code indentation, structural punctuation, and emoji each consume tokens in ways that don't track character count, which is why some prompts cost much more than their length suggests.
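The inconsistency with numbers is easy to reproduce with a toy tokenizer. The sketch below uses greedy longest-match over a made-up vocabulary, which is a simplification rather than real BPE merging; DIGIT_VOCAB and tokenize are hypothetical names, and the multi-digit entries stand in for digit strings that happened to be frequent in some training corpus.

```python
# Hypothetical vocabulary: every single digit, plus a few multi-digit
# chunks standing in for strings that were frequent during training.
DIGIT_VOCAB = set("0123456789") | {"12", "123", "345", "12345"}

def tokenize(text, vocab=DIGIT_VOCAB, max_len=5):
    """Greedy longest-match split (a simplification, not real BPE)."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab:
                tokens.append(piece)
                i += size
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return tokens

print(tokenize("12345"))  # ['12345']
print(tokenize("12346"))  # ['123', '4', '6']
```

Two numbers that differ by one get one token in the first case and three in the second, and the ones digit sits at a different token position each time. Any arithmetic the model learns has to work across that shifting layout.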

The whole family of strange LLM behaviors around letters, digits, and spelling is the same phenomenon: the model's input is not what humans think text is. Once you know the tokenizer, the failures stop looking like reasoning bugs and start looking like representation bugs.

Full breakdown of how tokens, embeddings, and the autoregressive loop fit together: How LLM Generation Works.