LLMs can’t ingest raw characters directly. Models and neural networks only understand numbers, or more precisely, vectors.
Therefore, the first thing an LLM has to do is "somehow" convert text (or pictures, music, etc.) into vectors. This is the responsibility of the Tokenization and Embedding subcomponents.
Tokenization is the deterministic step that:
Segments text into tokens (subword/byte pieces),
Maps each token to an ID (so the model can fetch its vector from the embedding matrix),
Compresses text length so sequences fit context windows and run efficiently on GPUs,
Handles any language/rare word via subword or byte-level splits (no “OOV”/unknown word problem),
Stabilizes training with consistent segmentation across billions of examples (a short code sketch of the first two steps follows below).
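To make the first two steps concrete, here is a minimal sketch of the text → token IDs → embedding-lookup path. It assumes the tiktoken package and uses its cl100k_base encoding purely as a stand-in for whatever vocabulary a given model actually uses; the embedding matrix here is random toy data, not real model weights.

```python
# Minimal sketch: text -> token IDs -> embedding vectors.
# Assumes `pip install tiktoken numpy`; cl100k_base is just an illustrative vocabulary.
import numpy as np
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "Hello world!"
token_ids = enc.encode(text)          # deterministic: same text always gives the same IDs
print(token_ids)                      # one integer ID per token

# Toy embedding matrix: one row (vector) per token ID in the vocabulary.
d_model = 8                           # real models use hidden sizes in the thousands
embedding = np.random.rand(enc.n_vocab, d_model).astype(np.float32)

# "Embedding" is just a row lookup: each ID fetches its vector.
vectors = embedding[token_ids]        # shape: (len(token_ids), d_model)
print(vectors.shape)
```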
It is important to understand that there are many different ways to tokenize text, and in general the approaches in use today try to minimize the vocabulary (the number of tokens the model needs to generate meaningful output).
This minimization is in place for two reasons:
Performance of the model. If the vocabulary is too big, more computational power is required to train and run the model (a back-of-the-envelope sketch follows after this list). In the future this might not be an issue.
Cost. As mentioned below, today LLMs are priced by number of tokens, not words or sentences. This is closely related to the point above.
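As a rough illustration of the performance point (the figures below are assumptions, not any specific model's configuration), the input embedding matrix alone grows linearly with the vocabulary size:

```python
# Back-of-the-envelope: how vocabulary size inflates the embedding layer.
# d_model and the vocabulary sizes below are illustrative assumptions.
d_model = 4096

for vocab_size in (32_000, 128_000, 256_000):
    params = vocab_size * d_model          # input embedding parameters alone
    # Most models pay roughly the same again for the output (unembedding) projection.
    print(f"vocab {vocab_size:>7,}: ~{params / 1e6:,.0f}M embedding parameters")
```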
Different LLM providers have come up with different (although similar) tokenization approaches, and at the core they all try to minimize the number of tokens needed to handle the different words and languages (let's not forget these models handle multiple languages with the same token set, the vocabulary).
What is really interesting to me is that, performance and cost aside, whatever tokenization approach is used actually works, which means the process of learning and ending up with a functional LLM is largely independent of the tokenization used.
LLMs still need some way to turn bytes into vectors, but tokenization is mainly an efficiency and inductive-bias choice, not a fundamental requirement. There's good evidence for both sides:
Tokenizer choice matters. Multiple studies show measurable shifts in accuracy, reasoning (e.g., counting), and cost when you swap tokenizers; standard quality proxies (like fertility/parity) don't always predict downstream performance. In short: the tokenizer can help or hurt you. (Reference)
But tokenization isn’t strictly necessary. “Tokenization-free” or byte-level models (operating directly on raw bytes) work surprisingly well. ByT5 showed a near-vanilla Transformer can model bytes; Charformer improves efficiency at the character/byte level; newer architectures like MambaByte, MEGABYTE, and Byte Latent Transformer (BLT) push byte-level modelling to match or rival subword models under controlled compute. The trade-off is usually longer sequences and more compute unless you add hierarchical/patching tricks. (Reference)
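For intuition on what a byte-level model consumes, the snippet below shows raw UTF-8 bytes: the "vocabulary" shrinks to at most 256 symbols (plus specials), but sequences get longer.

```python
# What a tokenizer-free, byte-level model sees: one integer per UTF-8 byte.
text = "héllo 🐍"
byte_ids = list(text.encode("utf-8"))
print(byte_ids)          # values in 0..255; accents and emoji expand to several bytes
print(len(byte_ids))     # noticeably longer than a subword tokenization of the same text
```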
Bottom line: the emerging consensus is that LLMs largely do not need tokenization in principle, but currently they do in practice. With enough data/compute and the right architecture, models can learn from raw bytes; but the inductive bias of a good tokenizer can make learning much more sample- and compute-efficient, and can shape what the model finds easy or hard.
A few key points from the following article: Why Your Next LLM Might Not Have A Tokenizer | Towards Data Science
The article highlights how the tokenization algorithm is the only component that is not trained. Everything else in an LLM is learned, but tokenization is a simple, static process.
It also explains that spelling variations or imprecision in the input can generate different token sequences even though the meaning is the same. This forces the downstream Transformer network to learn all those patterns during the training phase.
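A quick way to see this, using tiktoken's cl100k_base purely as an illustrative tokenizer (not necessarily the one the article discusses):

```python
# A small typo yields a different token sequence, even though a human reads the same word.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for text in ("definitely", "definately", "DEFINITELY"):
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{text!r:>14} -> {pieces} ({len(ids)} tokens)")
```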
The examples below are shown as bracketed tokens; leading spaces stick to the next token. A runnable check with a real tokenizer follows after the examples.
"Hello world!"
→ [Hello][ world][!] → 3 tokens
Simple words often map 1:1; punctuation is its own token.
"Email me at test@example.com"
→ [Email][ me][ at][ test][@][example][.com] → 7 tokens
URLs/emails split into meaningful pieces; helps reuse subwords across domains.
"I have 10 apples."
→ [I][ have][ 10][ apples][.] → 5 tokens
Numbers and punctuation are distinct; good for copy-exact tasks.
"🐍 Python"
→ [🐍][ Python] → 2 tokens
Emoji are usually single tokens; sequences can take more.
"Line 1\nLine 2"
→ [Line][ 1][\n][Line][ 2] → 5 tokens
Whitespace/newlines are explicit tokens—prompt formatting matters.
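The splits above are illustrative; exact pieces and counts depend on the tokenizer. Here is a quick way to check them yourself, assuming tiktoken with the cl100k_base encoding (which may not match the model you actually call):

```python
# Re-run the examples above through a real tokenizer to see actual pieces and counts.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

examples = [
    "Hello world!",
    "Email me at test@example.com",
    "I have 10 apples.",
    "🐍 Python",
    "Line 1\nLine 2",
]
for text in examples:
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{len(ids):2} tokens  {pieces}")
```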
Anthropic’s tokenizer will yield very similar splits and counts, but not always identical ones. For tight budgets/limits, measure with the provider you’ll use.
Costs & limits: Pricing and context windows are in tokens, not words. Trimming boilerplate or long URLs lowers cost (a rough estimate follows after this list).
Quality & safety: Consistent splitting improves retrieval of the right embedding vectors, which improves downstream attention.
Multilingual & domain coverage: Subword/byte vocabularies cover rare names, code, emoji—no “unknown token” failures.
Latency & throughput: Fewer tokens → shorter sequences → faster decoding and cheaper batching.
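To connect tokens to money, here is a tiny cost estimate. The per-million-token prices are made-up placeholders; substitute your provider's actual rates.

```python
# Hypothetical pricing; tokens, not words, are the billing unit.
PRICE_PER_1M_INPUT_TOKENS = 3.00      # USD, placeholder
PRICE_PER_1M_OUTPUT_TOKENS = 15.00    # USD, placeholder

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of one call."""
    return (input_tokens / 1e6) * PRICE_PER_1M_INPUT_TOKENS + \
           (output_tokens / 1e6) * PRICE_PER_1M_OUTPUT_TOKENS

# e.g. a 2,000-token prompt that produces a 500-token answer
print(f"${estimate_cost(2_000, 500):.4f} per call")
```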
Keep prompts concise and remove redundant instructions; this results in fewer tokens.
Watch token-heavy regions (long links, code blocks, repeated disclaimers).
Prefer structured inputs (lists, key–value) over rambling prose.
If close to a limit, measure tokens with the target provider’s tokenizer before sending.
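A minimal pre-flight check along those lines, assuming tiktoken as a stand-in; for other providers, prefer their own token-counting tools, since counts will differ slightly:

```python
# Count tokens locally before sending a prompt, so you don't blow the context window.
import tiktoken

def fits_budget(prompt: str, max_tokens: int, encoding: str = "cl100k_base") -> bool:
    enc = tiktoken.get_encoding(encoding)
    n_tokens = len(enc.encode(prompt))
    print(f"{n_tokens} tokens (budget: {max_tokens})")
    return n_tokens <= max_tokens

fits_budget("Summarize the following report in three bullet points: ...", 4_000)
```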