Tokenization
Tokenization is the process of splitting text into tokens — the sub-word units a model actually reads and generates — and mapping each one to an integer ID.
Before a language model sees your prompt, a tokenizer breaks it into pieces drawn from a fixed vocabulary, usually with a scheme like byte-pair encoding (BPE) that merges frequent character sequences into single units. Common words become one token; rare words, names, and made-up strings split into several. Each unit maps to an integer ID, and those IDs — not the raw characters — are what the model embeds and predicts. On average one token is roughly three-quarters of an English word.
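The merge idea behind BPE can be sketched in a few lines. This is a toy illustration under simplifying assumptions, not any production tokenizer: it works on whole characters rather than bytes, trains on a tiny made-up corpus, and the names `train_bpe`, `merge_pair`, `build_vocab`, and `encode` are hypothetical helpers introduced for this example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy BPE)."""
    # Start with every word split into single characters.
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent pair of symbols occurs across the corpus.
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # Merge the most frequent pair into a single symbol everywhere.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        updated = Counter()
        for symbols, freq in words.items():
            updated[merge_pair(symbols, best)] += freq
        words = updated
    return merges


def build_vocab(corpus, merges):
    """Assign an integer ID to every base character and every merged symbol."""
    vocab = {}
    for ch in sorted({ch for word in corpus for ch in word}):
        vocab.setdefault(ch, len(vocab))
    for a, b in merges:
        vocab.setdefault(a + b, len(vocab))
    return vocab


def encode(word, merges, vocab):
    """Apply the learned merges in order, then map each symbol to its ID.

    Toy limitation: characters never seen in training raise KeyError.
    """
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return [vocab[s] for s in symbols]


# Made-up corpus: "the" is frequent, "hat" is rare.
corpus = ["the"] * 10 + ["then"] * 3 + ["hat"]
merges = train_bpe(corpus, num_merges=2)
vocab = build_vocab(corpus, merges)

print(encode("the", merges, vocab))   # one token
print(encode("then", merges, vocab))  # two tokens
print(encode("hat", merges, vocab))   # three tokens
```

With only two merges, the frequent word "the" collapses into a single ID while the rare word "hat" stays split into three, which is the same pattern the paragraph above describes at full scale.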
This is why character counts and token counts diverge, and why pricing and context window limits are measured in tokens rather than words. Some inputs tokenize less efficiently: code, rare or non-English words, and unusual whitespace all produce more tokens per character, quietly inflating cost and length.
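For quick budgeting, the three-quarters-of-a-word rule of thumb can be turned into a rough estimate. The sketch below is only an approximation for English prose and will undercount for code or non-English text; `estimate_tokens` and `estimate_cost` are hypothetical helpers, and the price passed in is a placeholder value, not any vendor's actual rate.

```python
import math

# Rule of thumb for English prose: one token is roughly 0.75 words.
WORDS_PER_TOKEN = 0.75


def estimate_tokens(text: str) -> int:
    """Rough token estimate from a whitespace word count (English prose only)."""
    words = len(text.split())
    return math.ceil(words / WORDS_PER_TOKEN)


def estimate_cost(tokens: int, price_per_million: float) -> float:
    """Cost of `tokens` at a given price per million tokens."""
    return tokens / 1_000_000 * price_per_million


prompt = "word " * 750                 # a 750-word placeholder prompt
tokens = estimate_tokens(prompt)       # about 1,000 tokens by the rule of thumb
print(tokens, estimate_cost(tokens, price_per_million=2.0))  # 2.0 is a made-up price
```

An estimate like this is useful for sanity checks only; a real count needs the tokenizer of the model being called.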
A key caveat: each model family has its own tokenizer, so token counts are not comparable across providers. Always count against the model you actually call — see the LLM API pricing guide for what that means for your bill.
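To see why counts differ between tokenizers, here is a minimal sketch that splits the same string with two made-up vocabularies. It uses greedy longest-match splitting rather than BPE to keep the example short, and `VOCAB_A`, `VOCAB_B`, and `greedy_tokenize` are hypothetical names that do not correspond to any real provider's tokenizer.

```python
def greedy_tokenize(text, vocab):
    """Split text by repeatedly taking the longest vocabulary entry that matches.

    Single characters are always accepted as a fallback, so any input can be split.
    """
    tokens, i = [], 0
    max_len = max(len(entry) for entry in vocab)
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


# Two invented vocabularies standing in for two different model families.
VOCAB_A = {"token", "ization", "ize", " ", "count"}
VOCAB_B = {"tok", "en", "iz", "ation", " ", "co", "unt"}

text = "tokenization count"
tokens_a = greedy_tokenize(text, VOCAB_A)
tokens_b = greedy_tokenize(text, VOCAB_B)

print(len(tokens_a), tokens_a)  # 4 tokens
print(len(tokens_b), tokens_b)  # 7 tokens
```

The input is identical, yet one vocabulary yields 4 tokens and the other 7, so a per-token price or a context limit means something different under each.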
Frequently asked questions
- Why don't character counts equal token counts?
- Because a tokenizer groups characters into sub-word units, not single letters. A common word like 'the' is one token; a rare word splits into several. On average one token is about 3–4 characters or 0.75 words of English, so the same text maps to far fewer tokens than characters — and the ratio shifts with the content.
- Are token counts the same across providers?
- No. Each model family ships its own tokenizer with its own vocabulary, so the same string yields different token counts on different models. You can't compare prices or context limits across providers by raw token numbers — count with the specific model's tokenizer.
Related
- Token (LLM): A token is the unit LLMs read and write, a word fragment of roughly 3–4 characters in English. Models are priced, limited, and measured in tokens, not words.
- Context Window: The context window is the maximum text, measured in tokens, that an LLM can consider at once: prompt, conversation, documents, and its own output combined.
- Embedding: An embedding is a vector of numbers representing text's meaning, placed so similar texts land close together. It is the foundation of semantic search and RAG.
- Inference: Inference is running a trained model to produce output. For LLMs, that means generating tokens one at a time. Its cost and latency define the economics of AI products.
- LLM API Pricing in 2026: Every Major Model Compared. Per-million-token prices for Claude, GPT, Gemini, DeepSeek, Mistral, and Grok, plus caching and batch discounts, verified against vendor pricing pages.
- Attention Mechanism: Attention lets a model weigh how relevant every other token is to each token, building a context-aware representation as a weighted blend of their values.
- Transformer: The neural-network architecture (Vaswani et al., 2017) that uses self-attention to process sequences in parallel, the basis of nearly all modern LLMs.