Token (LLM)
A token is the basic unit a language model reads and writes — typically a word fragment averaging 3–4 characters of English text. Everything about LLMs is denominated in tokens: pricing, context limits, and speed.
Models don't see letters or words; a tokenizer splits text into pieces from a fixed vocabulary, and the model predicts one token at a time. A common word such as "understanding" is often a single token, while a rarer one such as "unfathomable" might split into three; the exact split depends on the tokenizer. The practical conversion: about 100 tokens per 75 English words. Code and non-English text usually take more tokens for the same amount of text.
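The conversion above can be turned into a rough estimator. This is a minimal sketch, not a tokenizer: the function names are hypothetical, the 0.75 words-per-token ratio comes from the rule of thumb above, and the 4-characters-per-token figure is an assumption taken from the upper end of the 3–4 character average. Real counts can differ noticeably, especially for code and non-English text.

```python
import math

# Rule of thumb from the text: ~100 tokens per 75 English words.
WORDS_PER_TOKEN = 0.75
# Assumption: upper end of the 3-4 characters-per-token average for English.
CHARS_PER_TOKEN = 4.0


def estimate_tokens_from_words(word_count: int) -> int:
    """Estimate the token count for a given number of English words."""
    return math.ceil(word_count / WORDS_PER_TOKEN)


def estimate_tokens(text: str) -> int:
    """Estimate tokens for a string, taking the larger of two heuristics.

    Using the larger estimate is a deliberately conservative choice:
    it is safer to over-budget than to overflow a limit.
    """
    by_words = len(text.split()) / WORDS_PER_TOKEN
    by_chars = len(text) / CHARS_PER_TOKEN
    return math.ceil(max(by_words, by_chars))


if __name__ == "__main__":
    print(estimate_tokens_from_words(75))
    print(estimate_tokens("Models don't see letters or words."))
```

For billing or hard limits, an estimate like this is only a planning aid; the count that matters is the one produced by the model's own tokenizer.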
Tokens matter because they're the meter on everything. API pricing is per million input and output tokens (output costing several times more — generation is sequential, reading is parallel). The context window is a token budget. Throughput is tokens per second. So the everyday engineering moves — trimming prompts, caching repeated prefixes, summarizing history — are all token economics; the full playbook is in LLM Cost and Latency Engineering.
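The three meters described above (price, context budget, throughput) reduce to simple arithmetic. The sketch below is illustrative only: the `Pricing` class, the function names, and the example prices and limits are all hypothetical, not any provider's actual rates or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical prices in USD per million tokens."""
    input_per_million: float
    output_per_million: float


def request_cost(input_tokens: int, output_tokens: int, pricing: Pricing) -> float:
    """Cost of one request: input and output tokens are billed separately."""
    total = (input_tokens * pricing.input_per_million
             + output_tokens * pricing.output_per_million)
    return total / 1_000_000


def fits_context(input_tokens: int, max_output_tokens: int, context_window: int) -> bool:
    """The prompt and the reserved output share one token budget."""
    return input_tokens + max_output_tokens <= context_window


def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Approximate generation time at a given throughput."""
    return output_tokens / tokens_per_second


if __name__ == "__main__":
    # Made-up example numbers: output priced at 5x input.
    pricing = Pricing(input_per_million=3.0, output_per_million=15.0)
    print(request_cost(2_000, 500, pricing))
    print(fits_context(2_000, 500, context_window=8_192))
    print(generation_seconds(500, tokens_per_second=50.0))
```

With these made-up numbers, the 500 output tokens cost more than the 2,000 input tokens, which is why capping output length is often as effective as trimming the prompt.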
Frequently asked questions
- How many tokens is a word?
- In English, roughly 1.3 tokens per word on average, which is the same as about 0.75 words per token or 100 tokens per 75 words. Common words are single tokens; rare words split into pieces; code, non-English text, and unusual formatting often cost more tokens per character. Exact counts come from the model's own tokenizer.
- Why are tokens what I pay for?
- Because tokens are what the model actually processes: each one costs compute on the way in (reading your prompt) and on the way out (generating the answer). That's why API pricing is per million input and output tokens, why output tokens cost more, and why trimming context is the most direct cost lever.
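Trimming context, the cost lever mentioned in the last answer, can be sketched as keeping only the most recent messages that fit a token budget. This is one simple strategy among several, and everything here is an assumption for illustration: `trim_history` and `rough_token_count` are hypothetical names, and the 4-characters-per-token counter is a stand-in that should be swapped for the model's real tokenizer.

```python
import math
from typing import Callable, List


def rough_token_count(text: str) -> int:
    """Stand-in counter. Assumption: about 4 characters per token."""
    return math.ceil(len(text) / 4)


def trim_history(
    messages: List[str],
    budget: int,
    count: Callable[[str], int] = rough_token_count,
) -> List[str]:
    """Keep the newest messages whose combined token count fits the budget.

    Walks backwards from the most recent message and stops at the first
    one that would exceed the budget, so the kept history is contiguous.
    """
    kept: List[str] = []
    used = 0
    for message in reversed(messages):
        cost = count(message)
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


if __name__ == "__main__":
    history = ["a" * 40, "b" * 40, "c" * 40]  # about 10 tokens each
    print(len(trim_history(history, budget=25)))
```

Dropping old messages is the bluntest option; summarizing the dropped portion instead preserves more information at the price of an extra model call.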
Related
- Context Window. The context window is the maximum text, measured in tokens, that an LLM can consider at once: prompt, conversation, documents, and its own output combined.
- Inference. Inference is running a trained model to produce output; for LLMs, that means generating tokens one at a time. Its cost and latency define the economics of AI products.
- Prompt Caching. Prompt caching reuses the computed state of a repeated prompt prefix across requests, dramatically cutting cost and time-to-first-token for stable context.
- LLM Cost and Latency Engineering: Caching, Right-Sizing, and p95 Budgets. A practical playbook for cutting LLM cost and tail latency (caching, model right-sizing, prompt trimming, and enforced p95 budgets) without losing quality.
- Temperature. Temperature controls how random an LLM's token choices are: low values make output focused and repeatable, high values make it varied and creative.
- Token Streaming. Token streaming delivers model output incrementally as it's generated, via SSE or websockets, so users see text immediately instead of waiting.
- Top-p (Nucleus Sampling). Top-p sampling restricts an LLM's next-token choices to the smallest set whose probabilities sum to at least p, adaptively cutting the long tail of unlikely tokens.