Extended Thinking
Extended thinking is the reasoning tokens a model generates before its final answer, trading latency and cost for higher accuracy on hard problems.
Extended thinking is a model's ability to generate a stream of internal reasoning tokens — sometimes called thinking or reasoning tokens — before committing to a final answer, spending more computation to solve harder problems.
It's the defining feature of reasoning models: Claude's extended thinking and OpenAI's o-series both work this way, producing a separate block of step-by-step reasoning that the model uses to check its own work before responding. This is the same idea as chain-of-thought, but native to the model rather than prompted — and typically you set a thinking budget (a token cap) that scales how long the model deliberates.
The tradeoff is direct: more thinking means more tokens, higher latency, and higher cost, in exchange for measurably better accuracy on math, planning, and complex coding. The practical caveat is that thinking isn't free quality — on simple tasks it adds delay and expense for no gain, and an overlarge budget can let a model overthink a question it would have nailed instantly. Match the budget to the problem.
Frequently asked questions
- What's the difference between extended thinking and chain-of-thought prompting?
- Chain-of-thought is a prompting technique you trigger with instructions like 'think step by step.' Extended thinking is a built-in model capability: the model produces a dedicated stream of reasoning tokens before answering, often with a budget you control. The mechanism overlaps, but extended thinking is native rather than coaxed.
- When is extended thinking worth the extra cost?
- Use it for math, multi-step planning, complex coding, and analysis where one wrong step derails the answer. Skip it for simple lookups, formatting, or chat — the latency and token cost buy nothing there. Tune the thinking budget to the difficulty of the task rather than maxing it out by default.
Related
- Reasoning ModelA reasoning model is an LLM trained to think before answering — generating internal reasoning tokens it can spend adaptively on hard problems.
- Chain-of-Thought (CoT)Chain-of-thought prompting has a model work through intermediate reasoning steps before answering — improving accuracy on multi-step problems.
- Token (LLM)A token is the unit LLMs read and write — a word fragment of roughly 3–4 characters in English. Models are priced, limited, and measured in tokens, not words.
- Test-Time ComputeTest-time compute is spending more computation at inference — longer reasoning, sampling, or search — to improve answers without retraining the model.
- Context EngineeringTreating the context window as a finite budget — what to load, what to leave out, and when to reset.
- Context EngineeringContext engineering is the discipline of curating exactly what enters an LLM's context window so it has the right information and nothing else.