Reasoning Model
A reasoning model is an LLM trained to think before answering — generating internal reasoning tokens it can spend adaptively on hard problems.
A reasoning model is a language model trained to deliberate before responding — it generates internal "thinking" tokens that work the problem, then produces the answer, spending more thinking on harder problems.
The line of models that began in late 2024 turned chain-of-thought from a prompting trick into an architecture: reinforcement learning taught models that extended deliberation should change conclusions, not just narrate them. The practical consequence is test-time compute as a dial — the same model can answer instantly or think for thousands of tokens, trading latency and cost for reliability on hard problems. Modern frontier models blend the modes, with thinking budgets that adapt or can be set explicitly.
For builders the implications are concrete: thinking tokens are billed output tokens, so reasoning tiers change your cost envelope; prompts written for older models ("think step by step") may be redundant; and tier selection — when deliberation pays versus when it's overhead — becomes a real engineering decision, the same one Choosing the Right Model walks through for Claude's tiers.
Frequently asked questions
- How is a reasoning model different from a regular LLM?
- Training and inference budget. A standard model answers directly; a reasoning model first generates thinking tokens — exploring, checking, revising — then answers. It was trained (largely via reinforcement learning) for that deliberation to actually improve outcomes, and the thinking budget can scale with problem difficulty.
- When is a reasoning model worth the extra cost and latency?
- When the problem is genuinely multi-step and a wrong answer is expensive: hard debugging, architecture, math-flavored logic, intricate planning. For extraction, classification, and routine generation, thinking tokens are overhead — a fast standard tier does the job cheaper. Match the dial to the task, not the prestige.
Related
- Chain-of-Thought (CoT)Chain-of-thought prompting has a model work through intermediate reasoning steps before answering — improving accuracy on multi-step problems.
- Choosing the Right Model: Haiku vs Sonnet vs OpusHow to pick the right Claude model tier for an agent or task.
- InferenceInference is running a trained model to produce output — for LLMs, generating tokens one at a time. Its cost and latency define the economics of AI products.
- LLM Cost and Latency Engineering: Caching, Right-Sizing, and p95 BudgetsA practical playbook for cutting LLM cost and tail latency — caching, model right-sizing, prompt trimming, and enforced p95 budgets — without losing quality.
- Frontier ModelA frontier model is one of the most capable AI models available — the leading edge from labs like Anthropic, OpenAI, and Google, defining the state of the art.
- Mixture of Experts (MoE)MoE is a model architecture where a router activates only a few expert subnetworks per token — huge total capacity, a fraction of the compute per token.
- RLHF (Reinforcement Learning from Human Feedback)RLHF trains a model against human preferences: people rank outputs, a reward model learns the ranking, and the LLM is optimized to produce preferred responses.