Test-Time Compute
Test-time compute is spending more computation at inference — longer reasoning, sampling, or search — to improve answers without retraining the model.
Test-time compute is the strategy of spending more computation at inference time — generating longer reasoning, sampling many candidate answers, or searching over solutions — to get better results from a fixed model without retraining it.
It's the scaling axis behind reasoning models. For years, gains came almost entirely from training larger models on more data; work on test-time compute showed that a fixed model can also improve when it is given more room to work at the moment it answers. In practice that means extended chain-of-thought reasoning (see extended thinking), drawing multiple samples and aggregating them, or running a search procedure over candidate steps.
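As a minimal sketch of the "draw multiple samples and aggregate" option, the Python below takes a majority vote over sampled answers. The names `majority_vote` and `sample_answer` are hypothetical, and the sampler is a canned stand-in for a real model call, which this entry does not specify.

```python
from collections import Counter
from itertools import cycle
from typing import Callable


def majority_vote(sample_answer: Callable[[], str], n_samples: int) -> str:
    """Draw n_samples answers and return the most frequent one."""
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    answers = [sample_answer() for _ in range(n_samples)]
    # most_common(1) yields [(answer, count)]; ties go to the answer seen first.
    return Counter(answers).most_common(1)[0][0]


# Hypothetical stand-in for a model call: replays a fixed list of answers.
_canned = cycle(["42", "41", "42", "17", "42"])


def sample_answer() -> str:
    return next(_canned)


if __name__ == "__main__":
    print(majority_vote(sample_answer, n_samples=5))
```

With the canned answers above, "42" appears three times out of five and wins the vote. In a real system each call would be a separate sampled generation, so the cost grows roughly linearly with `n_samples`.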
This matters because it's tunable per query: hard problems get more compute, easy ones get less, and you can buy accuracy on demand instead of retraining. The tradeoff is cost and latency: every extra reasoning token or sampled candidate is paid for at inference, and returns diminish. Past some budget, more thinking stops helping and only slows the response, so test-time compute is a dial to set against task difficulty, not a constant to crank.
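To make the diminishing returns concrete, here is a rough sketch under two simplifying assumptions that are not part of this entry: each sample is independently correct with probability `p`, and a perfect checker can recognize a correct answer. Under those assumptions the chance that at least one of `n` samples is correct is 1 - (1 - p)^n. The function names and the `max_n` budget cap are hypothetical.

```python
def coverage(p: float, n: int) -> float:
    """Probability that at least one of n independent samples is correct."""
    return 1.0 - (1.0 - p) ** n


def samples_needed(p: float, target: float, max_n: int) -> int:
    """Smallest n whose coverage reaches target, capped at max_n."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    for n in range(1, max_n + 1):
        if coverage(p, n) >= target:
            return n
    # Budget exhausted: the target is not reachable within max_n samples.
    return max_n


if __name__ == "__main__":
    # Easy query (p = 0.5) versus hard query (p = 0.05), same accuracy target.
    print(samples_needed(0.5, target=0.9, max_n=16))
    print(samples_needed(0.05, target=0.9, max_n=16))
    # Marginal gain of the 2nd sample versus the 8th sample at p = 0.5.
    print(coverage(0.5, 2) - coverage(0.5, 1))
    print(coverage(0.5, 8) - coverage(0.5, 7))
```

In this toy model the easy query reaches the target with 4 samples while the hard one exhausts the budget of 16, and the eighth sample adds far less than the second. Real samples are rarely independent and real checkers are imperfect, so treat the numbers as an illustration of the shape of the curve, not a prediction.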
Frequently asked questions
- How is test-time compute different from training-time scaling?
- Training-time scaling makes a model smarter by using more data and compute to set its weights once, up front. Test-time compute leaves the weights fixed and instead spends more effort per query at inference. It's a separate scaling axis: a fixed model can produce better answers simply by being allowed to think, sample, or search more.
- What are common ways to spend test-time compute?
- The main techniques are longer chains of reasoning, sampling many candidate answers and selecting among them (for example by majority vote or best-of-N), and explicit search over solution paths. Reasoning models build this extra computation into their own generation. All of them cost more tokens and latency per query, so you trade speed and money for accuracy.
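Best-of-N, mentioned in the answer above, can be sketched in a few lines of Python. This is an illustration only: `best_of_n` and `verify_product` are hypothetical names, and the scorer is a toy exact-match check standing in for whatever verifier or reward model a real system would use.

```python
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the highest-scoring candidate; ties go to the earliest one."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=score)


def verify_product(candidate: str) -> float:
    """Toy scorer (hypothetical): 1.0 if the candidate equals 17 * 24, else 0.0."""
    try:
        return 1.0 if int(candidate) == 17 * 24 else 0.0
    except ValueError:
        # Non-numeric candidates cannot be correct.
        return 0.0


if __name__ == "__main__":
    # Three hypothetical sampled answers to "What is 17 * 24?"
    print(best_of_n(["398", "408", "418"], verify_product))
```

Unlike majority vote, best-of-N depends on the quality of the scorer: it can pick a correct answer that only one sample produced, but a weak scorer can also promote a wrong one.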
Related
- Reasoning Model: A reasoning model is an LLM trained to think before answering, generating internal reasoning tokens it can spend adaptively on hard problems.
- Chain-of-Thought (CoT): Chain-of-thought prompting has a model work through intermediate reasoning steps before answering, improving accuracy on multi-step problems.
- Extended Thinking: Extended thinking is the reasoning tokens a model generates before its final answer, trading latency and cost for higher accuracy on hard problems.
- Tree of Thoughts: Tree of Thoughts is a prompting method that explores multiple reasoning branches as a search tree, evaluating and backtracking among them.