Test-time Compute

Definition

Test-time compute means spending additional computation during inference, after training, to improve answer quality. It is central to discussions about reasoning models, search, verification, and agent performance.

AI progress is often described in terms of training data, model size, and training compute. Test-time compute shifts attention to what happens after training. Test-time compute is the extra computation spent during inference to improve the quality, reliability, or confidence of a model's answer.

Training compute versus test-time compute

Training compute is used to create the model. Test-time compute is used each time the model answers a question or performs a task. It can involve generating multiple candidate answers, checking intermediate steps, running tools, searching for evidence, or spending more steps on planning. This makes it possible to improve results without changing the model weights.

Why it matters

Reasoning models and agents often rely on more inference-time work. A coding agent may write code, run tests, inspect failures, and revise. A research assistant may search, compare sources, and synthesize. These workflows can be more accurate than a single-shot answer, but they also cost more and take longer.

How to read AI news about it

When a model shows large gains on hard benchmarks, ask whether the gain comes from the base model, the inference procedure, tool use, or multiple attempts. Also check the latency and cost tradeoff. Extra compute may be worthwhile for high-value tasks such as code repair or analysis, but excessive for simple rewriting or classification.

Watch-outs

More compute does not guarantee correctness. A system can spend additional steps reinforcing a bad assumption or searching in the wrong place. The practical question is whether extra inference work is targeted, measurable, and paired with verification. Test-time compute is best read as a performance lever with costs, not as a free upgrade.