Temperature
Temperature controls how random an LLM's token choices are: low values make output focused and repeatable, high values make it varied and creative.
Temperature is the sampling parameter that controls how strongly an LLM favors its most probable next tokens: near 0, it almost always picks the single most probable token; at higher values, probability spreads across alternatives and output becomes more varied.
Mechanically, the model produces a score (logit) for every token in its vocabulary at each step; the logits are divided by the temperature before softmax turns them into the probability distribution that gets sampled. Low temperature sharpens the distribution (focused, repeatable, sometimes repetitive), high temperature flattens it (diverse, surprising, occasionally off the rails). It pairs with top-p, which truncates the candidate pool rather than reshaping it; common guidance is to tune one, not both.
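The scaling and truncation described above can be sketched in a few lines of plain Python. This is a minimal illustration, not any provider's implementation: the function names `softmax_with_temperature` and `top_p_filter` and the three example logits are hypothetical, chosen only to make the effect visible.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by temperature, then apply softmax."""
    if temperature <= 0:
        # Assumption of this sketch: temperature 0 is handled separately as greedy argmax.
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p, renormalized."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}

logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate tokens
sharp = softmax_with_temperature(logits, 0.5)    # low temperature: top token dominates
neutral = softmax_with_temperature(logits, 1.0)  # unscaled softmax
flat = softmax_with_temperature(logits, 2.0)     # high temperature: closer to uniform
```

Comparing `sharp[0]`, `neutral[0]`, and `flat[0]` shows the top token's share shrinking as temperature rises, while `top_p_filter(neutral, 0.8)` drops the least likely token without changing the relative order of the rest.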
The practical defaults: deterministic-leaning for anything machine-consumed (structured output, code, extraction), moderate for chat, higher only when variety is the point. One caveat for current models: reasoning models often fix or constrain sampling parameters during thinking, so check your provider's docs before assuming the dial does what it did in 2023.
Frequently asked questions
- What temperature should I use?
- Match it to the task's tolerance for variation: at or near 0 for extraction, classification, code, and anything tests or parsers consume; moderate (~0.5–0.8) for general assistance; higher (~0.8–1.2) for brainstorming and creative variety. When in doubt, lower it: most production failures from sampling come from too much randomness, not too little.
- Does temperature 0 make outputs deterministic?
- Mostly, but not perfectly. It selects the highest-probability token at every step (greedy decoding), which removes sampling randomness, but ties between tokens, floating-point nondeterminism, and provider-side batching can still produce occasional variation. Treat temperature 0 as "maximally consistent", not as a hard guarantee.
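The difference between temperature 0 and a positive temperature can be sketched as follows. This is an illustrative toy under stated assumptions: `next_token` is a hypothetical helper, the logits are made up, and the tie-breaking rule (lowest index wins) is a choice of this sketch. It does not reproduce the floating-point or batching effects that cause variation in real serving stacks.

```python
import math
import random

def next_token(logits, temperature, rng):
    """Return the index of the chosen token for one decoding step."""
    if temperature == 0:
        # Greedy decoding: no sampling at all; ties go to the lowest index here.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    # random.choices accepts unnormalized weights.
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate tokens
rng = random.Random(0)    # seeded so the sketch is repeatable

greedy = [next_token(logits, 0, rng) for _ in range(5)]
sampled = [next_token(logits, 1.0, rng) for _ in range(200)]
```

In this toy setting `greedy` always holds the same token index, while `sampled` contains a mix of indices weighted toward the most probable token.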
Related
- Top-p (Nucleus Sampling): Top-p sampling restricts an LLM's next-token choices to the smallest set whose probabilities sum to p, adaptively cutting the long tail of unlikely tokens.
- Structured Output: Structured output makes an LLM return data in a guaranteed shape (JSON matching your schema) so code can consume model responses without parsing prose.
- Token (LLM): A token is the unit LLMs read and write, a word fragment of roughly 3–4 characters in English. Models are priced, limited, and measured in tokens, not words.
- Few-Shot vs Chain-of-Thought vs Structured Prompting: What to Use When (2026): When to reach for few-shot examples, chain-of-thought reasoning, or structured/output-constrained prompting; a 2026 decision guide to the core techniques.