Top-p (Nucleus Sampling)
Top-p sampling restricts an LLM's next-token choices to the smallest set whose cumulative probability reaches p, adaptively cutting the long tail of unlikely tokens.
Top-p (nucleus sampling) limits the model's next-token candidates to the smallest set of most-likely tokens whose cumulative probability reaches p. At p = 0.9, sampling happens only among the tokens covering the top 90% of probability mass; the unlikely tail is discarded and the surviving probabilities are renormalized before sampling.
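As a rough illustration of that definition, the truncation step can be sketched in a few lines of Python. The function names (`top_p_filter`, `sample_top_p`) and the toy probability table are invented for this example; real inference stacks operate on logits over the full vocabulary and may differ in details such as tie-breaking.

```python
import random

def top_p_filter(probs, p):
    """Return the nucleus: the smallest set of most-likely tokens whose
    cumulative probability reaches p, renormalized to sum to 1."""
    # Rank tokens from most to least likely.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus = []
    cumulative = 0.0
    for token, prob in ranked:
        nucleus.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break  # the nucleus now covers at least p of the mass
    total = sum(prob for _, prob in nucleus)
    return {token: prob / total for token, prob in nucleus}

def sample_top_p(probs, p, rng=random):
    """Draw one token from the renormalized nucleus."""
    nucleus = top_p_filter(probs, p)
    tokens = list(nucleus)
    weights = list(nucleus.values())
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical next-token distribution (illustrative numbers only).
probs = {"the": 0.5, "a": 0.3, "this": 0.12, "zebra": 0.05, "qux": 0.03}
nucleus = top_p_filter(probs, 0.9)
# The nucleus keeps "the", "a", "this" (0.92 cumulative); "zebra" and "qux" are dropped.
```

Calling `sample_top_p(probs, 0.9)` then only ever returns one of the three surviving tokens.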
Its virtue over a fixed top-k cutoff is adaptivity: when the model is confident, the nucleus may be two tokens; when many continuations are plausible, it widens automatically. That trims the failure mode of pure temperature sampling — rare, incoherent tokens occasionally getting picked — while preserving variety where it's genuine.
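The adaptivity argument can be made concrete with a small sketch. The two distributions below are made-up numbers, and `nucleus_size` and `top_k_mass` are hypothetical helper names chosen for this example.

```python
def nucleus_size(probs, p):
    """Count how many of the most-likely tokens are needed to reach mass p."""
    cumulative = 0.0
    for count, prob in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += prob
        if cumulative >= p:
            return count
    return len(probs)

def top_k_mass(probs, k):
    """Probability mass covered by a fixed top-k cutoff."""
    return sum(sorted(probs, reverse=True)[:k])

# Illustrative distributions (assumed values, not measured from any model).
confident = [0.70, 0.25, 0.02, 0.01, 0.01, 0.01]  # model is nearly sure
uncertain = [0.25, 0.20, 0.20, 0.15, 0.12, 0.08]  # many plausible options

size_confident = nucleus_size(confident, 0.9)  # 2 tokens
size_uncertain = nucleus_size(uncertain, 0.9)  # 5 tokens

# A fixed k = 3 keeps a 0.02 tail token in the confident case and
# covers only about 0.65 of the mass in the uncertain case.
mass_confident = top_k_mass(confident, 3)
mass_uncertain = top_k_mass(uncertain, 3)
```

With the same p, the candidate set shrinks to two tokens when the model is confident and grows to five when it is not, whereas a fixed k is too loose in one case and too tight in the other.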
In practice top-p is a set-and-forget parameter (defaults around 0.9–1.0), tuned downward when outputs wander, with temperature as the primary creativity dial. The same caveat applies as everywhere in sampling-land: machine-consumed output wants minimal randomness, and reasoning models may constrain these parameters — read the provider's current docs rather than cargo-culting 2023 settings.
Frequently asked questions
- What's the difference between top-p and temperature?
- Temperature reshapes the whole probability distribution (how bold the model is across all options); top-p truncates it (which options are even on the table — the smallest set summing to probability p). Temperature changes relative weights; top-p removes the tail. Standard advice: adjust one and leave the other at its default.
- Why 'nucleus' sampling?
- The candidate set it keeps — the smallest group of tokens whose combined probability reaches p — is called the nucleus. Its size adapts to the model's confidence: a near-certain next token yields a tiny nucleus, an open-ended continuation yields a large one. That adaptivity is its advantage over fixed top-k.
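The distinction drawn in these answers, reshaping versus truncating, can be sketched as follows. The logits are invented, the helper names are hypothetical, and applying temperature before top-p is an assumption here: it is a common ordering, but implementations vary.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; temperature rescales every logit."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_mask(probs, p):
    """Mark which token indices survive top-p truncation."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = [False] * len(probs)
    cumulative = 0.0
    for i in order:
        keep[i] = True
        cumulative += probs[i]
        if cumulative >= p:
            break
    return keep

logits = [4.0, 3.0, 1.0, 0.0]  # illustrative values only

cool = softmax_with_temperature(logits, 0.5)  # sharper distribution
warm = softmax_with_temperature(logits, 1.5)  # flatter distribution

# Temperature alone never removes a token: every probability stays > 0.
# Top-p then cuts the tail, and the nucleus size follows the reshaped weights.
cool_nucleus = top_p_mask(cool, 0.9)  # [True, True, False, False]
warm_nucleus = top_p_mask(warm, 0.9)  # [True, True, True, False]
```

Because the two settings interact like this (a higher temperature widens the nucleus at a fixed p), changing both at once makes the effect harder to reason about, which is one way to read the advice to adjust only one of them.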
Related
- Temperature: Temperature controls how random an LLM's token choices are: low values make output focused and repeatable, high values make it varied and creative.
- Token (LLM): A token is the unit LLMs read and write, a word fragment of roughly 3–4 characters in English. Models are priced, limited, and measured in tokens, not words.
- Structured Output: Structured output makes an LLM return data in a guaranteed shape (JSON matching your schema) so code can consume model responses without parsing prose.