# Perplexity

> Perplexity measures how well a language model predicts a text sample — the exponential of its average per-token negative log-likelihood. Lower is better.

**Perplexity is an intrinsic measure of how well a language model predicts a text sample — the exponential of the average negative log-likelihood it assigns per [token](/glossary/llm-token) — so it acts like the model's average "branching factor," and lower is better.**

Intuitively, perplexity is how *surprised* the model is by the next token on average. A perplexity of 10 means the model is, in effect, choosing among about 10 equally likely options at each step. As a model trains and improves, it assigns higher probability to the tokens that actually appear, the negative log-likelihood drops, and perplexity falls toward 1. Because it only needs the model's own probabilities on a held-out text, it is cheap to compute during [inference](/glossary/inference) — no human labels or graders required.
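
As a minimal sketch of the definition above, the snippet below computes perplexity from a list of per-token log-probabilities. The function name `perplexity` and the sample probabilities are illustrative assumptions, not part of any particular library; in practice the log-probabilities would come from the model's output on held-out text.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(average negative log-likelihood per token).

    token_log_probs: natural-log probabilities the model assigned to each
    token that actually appeared in the held-out text.
    """
    if not token_log_probs:
        raise ValueError("need at least one token")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


# A model that spreads probability evenly over 10 options at every step.
uniform = [math.log(1 / 10)] * 5

# Made-up probabilities for a model that puts more mass on the tokens
# that actually appear.
confident = [math.log(p) for p in (0.9, 0.8, 0.95, 0.7, 0.85)]

print(round(perplexity(uniform), 6))                # 10.0
print(perplexity(confident) < perplexity(uniform))  # True
```

The uniform case reproduces the "10 equally likely options" reading, and a model that assigned probability 1 to every token would score exactly 1.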

That cheapness makes perplexity useful for comparing checkpoints of the same model, picking training hyperparameters, or detecting domain mismatch (perplexity on legal text spikes for a model trained on chat). It is also a common proxy when validating a [distilled](/glossary/distillation) or quantized model against its parent.
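
One way such a comparison might look, assuming each model variant has already been scored on the same tokenized held-out text: the helper `compare_to_baseline`, the run names, and the probabilities below are all hypothetical and only illustrate the bookkeeping.

```python
import math


def perplexity(token_log_probs):
    """exp of the average negative log-likelihood per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def compare_to_baseline(runs, baseline):
    """Return {name: (perplexity, relative change vs. the baseline run)}.

    runs maps a run name to the natural-log probabilities that run assigned
    to the same held-out tokens, so every list must have the same length.
    """
    n_tokens = len(runs[baseline])
    if any(len(lp) != n_tokens for lp in runs.values()):
        raise ValueError("all runs must score the same tokenized text")
    base_ppl = perplexity(runs[baseline])
    report = {}
    for name, log_probs in runs.items():
        ppl = perplexity(log_probs)
        report[name] = (ppl, ppl / base_ppl - 1.0)
    return report


# Hypothetical per-token probabilities on one shared held-out sample.
runs = {
    "parent": [math.log(p) for p in (0.50, 0.40, 0.60, 0.30)],
    "quantized": [math.log(p) for p in (0.48, 0.38, 0.59, 0.28)],
}

report = compare_to_baseline(runs, "parent")
for name, (ppl, rel) in report.items():
    print(f"{name}: perplexity={ppl:.3f} change={rel:+.1%}")
```

How large a relative increase is acceptable is a judgment call for the project; the sketch only reports the number.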

The key caveat: perplexity scores prediction of a reference text, not task quality, helpfulness, or factuality; a fluent, confidently wrong answer can have low perplexity. Scores are also not comparable across different tokenizers, since the per-token unit shifts, or across different datasets, since the text being predicted differs. For "does this actually work," use task-level [evals](/glossary/eval-dataset); see [How to Write LLM Evals](/guides/evaluation/write-llm-evals).
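
To make the tokenizer point concrete, the sketch below scores the same string under two hypothetical tokenizations that assign it the same total probability. Per-token perplexity differs sharply, while bits per byte, a tokenizer-independent normalization that is not part of the definition above and is shown here only as one common workaround, comes out the same. The function names and all numbers are illustrative assumptions.

```python
import math


def perplexity(token_log_probs):
    """exp of the average negative log-likelihood per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def bits_per_byte(token_log_probs, text):
    """Total negative log-likelihood in bits, divided by the UTF-8 byte count."""
    total_nll_nats = -sum(token_log_probs)
    n_bytes = len(text.encode("utf-8"))
    return total_nll_nats / (math.log(2) * n_bytes)


text = "perplexity"  # 10 bytes in UTF-8

# Hypothetical: a coarse tokenizer splits the text into 2 tokens, a fine one
# into 5. Both models give the whole string the same probability, 1/1024.
coarse = [math.log(1 / 32)] * 2  # 1/32 * 1/32 = 1/1024
fine = [math.log(1 / 4)] * 5     # (1/4) ** 5  = 1/1024

print(round(perplexity(coarse), 6), round(perplexity(fine), 6))  # 32.0 4.0
print(round(bits_per_byte(coarse, text), 6))                     # 1.0
print(round(bits_per_byte(fine, text), 6))                       # 1.0
```

The two models are equally good at predicting the string, yet their per-token perplexities differ by a factor of eight purely because of how the text was split.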

---

_Source: https://agentscamp.com/glossary/perplexity — Term on AgentsCamp._
