# Eval Dataset

> An eval dataset is the curated set of test cases — inputs with expected outcomes — that an LLM application's quality is measured against.

**An eval dataset is the fixed, curated collection of test cases an LLM feature is scored against. Each case pairs an input with its expected outcome (a reference answer, a rubric, or a constraint), and all other evaluation machinery stands on it.**

It's the LLM era's test suite, with one difference from unit tests: outcomes are often judged (by [LLM-as-judge](/glossary/llm-as-judge) or by humans against rubrics) rather than exact-matched, which makes *case quality* even more load-bearing, because a vague expected outcome produces a meaningless score. The curation discipline: mine **real traffic** for the head of the distribution, promote **every production failure** into a permanent regression case, and [synthesize](/glossary/synthetic-data) the edge cases your logs haven't produced yet. Keep the dataset versioned: scores from different versions aren't comparable, so an unrecorded change silently invalidates every before/after comparison.
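As a rough illustration of the case shape and the versioning point, here is a minimal Python sketch. The names `EvalCase`, `dataset_version`, and `promote_failure`, along with the `kind` and `source` labels, are hypothetical choices for this example rather than part of any particular eval framework. It assumes a content hash is an acceptable version identifier, so any change to the cases yields a new version string.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EvalCase:
    """One test case: an input plus its expected outcome (hypothetical schema)."""
    case_id: str
    input_text: str
    expected: str   # a reference answer, a rubric, or a constraint
    kind: str       # assumed labels: "reference", "rubric", "constraint"
    source: str     # assumed labels: "traffic", "regression", "synthetic"


def dataset_version(cases):
    """Derive a short version id from the dataset's content.

    Cases are sorted by case_id first, so reordering does not change the
    version, while adding, removing, or editing a case does.
    """
    canonical = json.dumps(
        sorted((asdict(c) for c in cases), key=lambda d: d["case_id"]),
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def promote_failure(cases, case_id, input_text, expected, kind="rubric"):
    """Return a new list with a production failure added as a regression case."""
    if any(c.case_id == case_id for c in cases):
        raise ValueError(f"duplicate case_id: {case_id}")
    return cases + [EvalCase(case_id, input_text, expected, kind, "regression")]
```

Storing the version id next to every score makes it obvious when two runs were measured against different datasets and should not be compared directly.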

Its strategic role is bigger than testing: the eval dataset is where a team's *definition of good* becomes executable — the artifact that turns "the new prompt feels better" into a number that gates releases. Building one from zero is the first half of [Write Evals for an LLM App](/guides/evaluation/write-llm-evals), scaffoldable via the [llm-eval-suite-scaffolder](/skills/data/llm-eval-suite-scaffolder) skill.
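To make "a number that gates releases" concrete, the sketch below computes a pass rate over a dataset and compares a candidate against a baseline. Everything here is an assumption for illustration: `pass_rate`, `gate_release`, the `max_regression` tolerance, and the dict-shaped cases are hypothetical, and `judge` stands in for whatever scoring a team actually uses (an LLM judge, a human rubric, or an exact match).

```python
from typing import Callable, Dict, List


def pass_rate(
    cases: List[Dict[str, str]],
    run_app: Callable[[str], str],
    judge: Callable[[str, str], bool],
) -> float:
    """Fraction of cases whose output the judge accepts.

    `run_app` maps a case input to the application's output; `judge`
    decides whether that output satisfies the case's expected outcome.
    """
    if not cases:
        raise ValueError("eval dataset is empty")
    passed = sum(1 for c in cases if judge(run_app(c["input"]), c["expected"]))
    return passed / len(cases)


def gate_release(
    candidate_rate: float,
    baseline_rate: float,
    max_regression: float = 0.0,
) -> bool:
    """Allow the release only if the candidate does not fall below the
    baseline by more than the tolerated regression."""
    return candidate_rate >= baseline_rate - max_regression
```

In use, a team would run `pass_rate` once with the current prompt and once with the new one, on the same dataset version, and let `gate_release` decide whether the change ships.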

---

_Source: https://agentscamp.com/glossary/eval-dataset — Term on AgentsCamp._