# Write Evals for an LLM App: From Zero to a CI Gate

> How to evaluate an LLM feature — build a dataset, choose metrics, set a baseline, score offline, add an LLM judge, and gate CI so quality changes are measured.

Evals turn 'this feels better' into a number. The method is the same whatever the feature: build a frozen dataset of real cases, pick the two or three metrics it's graded on, record a baseline, score every change offline, validate any LLM-as-judge against human labels, gate CI on the result, then monitor live traffic. Without a fixed eval set you are shipping on vibes.

You changed the prompt. Is the feature better, or did you just fix the three examples you happened to look at while quietly breaking twenty you didn't? Without evals, you cannot answer that — and LLM features regress silently, because a change that helps one input often hurts another. **Evals turn "this feels better" into a number you can defend.** This guide is the practical method, the same whether you're building extraction, RAG, an agent, or a chatbot.

## The one rule: a frozen dataset and a baseline

Everything else is detail. If you have a fixed set of cases with expected behavior and a recorded baseline score, you can measure any change. If you don't, you're guessing. So the first deliverable is never a metric or a tool — it's the **dataset**.

## Build the dataset first

Collect 20–50 representative inputs and what good output looks like for each. The instinct to generate thousands of synthetic cases is a trap: **coverage of failure modes beats volume.** Deliberately oversample the cases that break things — empty or malformed input, ambiguity, the edge case that caused last month's incident, the prompt-injection attempt. Then freeze the set under version control. A moving eval set can't measure progress.

> [!TIP]
> Twenty real, adversarial cases you understand are worth more than a thousand bland synthetic ones. Grow the set by harvesting real production failures over time, not by generating filler.

## Choose the few metrics that matter

Pick the two or three the feature is actually graded on, not every metric a framework offers:

- **Deterministic checks** — exact match, JSON-schema validity, a regex, a numeric tolerance. Cheap, fast, perfectly consistent. Use them wherever they apply.
- **RAG metrics** — faithfulness (is the answer grounded in the retrieved context?), answer relevancy, context precision/recall. (See [RAGAS](/tools/ragas) and [How RAG Actually Works](/guides/concepts/how-rag-works).)
- **LLM-as-judge** — for genuinely subjective output (tone, helpfulness, summary quality). Powerful but easy to get wrong; build it deliberately (next section).

## Score offline, then add a judge

Run your metrics over the frozen dataset to get a **baseline**, then change one variable at a time and compare. For subjective criteria, an **LLM-as-judge** scales human judgment — but only if calibrated: an explicit rubric, a small anchored scale, reference examples, and controls for length/position/self-preference bias. **Validate the judge against 20–30 human-labeled cases before you trust it** (the [llm-as-judge-scorer](/skills/data/llm-as-judge-scorer) skill walks this). An unvalidated judge is just confident noise with a number attached.

## Make it a CI gate

An eval suite you run by hand is an eval suite you'll stop running. Wire it into CI so a metric falling below threshold **fails the build** — now every prompt or model change is scored automatically, and regressions are caught in the PR. The [run-evals](/commands/testing/run-evals) command and [llm-eval-suite-scaffolder](/skills/data/llm-eval-suite-scaffolder) skill set this up; [DeepEval](/tools/deepeval) and [promptfoo](/tools/promptfoo) are built for it.

> [!WARNING]
> Never tune against the cases you report on, and never relax a threshold just to go green. A gamed suite is worse than none — it manufactures false confidence. If a threshold is genuinely wrong, change it in its own commit with a rationale.

## Then watch production

Offline evals prevent regressions; they can't predict every real-world input. After you ship, **trace production and run online evals** on a sample of live traffic to catch drift and new failure modes — then add those failures back to the offline dataset so the same bug can't return. That feedback loop is what the [llm-observability-engineer](/agents/data-ai/llm-observability-engineer) and [llm-evaluation-engineer](/agents/data-ai/llm-evaluation-engineer) own together.

For which tool to build all this on, see [Best LLM & RAG Evaluation Tools in 2026](/guides/evaluation/best-llm-eval-tools-2026).

---

_Source: https://agentscamp.com/guides/evaluation/write-llm-evals — Guide on AgentsCamp._
