Write Evals for an LLM App: From Zero to a CI Gate
How to evaluate an LLM feature — build a dataset, choose metrics, set a baseline, score offline, add an LLM judge, and gate CI so quality changes are measured.
Evals turn 'this feels better' into a number. The method is the same whatever the feature: build a frozen dataset of real cases, pick the two or three metrics it's graded on, record a baseline, score every change offline, validate any LLM-as-judge against human labels, gate CI on the result, then monitor live traffic. Without a fixed eval set you are shipping on vibes.
Steps at a glance
- Define the scoring unit. State exactly what a correct output is and how one is judged — exact match, schema-valid, numeric tolerance, or a rubric. Resolve ambiguity before writing a metric.
- Build and freeze a dataset. Collect 20–50 representative inputs with expected behavior, oversampling hard and adversarial cases (empty input, ambiguity, the format that broke last time). Commit it under version control.
- Choose the few metrics that matter. Pick the two or three the feature is graded on. Prefer deterministic checks (exact match, schema validity) where they apply; use an LLM-as-judge only for genuinely subjective criteria.
- Record a baseline. Run the current or naive system over the full dataset and save the score. Every later result is compared to this number.
- Validate any LLM judge. If you use an LLM-as-judge, give it an explicit rubric and reference examples, then check its agreement against 20–30 human-labeled cases before relying on it.
- Gate CI. Wire the suite into CI so a metric dropping below threshold fails the build — quality changes are caught in PRs, not in production.
- Monitor and feed back. Trace production, score a sample of live traffic, and add real failures to the eval dataset so the same bug can't return.
Key takeaways
- Build the dataset first and freeze it — 20–50 real cases, oversampling the hard and adversarial ones, beat a thousand synthetic ones.
- Pick the two or three metrics the feature is actually graded on; prefer deterministic checks over an LLM judge where possible.
- Record a baseline, then change one variable at a time and compare against it.
- An LLM-as-judge must be calibrated and validated against human labels before you trust its scores.
- Make evals a CI gate, then monitor production and feed real failures back into the dataset.
You changed the prompt. Is the feature better, or did you just fix the three examples you happened to look at while quietly breaking twenty you didn't? Without evals, you cannot answer that — and LLM features regress silently, because a change that helps one input often hurts another. Evals turn "this feels better" into a number you can defend. This guide is the practical method, the same whether you're building extraction, RAG, an agent, or a chatbot.
The one rule: a frozen dataset and a baseline
Everything else is detail. If you have a fixed set of cases with expected behavior and a recorded baseline score, you can measure any change. If you don't, you're guessing. So the first deliverable is never a metric or a tool — it's the dataset.
Build the dataset first
Collect 20–50 representative inputs and what good output looks like for each. The instinct to generate thousands of synthetic cases is a trap: coverage of failure modes beats volume. Deliberately oversample the cases that break things — empty or malformed input, ambiguity, the edge case that caused last month's incident, the prompt-injection attempt. Then freeze the set under version control. A moving eval set can't measure progress.
TIP
Twenty real, adversarial cases you understand are worth more than a thousand bland synthetic ones. Grow the set by harvesting real production failures over time, not by generating filler.
Choose the few metrics that matter
Pick the two or three the feature is actually graded on, not every metric a framework offers:
- Deterministic checks — exact match, JSON-schema validity, a regex, a numeric tolerance. Cheap, fast, perfectly consistent. Use them wherever they apply.
- RAG metrics — faithfulness (is the answer grounded in the retrieved context?), answer relevancy, context precision/recall. (See RAGAS and How RAG Actually Works.)
- LLM-as-judge — for genuinely subjective output (tone, helpfulness, summary quality). Powerful but easy to get wrong; build it deliberately (next section).
Score offline, then add a judge
Run your metrics over the frozen dataset to get a baseline, then change one variable at a time and compare. For subjective criteria, an LLM-as-judge scales human judgment — but only if calibrated: an explicit rubric, a small anchored scale, reference examples, and controls for length/position/self-preference bias. Validate the judge against 20–30 human-labeled cases before you trust it (the llm-as-judge-scorer skill walks this). An unvalidated judge is just confident noise with a number attached.
Make it a CI gate
An eval suite you run by hand is an eval suite you'll stop running. Wire it into CI so a metric falling below threshold fails the build — now every prompt or model change is scored automatically, and regressions are caught in the PR. The run-evals command and llm-eval-suite-scaffolder skill set this up; DeepEval and promptfoo are built for it.
WARNING
Never tune against the cases you report on, and never relax a threshold just to go green. A gamed suite is worse than none — it manufactures false confidence. If a threshold is genuinely wrong, change it in its own commit with a rationale.
Then watch production
Offline evals prevent regressions; they can't predict every real-world input. After you ship, trace production and run online evals on a sample of live traffic to catch drift and new failure modes — then add those failures back to the offline dataset so the same bug can't return. That feedback loop is what the llm-observability-engineer and llm-evaluation-engineer own together.
For which tool to build all this on, see Best LLM & RAG Evaluation Tools in 2026.
Frequently asked questions
- How do I evaluate an LLM application?
- Build a frozen dataset of representative inputs with expected behavior, choose the two or three metrics the feature is graded on, record a baseline score, then score every prompt or model change against the dataset. For subjective outputs, use a calibrated LLM-as-judge validated against human labels. Finally, gate CI on the metrics and monitor production. The key is a fixed dataset and a baseline — without them, 'better' is just a feeling.
- How many test cases do I need for an LLM eval?
- Start with 20–50 real, well-chosen cases. Coverage of failure modes matters far more than volume — deliberately include the hard, ambiguous, and adversarial inputs that break things. Twenty thoughtful cases beat a thousand bland synthetic ones; grow the set over time by adding real production failures.
- Is LLM-as-a-judge reliable?
- It can be, if you build it carefully: an explicit rubric, a small discrete scale with anchors, reference examples, and controls for known biases (length, position, self-preference). Crucially, validate it against human labels before trusting it — an uncalibrated judge is confident noise. Prefer a deterministic check whenever one applies.
- What's the difference between offline evals and online evals?
- Offline evals run a fixed dataset before you ship — they're your regression gate in CI. Online evals score a sample of real production traffic after you ship — they catch drift and failure modes your dataset didn't cover. You need both: offline to prevent regressions, online to discover what to add to the offline set.
Related
- Best LLM & RAG Evaluation Tools in 2026: DeepEval vs RAGAS vs LangSmith vs Phoenix vs promptfooA decision guide to the LLM eval landscape — code-first frameworks vs. eval-and-observability platforms, open-source vs. hosted, and which fits your stack.
- LLM Evaluation EngineerUse this agent to make an LLM feature's quality measurable — building the dataset, choosing metrics, setting a baseline, and turning evals into a CI gate so prompt and model changes are scored, not guessed. Examples — "we changed the prompt and don't know if it's better, set up evals", "add a regression gate for our extraction feature", "our RAG quality is drifting, build an eval suite".
- LLM Eval Suite ScaffolderStand up an evaluation suite for an LLM feature from scratch — a representative dataset, the right metrics, a baseline score, and a CI gate — using DeepEval, promptfoo, or RAGAS. Use when a feature has no evals, before tuning a prompt, or when adding an LLM feature to CI.
- LLM As Judge ScorerDesign a reliable LLM-as-judge metric — a calibrated rubric, a clear scoring scale, and bias controls — and validate it against human labels before trusting it. Use when grading open-ended LLM output (summaries, answers, tone) that exact-match can't score.
- Run EvalsRun the project's LLM evaluation suite (DeepEval, promptfoo, or RAGAS) and report scores against thresholds before a merge.
- Prompt EngineerUse this agent to design and iterate the prompts behind an LLM-powered product feature — instructions, few-shot examples, tool schemas, and the evals that prove a change actually helped. Examples — "this classification prompt is flaky, make it reliable", "design the system prompt and function schema for our support agent", "our extraction prompt regressed after I tweaked it, set up evals so this stops happening".
- Finetuning EngineerUse this agent to fine-tune an open-weight model end to end — confirming fine-tuning is the right tool, preparing the dataset, choosing the method (LoRA/QLoRA vs. full), running training, and proving the result beats the prompted baseline on a held-out eval set. Examples — "fine-tune a small model to match our support tone and answer format", "we have 800 labeled examples — LoRA-tune and show it beats prompting", "our fine-tune overfits and forgot general ability — fix the data and run".
- LLM Observability EngineerUse this agent to make a production LLM app observable — tracing every step, scoring live traffic with online evals, and monitoring quality, cost, and latency — so you can debug agent runs and catch regressions in production. Examples — "add tracing to our RAG/agent so we can debug bad answers", "set up online evals and cost/latency dashboards", "production quality is slipping and we're flying blind".
- Preparing a Fine-Tuning Dataset: Cleaning, Synthetic Data, and Eval SplitsThe dataset is the model. How to build a fine-tuning dataset that works — format, curation, cleaning, synthetic augmentation, and a leak-free eval split.
- DeepEvalAn open-source evaluation framework for LLM apps — 'Pytest for LLMs' with ready-made metrics and CI integration.