LLM Eval Suite Scaffolder
Stand up an evaluation suite for an LLM feature from scratch — a representative dataset, the right metrics, a baseline score, and a CI gate — using DeepEval, promptfoo, or RAGAS. Use when a feature has no evals, before tuning a prompt, or when adding an LLM feature to CI.
Install to ~/.claude/skills/llm-eval-suite-scaffolder/SKILL.md
Turns 'we should add evals' into a running suite: it builds a representative dataset (oversampling the hard cases), picks the two or three metrics the feature is actually graded on, records a baseline, and wires the suite into CI so prompt and model changes are scored, not guessed.
The hardest part of LLM evaluation is starting. This skill scaffolds a complete, runnable eval suite for a feature — dataset, metrics, baseline, and CI wiring — using the framework that fits the stack (DeepEval for Python/pytest, promptfoo for config-driven CLI, RAGAS for RAG-specific metrics).
When to use this skill
- An LLM feature ships with no evals and you need a gate before changing it further.
- You're about to tune a prompt or swap a model and want to measure the change, not guess.
- You're adding an LLM feature to CI and need a suite that fails on regressions.
Instructions
- Pin the task and the unit of scoring. State exactly what the feature must produce and how one output is judged: exact match, JSON-schema valid, a numeric tolerance, or an LLM-as-judge rubric. An ambiguous success criterion is the real bug — resolve it first.
- Build a representative dataset. Collect 20–50 real inputs with expected behavior, deliberately oversampling hard and adversarial cases (empty input, ambiguity, the format that broke last time, the prompt-injection attempt). Freeze it under version control. For RAG, capture the gold passages too.
- Pick the few metrics that matter. Two or three the feature is actually graded on — not every metric the framework offers. Faithfulness and answer relevancy for RAG; task accuracy and format validity for extraction; a calibrated rubric (llm-as-judge-scorer) for open-ended output.
- Choose the framework and scaffold it. Generate the suite: DeepEval (pytest-style assertions), promptfoo (YAML matrix), or RAGAS (RAG metrics). Wire the dataset and metrics in, with thresholds.
- Record a baseline. Run the current/naive prompt over the full set and commit the score. Every later number is compared to this.
- Wire the CI gate. Add a
run-evalsstep that fails the build when a metric drops below threshold, so regressions are caught in PRs — see the Run Evals command.
WARNING
Don't generate hundreds of synthetic cases and call it an eval set. Twenty real, well-chosen cases — including the adversarial ones — beat a thousand bland synthetic ones. Quality and coverage of failure modes, not volume.
Output
A runnable eval suite committed to the repo: the frozen dataset, the chosen metrics with thresholds, a recorded baseline score, and a CI step that gates merges on it.
Related
- Write Evals for an LLM App: From Zero to a CI GateHow to evaluate an LLM feature — build a dataset, choose metrics, set a baseline, score offline, add an LLM judge, and gate CI so quality changes are measured.
- DeepEvalAn open-source evaluation framework for LLM apps — 'Pytest for LLMs' with ready-made metrics and CI integration.
- promptfooAn open-source CLI for testing, comparing, and red-teaming LLM prompts, models, and apps.
- LLM As Judge ScorerDesign a reliable LLM-as-judge metric — a calibrated rubric, a clear scoring scale, and bias controls — and validate it against human labels before trusting it. Use when grading open-ended LLM output (summaries, answers, tone) that exact-match can't score.
- Run EvalsRun the project's LLM evaluation suite (DeepEval, promptfoo, or RAGAS) and report scores against thresholds before a merge.
- LLM Evaluation EngineerUse this agent to make an LLM feature's quality measurable — building the dataset, choosing metrics, setting a baseline, and turning evals into a CI gate so prompt and model changes are scored, not guessed. Examples — "we changed the prompt and don't know if it's better, set up evals", "add a regression gate for our extraction feature", "our RAG quality is drifting, build an eval suite".