Prompt Regression Tester
Build a regression test harness for an LLM prompt so a prompt edit or model upgrade can't silently degrade quality — a fixed eval set, checkable assertions, and a diff against a committed baseline. Use when changing a production prompt, migrating model versions, or any time 'I tweaked the prompt' needs to be backed by evidence instead of eyeballing two outputs.
npx agentscamp add skills/prompt-regression-tester
Install to ~/.claude/skills/prompt-regression-tester/SKILL.md
Eyeballing two outputs isn't testing a prompt change. This skill builds a real regression harness: a fixed eval set (including the cases that previously broke), checkable assertions per input, an optional LLM-judge for genuinely subjective qualities, and a diff against a committed baseline that flags every regression before you ship the edit.
A prompt edit that "looks fine on a couple of examples" is the single most common way teams ship a quality regression. The fix is not heroic — it's a fixed eval set, assertions a machine can check, and a baseline you diff against. This skill builds that harness so any prompt change or model migration produces a pass/fail + regression report instead of a vibe.
When to use this skill
- You're about to edit a production prompt and want proof the edit doesn't regress existing behavior.
- You're migrating models (e.g. claude-opus-4-7 → claude-opus-4-8, or onto a new provider) and need to confirm output quality holds.
- A prompt regressed in the past and you want a committed test that would have caught it.
- You're standing up a new prompt and want defensible coverage before it goes live.
- Someone "improved" a prompt and you need to confirm it actually improved rather than traded one failure for another.
Instructions
- Find the prompt and its call site first. Grep for the system/user prompt text, the model ID (claude-, gpt-, model=, model:), and the SDK call (messages.create, chat.completions, etc.). Pin down exactly what's under test: the prompt string, the model ID, and every generation parameter (effort/temperature/max_tokens/tools). The harness must reproduce that call exactly; a regression test that uses different params than production tests the wrong thing.
- Assemble a FIXED eval set, and freeze it. Collect 15–40 representative inputs into a version-controlled file (evals/cases.jsonl, one JSON object per line: {id, input, ...}). Deliberately include the happy-path cases, the boundary cases, and, most importantly, every input that has previously failed (past bug reports, support tickets, the example that motivated the last prompt edit). The set is an asset; it only grows by deliberate commit, never shrinks or mutates per run.
  WARNING: An eval set that changes between runs is not a regression test; it's a fresh experiment each time, and the diff against baseline is meaningless. Do not generate inputs on the fly, sample randomly, or pull "the latest N production logs" at runtime. Commit the inputs.
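The frozen-set rule above can be enforced at load time. The following is a minimal sketch, assuming the one-JSON-object-per-line layout described for evals/cases.jsonl; the function names `load_cases` and `fingerprint` are hypothetical, and the idea of storing a hash next to the results is an illustration, not something the skill prescribes.

```python
import hashlib
import json


def load_cases(text: str) -> list[dict]:
    """Parse JSONL text into cases, rejecting malformed or duplicate entries."""
    cases, seen = [], set()
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines between cases
        case = json.loads(line)
        for key in ("id", "input"):
            if key not in case:
                raise ValueError(f"line {lineno}: missing required key {key!r}")
        if case["id"] in seen:
            raise ValueError(f"line {lineno}: duplicate id {case['id']!r}")
        seen.add(case["id"])
        cases.append(case)
    return cases


def fingerprint(cases: list[dict]) -> str:
    """Stable hash of the eval set (hypothetical helper).

    Key order inside a case does not affect the hash; case order does.
    """
    canonical = json.dumps(cases, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Recording the fingerprint alongside each run's results would let a later diff refuse to compare two runs that used different inputs.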
- Write checkable assertions per case, and prefer deterministic over judged. For each input, specify what "correct" means as machine-checkable expectations, picking the tightest check that fits:
  - exact: output equals a string (classification labels, enum values).
  - contains / not_contains: a required substring is present / a forbidden phrase (a leaked secret, a banned disclaimer, "As an AI") is absent.
  - regex: a format pattern (an order ID shape, a date, a currency string).
  - json_schema: output parses as JSON and validates against a schema (the highest-value check for structured-output prompts; catches missing fields, wrong types, extra keys).
  - structural: list length, required keys present, sorted order, no duplicates.
  Store assertions alongside each case. A single case may carry several. Reserve an LLM-as-judge only for qualities no code can decide (tone, helpfulness, faithfulness to a source, "did it refuse appropriately"), and even then express the judge as a rubric with a discrete verdict (pass/fail, or a 1–5 score with a threshold), not a freeform "rate this."
- Default to the latest, most capable model for both the system-under-test and the judge. Use claude-opus-4-8 (current most-capable Opus; 1M context, adaptive thinking) unless the prompt's production config pins a different model, in which case test that model and add claude-opus-4-8 as a comparison column. For the judge, also default to claude-opus-4-8: a weak judge is a noisy judge. Use adaptive thinking (thinking: {type: "adaptive"}) for the judge; do not send temperature/top_p/budget_tokens (removed on Opus 4.7/4.8; sending them returns a 400). Keep the judge model separate and named in the report so a judge swap is never confused with a quality change.
- Run the set across each variant under test and score every case. A "variant" is a (prompt, model, params) tuple, typically baseline (current production) vs candidate (your edit). Iterate the frozen cases, call the real SDK with the exact production config, run each assertion, and record a per-case result: pass/fail, which assertion failed, and the raw output. Run candidate and baseline in the same invocation so they see identical inputs. Cache or store raw outputs so a re-score (e.g. after fixing an assertion) doesn't require re-generating.
- Diff against the committed baseline and flag regressions. Snapshot the baseline results to evals/baseline.json (committed). On each run, compare candidate-vs-baseline per case and classify:
  - Regression: passed in baseline, fails now. (The line that blocks the PR.)
  - Fix: failed in baseline, passes now. (The intended win; confirm it's real, not luck.)
  - Unchanged pass / unchanged fail: note but don't alarm.
  WARNING: Don't treat "candidate pass-rate ≥ baseline pass-rate" as green. Aggregate rate hides swaps: a candidate can fix two cases and break two others for the same headline number. The per-case regression list is the signal; the aggregate is a summary, not the gate.
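The per-case classification above can be sketched as follows. This assumes results are reduced to a mapping from case id to a pass boolean; the function name `classify` and the extra "new" bucket (for cases added since the baseline was accepted) are assumptions made for illustration.

```python
def classify(baseline: dict[str, bool], candidate: dict[str, bool]) -> dict[str, list[str]]:
    """Bucket each candidate case by how its result moved relative to baseline."""
    buckets: dict[str, list[str]] = {
        "regression": [],      # passed in baseline, fails now
        "fix": [],             # failed in baseline, passes now
        "unchanged_pass": [],
        "unchanged_fail": [],
        "new": [],             # assumption: case has no baseline entry yet
    }
    for case_id in sorted(candidate):
        now = candidate[case_id]
        if case_id not in baseline:
            buckets["new"].append(case_id)
        elif baseline[case_id] and not now:
            buckets["regression"].append(case_id)
        elif not baseline[case_id] and now:
            buckets["fix"].append(case_id)
        elif now:
            buckets["unchanged_pass"].append(case_id)
        else:
            buckets["unchanged_fail"].append(case_id)
    return buckets


# Same 2/4 pass rate on both sides, yet two cases regressed.
baseline = {"a": True, "b": True, "c": False, "d": False}
candidate = {"a": False, "b": False, "c": True, "d": True}
result = classify(baseline, candidate)
gate_failed = bool(result["regression"])  # gate on the list, not the rate
```

In this example the aggregate pass rate is identical for both variants, which is exactly the swap the warning describes; only the regression list exposes it.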
- Make the baseline a deliberate, reviewed artifact. Update evals/baseline.json only via an explicit "accept" step (a --update-baseline flag or a separate command), and commit it in the same PR as the prompt change so a reviewer sees both the new prompt and exactly which case results moved. Never let the harness silently rewrite the baseline on every run; that's how a regression gets absorbed into "the new normal" unnoticed.
  NOTE: LLM outputs vary run-to-run even at fixed settings, so a single judged case flipping isn't necessarily a real regression. For judge-scored cases, run N=3 and require a majority verdict before flagging; for deterministic assertions, one failure is one failure, and no repetition is needed.
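The majority rule for judged cases could look like the sketch below. The `judge` argument stands in for whatever function calls the judge model and returns "pass" or "fail"; that callable, the function names, and the choice to treat a tie as a fail are all assumptions for illustration.

```python
from collections.abc import Callable


def majority_verdict(verdicts: list[str]) -> str:
    """Return "pass" only if strictly more than half of the verdicts are "pass"."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    passes = sum(1 for v in verdicts if v == "pass")
    # Ties (possible only with an even count) are treated as "fail".
    return "pass" if passes > len(verdicts) / 2 else "fail"


def judged_verdict(judge: Callable[[str], str], output: str, n: int = 3) -> str:
    """Call the (hypothetical) judge n times on the same output and take the majority."""
    return majority_verdict([judge(output) for _ in range(n)])
```

An odd N avoids ties altogether, which is one reason N=3 is a convenient default.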
Output
The skill produces a committed, runnable harness:
- Layout:

      evals/
        cases.jsonl     # frozen inputs + per-case assertions (version-controlled)
        baseline.json   # accepted baseline results (version-controlled)
        run.(py|ts)     # loads cases, runs each variant, scores, diffs
        schema/*.json   # json_schema assertions, if any

- The assertion set: each case lists its checks, e.g.

      {"id": "refund-001", "input": "...", "assert": [
        {"type": "json_schema", "schema": "schema/refund.json"},
        {"type": "not_contains", "value": "As an AI"},
        {"type": "judge", "rubric": "Output politely declines without admitting fault.", "threshold": "pass"}
      ]}

- A pass/fail + regression report, printed and written to evals/report.md:

      Variant: candidate (prompt@HEAD, claude-opus-4-8) vs baseline (prompt@main, claude-opus-4-8)
      Judge:   claude-opus-4-8
      42 cases | 39 pass / 3 fail   (baseline: 40 pass / 2 fail)

      REGRESSIONS (2)   ← blocks merge
        refund-001   json_schema: missing field "reason"
        tone-014     judge(2/3): newly apologetic where baseline was neutral

      FIXES (1)
        parse-007    regex: now matches order-id format

      Exit code: 1 (regressions present)
A non-zero exit on any regression makes the harness drop straight into CI. Report file paths back as absolute paths so the user can wire it up.
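For the string-level assertion types named in the instructions (exact, contains, not_contains, regex), a checker can be very small. This is a sketch under assumptions: it reuses the {"type", "value"} shape from the example case, the names `check` and `failed_assertions` are hypothetical, "exact" is compared after trimming surrounding whitespace, and json_schema, structural, and judge checks are deliberately left out (unknown types raise rather than silently pass).

```python
import re


def check(assertion: dict, output: str) -> bool:
    """Evaluate one deterministic assertion against a model output."""
    kind = assertion["type"]
    value = assertion.get("value", "")
    if kind == "exact":
        return output.strip() == value  # assumption: ignore surrounding whitespace
    if kind == "contains":
        return value in output
    if kind == "not_contains":
        return value not in output
    if kind == "regex":
        return re.search(value, output) is not None
    raise ValueError(f"unsupported assertion type: {kind!r}")


def failed_assertions(assertions: list[dict], output: str) -> list[str]:
    """Return the types of the assertions that failed; empty means the case passes."""
    return [a["type"] for a in assertions if not check(a, output)]


# Hypothetical case: an order-id format check plus a forbidden phrase.
checks = [
    {"type": "regex", "value": r"\bORD-\d{6}\b"},
    {"type": "not_contains", "value": "As an AI"},
]
failures = failed_assertions(checks, "Your order ORD-123456 has shipped.")
```

Returning the failing assertion types, rather than a single boolean, gives the report the "which assertion failed" detail that each per-case result is meant to record.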
Frequently asked questions
- Do I need an LLM-as-judge for every case?
- No — reserve it for genuinely subjective qualities (tone, helpfulness, faithfulness). Anything checkable (a JSON field, a required substring, a forbidden phrase, a structural shape) should be a deterministic assertion. Judges are slower, costlier, and noisier; use them only where code can't decide.
- Why diff against a baseline instead of just asserting pass/fail?
- A pure pass/fail bar misses silent drift — a case that still passes but got worse, or a case that started passing for the wrong reason. The committed baseline makes every change reviewable in a PR, so a regression is a visible red line, not a number nobody compared.
Related
- Create Skill: Scaffold a new Claude Code skill into .claude/skills/<name>/SKILL.md, a model-invoked capability with a trigger-rich description, scoped tools, and a lean body that pushes detail into resource files.
- Scaffold RAG Pipeline: Scaffold a Retrieval-Augmented Generation pipeline, covering ingestion (load, chunk, embed, upsert) and retrieval (search, rerank, grounded prompt with citations), fitted to the project's stack.
- Token Usage Profiler: Measure and attribute LLM token usage and cost across an app (input vs output tokens by feature, route, model, and tenant), then rank the waste and the specific lever to cut it. Use when LLM spend is high or climbing with no clear cause, before scaling a feature that calls a model, or when you need per-feature or per-tenant cost attribution for billing or budgets.
- Testing LLM Applications: How to Test Non-Deterministic Software. How to test software that calls LLMs when outputs are non-deterministic: the testing pyramid, assertion strategies, golden datasets, and CI gating.