Mutation Test Runner
Measure whether a test suite actually catches bugs by running mutation testing — introduce small faults into the code and check which ones a test kills versus which slip through silently. Use when line coverage is high but bugs still ship, when you suspect tests assert weakly, or to find the exact assertions a suite is missing.
npx agentscamp add skills/mutation-test-runnerInstall to ~/.claude/skills/mutation-test-runner/SKILL.md
Coverage proves a line ran, not that a test would notice if it broke. This skill runs mutation testing — Stryker, mutmut, PIT — to plant small faults and see which the suite kills, reads each surviving mutant as a missing assertion, and outputs the precise tests to add, scoped to the diff so it actually finishes.
Line coverage tells you a line ran during a test. It does not tell you the test would fail if that line were wrong — a function can be 100% covered by an assertion-free test. Mutation testing closes that gap: it plants small faults in the code (flip > to >=, swap + for -, drop a statement, negate a condition) and re-runs the suite against each one. A mutant that makes a test fail is killed — the suite pins that behavior. A mutant that passes everything survives — no test noticed the code changed, so that behavior is unprotected. This skill runs a mutation tool, reads the survivors as a precise to-do list of missing assertions, and tells you exactly which tests to add to kill them.
When to use this skill
- Coverage is high (80–100%) but bugs still slip into production — the classic symptom of covered-but-unasserted code.
- You inherited or reviewed a suite and suspect the tests assert weakly (snapshot-only, no return-value checks,
toBeDefinedinstead oftoEqual). - A module is critical (auth, money, parsing, pricing) and you want proof the suite would catch a regression, not just that it touches the lines.
- You're hardening a specific change and want the missing assertions for that diff, not a repo-wide audit.
WARNING
100% line coverage with surviving mutants is the false confidence this skill exists to expose: the code runs in a test, but no assertion would fail if the code were wrong. A green coverage badge is not a green mutation score.
Instructions
- Pick the tool for the language — don't guess, check what's installed. Inspect deps and config first:
- JS/TS: Stryker (
@stryker-mutator/core, configstryker.conf.json/.mjs); it auto-detects Jest/Vitest/Mocha runners. - Python: mutmut (
mutmut run, config insetup.cfg/pyproject.toml) or cosmic-ray for larger suites. - Java/Kotlin: PIT (
pitest, Maven/Gradle plugin). Go: go-mutesting or gremlins. Ruby: mutant. C#: Stryker.NET. - If no tool is installed, recommend the standard one for the stack and stop there — do not silently add a dev dependency.
- JS/TS: Stryker (
- Scope the run to changed code — this is mandatory, not an optimization. Mutation testing re-runs the full suite once per mutant, so a repo can take hours. Target the diff or a single package: Stryker
--mutate "src/pricing/**/*.ts"(or--since mainon recent versions), mutmut--paths-to-mutate src/billing/, PITtargetClassesset to the changed package. State the chosen paths up front so the run is reproducible. - Run and collect the surviving mutants, not the summary number. Execute the tool and read its detailed report (Stryker's
mutation.html/--reporter json, mutmutmutmut results+mutmut show <id>, PIT'smutations.xml). For each survivor capture: file, line, the original code, and the exact mutation that lived (e.g.boundary: changed >= to >orremoved call to logAudit()). - Triage each survivor: real gap or equivalent mutant. An equivalent mutant changes the code without changing observable behavior — e.g.
i <= n-1vsi < n, reordering commutative operations, mutating a value that's overwritten before use. These cannot be killed by any test; mark themequivalent — ignorewith a one-line reason and move on. Everything else is a genuine gap: a behavior your tests don't constrain. - For each real survivor, name the assertion that would kill it. This is the payoff. A survived
changed > to >=on a discount threshold means no test exercises the exact boundary — propose "applyDiscount(qty=10)where the rule isqty > 10: assert no discount at exactly 10." A survivedremoved call to audit()means nothing asserts the side effect — propose "assertauditLogreceived one entry aftertransfer()." Write the input and the expected behavior, not "add a test for line 42." - Group survivors by file and track the score where it's worth defending. Report the mutation score (killed / total non-equivalent) per scoped path as a baseline to hold or raise on critical modules, never as a vanity 100% target — chasing the last few percent usually means fighting equivalent mutants. Record the baseline so the next run can detect regressions.
NOTE
Two survivors that share a root cause often need one assertion. A function where every arithmetic and boundary mutant survives usually has a single test that calls it and asserts only that it didn't throw — adding one real return-value assertion can kill the whole cluster at once.
WARNING
If a mutation run "passes" with zero survivors but also shows mutants marked no coverage or timeout, the suite isn't strong — those mutants were never actually tested. No-coverage mutants are a coverage gap (hand them to coverage-gap-finder); timeouts often mean a mutant created an infinite loop the suite can't detect. Don't read them as kills.
Output
A survivor report grouped by file, plus the run scoping so it's reproducible:
Scope: src/billing/** (mutated 47 mutants, 90s)
Mutation score: 81% (34 killed / 42 non-equivalent) — baseline, hold >=80 on billing
src/billing/discount.ts
SURVIVED L23 changed `qty > 10` -> `qty >= 10` [BOUNDARY]
Gap: no test hits the exact threshold.
Add: applyDiscount({ qty: 10 }) -> assert price unchanged (no discount at boundary)
SURVIVED L31 removed call to `roundCents(total)` [STATEMENT]
Gap: nothing asserts the rounded result.
Add: applyDiscount({ qty: 12, price: 3.337 }) -> assert total === 33.37 (not 33.3696)
src/billing/invoice.ts
SURVIVED L58 changed `&&` -> `||` in isOverdue guard [LOGICAL]
Gap: only the both-true case is tested.
Add: isOverdue({ pastDue: true, paid: true }) -> assert false
EQUIVALENT L72 `i <= len-1` -> `i < len` — ignore (same iteration count)
No-coverage: 5 mutants in src/billing/legacy.ts -> route to coverage-gap-finder (not killed).
Each surviving line is a missing assertion; the Add: lines are concrete enough to hand straight to a test scaffolder. Re-run the same scope after adding them to confirm the survivors flip to killed and the score holds.
Frequently asked questions
- How is this different from line coverage?
- Coverage records that a line executed during a test. Mutation testing changes that line and checks whether any test fails — proving the test asserts on the behavior, not just touches it. 100% coverage with surviving mutants is exactly the false confidence this catches.
- Why not just run it across the whole repo?
- Mutation testing re-runs the suite once per mutant, so a full repo can take hours and quietly never get run. Scope every run to the changed files or one package; that keeps it fast enough to live in the loop.
Related
- Coverage Gap FinderRun the project's coverage tool and identify the highest-value untested paths — error branches, edge cases, and critical modules — then propose specific test cases for each gap. Use when you have a coverage report but don't know where new tests will pay off most.
- Property Test DesignerDesign property-based tests — generate hundreds of random inputs and assert invariants that must hold for ALL of them — to surface the edge cases hand-picked examples never reach. Use when code has a large input space (parsers, serializers, encoders, math, data transforms), when a bug keeps slipping through despite green example tests, or when you can't enumerate every case worth checking.
- Contract Test DesignerDesign consumer-driven contract tests between services so an API provider can't break its consumers unnoticed — without slow, flaky full end-to-end environments. Use when independent services or teams integrate over an API, when integration bugs only surface in staging or prod, or when E2E suites are too slow and brittle to catch breaking API changes.
- Agent Trajectory EvaluatorEvaluate a multi-step AI agent's whole run — tool calls, intermediate steps, and final result — not just final-answer correctness, so you can pinpoint WHERE it went wrong. Use when building or debugging a tool-using or multi-step agent, when final-answer-only evals can't explain failures, or when a prompt/model change quietly makes the agent less efficient or more error-prone even though the answer still looks right.