Deadlock Diagnoser
Diagnose a database deadlock from the engine's own deadlock report, reconstruct the lock cycle (A holds 1 wants 2, B holds 2 wants 1), name the root cause — almost always two code paths locking the same rows in different orders — and fix it with consistent lock ordering, shorter transactions, and a retry-the-victim safeguard. Use when the DB logs deadlock errors, when transactions intermittently fail under load, or when queries mysteriously block each other.
npx agentscamp add skills/deadlock-diagnoserInstall to ~/.claude/skills/deadlock-diagnoser/SKILL.md
A deadlock is never random — it's a cycle the engine already mapped for you. This skill reads the real deadlock report (Postgres log / SQL Server graph), reconstructs who held and wanted which lock, names the inconsistent lock-acquisition order behind it, fixes the order (or shortens the transaction), and adds retry-the-victim so transient contention can't take the app down.
A deadlock looks random from the application — a transaction that worked a thousand times suddenly errors out under load — but the database already did the forensics for you. When the engine detects a cycle it picks a victim, rolls it back, and logs exactly who held what and waited on what. This skill reads that report instead of guessing: it pulls the Postgres deadlock log lines (or the SQL Server deadlock graph / innodb status in MySQL), reconstructs the cycle (A holds lock 1 and wants lock 2 while B holds 2 and wants 1), and names the real root cause — which is almost always two code paths acquiring the same rows or tables in different orders. Then it fixes the cause: enforce one consistent lock-acquisition order everywhere, shrink the lock window so the race rarely opens, and add a retry-the-victim safeguard for the deadlocks you can't design away — in that priority, because retries without ordering just trade a deadlock for a rollback storm.
When to use this skill
- The database log shows
deadlock detected(Postgres), a deadlock graph / error 1205 (SQL Server), orDeadlock found when trying to get lock(MySQL/InnoDB). - A transaction intermittently fails or auto-retries only under concurrency — fine in dev, flaky in production at peak.
- Two queries or endpoints mysteriously block each other, or you see processes stuck in a lock wait that times out.
- You're adding a write path that touches multiple rows/tables and want to confirm it locks in the same order as existing code before it ships.
- Lock contention (not a true cycle) is serializing throughput, and you need to tell genuine deadlocks apart from long lock waits.
Instructions
- Get the engine's deadlock report — don't reconstruct from app logs. In Postgres, read the server log around the error: it prints both processes, their full SQL statements, and the
Process N waits for <lockmode> on <relation/tuple>; blocked by process Mlines for each side of the cycle (raiselog_lock_waits = onanddeadlock_timeoutcontext if it's terse). In SQL Server, pull the deadlock graph from thesystem_healthExtended Events session or a trace — it lists eachprocesswith itsinputbuf(the statement) and theresource-listof locks owned vs. requested. In MySQL/InnoDB, runSHOW ENGINE INNODB STATUSand read theLATEST DETECTED DEADLOCKsection. This report is ground truth; the app's stack trace only tells you which transaction lost. - Reconstruct the cycle explicitly: who HELD what, who WANTED what. Write it out as a two-column picture —
Txn A: holds <lock on resource 1>, waits for <lock on resource 2>/Txn B: holds <lock on resource 2>, waits for <lock on resource 1>. Identify the exact resources (which rows/index ranges/tables) and the lock modes (rowFOR UPDATE/exclusive vs. shared, gap locks in InnoDB, intent locks in SQL Server). A real deadlock is a closed cycle of waits; if it's not a cycle, it's lock contention or a lock-wait timeout (step 8), which has a different fix. - Find the inconsistent acquisition ORDER — the usual root cause. Grep the codebase for every transaction that touches the resources in the cycle and trace the order each one locks them. The classic bug: one path does
UPDATE accounts WHERE id=1thenid=2, another doesid=2thenid=1(or two services lock tablesorderstheninventoryvs.inventorythenorders). Watch for ordering that's hidden — aSELECT ... FOR UPDATEwith an unorderedIN (...)or a join whose row-locking order depends on the plan, an ORM that emits writes in object-graph order, or a foreign-key check that takes a lock on the parent row you didn't write explicitly. - Fix the cause first: enforce ONE consistent lock-acquisition order across all transactions. Make every code path acquire the shared resources in the same deterministic order — sort the ids before locking (
SELECT ... FOR UPDATE ... ORDER BY id), always lock parent before child, always lock tables in a fixed documented sequence. Consistent ordering makes a cycle impossible: contenders queue instead of deadlocking. This is the only fix that actually removes the deadlock rather than reducing its odds. - Shrink the lock window so the race rarely opens. Keep transactions short and narrow: acquire locks as late as possible, commit as early as possible, and lock only the rows you'll write. Never hold a transaction open across a network/RPC/third-party-API call or across user think-time — an external call inside the transaction stretches the lock-hold from milliseconds to seconds and turns rare contention into constant deadlocks. Do the slow work before
BEGINor afterCOMMIT. - Pick a deliberate lock strategy for the access pattern, and right-size isolation. Where the same rows are contended, use pessimistic locking with
SELECT ... FOR UPDATEin the consistent order from step 4. Where conflicts are rare, prefer optimistic concurrency — aversion/updated_atcolumn checked in theWHEREof theUPDATEand a conflict-retry, which takes no long-held locks. If the engine is over-locking (e.g. Serializable or InnoDB gap locks causing deadlocks on inserts/range scans), drop to the lowest isolation level that's still correct (often Read Committed) to acquire fewer locks. - Add the retry-the-victim safeguard — last, not first. A deadlock victim's transaction is rolled back cleanly and is a transient, safe-to-retry error; the app should catch it specifically (Postgres
SQLSTATE 40P01, MySQL1213, SQL Server1205) and retry the whole transaction with capped exponential backoff and jitter (e.g. 3–5 attempts). Retry the entire transaction fromBEGIN— replaying half a rolled-back transaction corrupts state. This handles the deadlocks you can't design away; it does NOT substitute for steps 4–5. - Distinguish a true deadlock from plain lock contention before "fixing" the wrong thing. If the report shows a lock-wait timeout rather than a detected cycle, there's no ordering bug — one transaction is simply holding a lock too long (a long-running write, an idle-in-transaction connection, a missing index forcing a wide row/range lock). The fix there is shortening the holder (step 5), adding the index so the lock is narrow (
query-plan-analyzer), or killing idle-in-transaction sessions — not reordering locks.
WARNING
Adding retries WITHOUT fixing the inconsistent lock order just papers over the bug. Under load, every retry re-enters the same cycle, so you trade one deadlock for a storm of rollbacks and re-runs: throughput craters, latency spikes, and the database burns work undoing transactions. Fix the ordering first; the retry is a net for the residual, not the cure.
WARNING
A transaction that holds a lock across an external/API call (or user think-time) is the single most common way rare contention becomes constant deadlocks — the lock-hold goes from milliseconds to seconds, widening the race window enormously. Move every network call and slow computation outside the BEGIN ... COMMIT.
NOTE
Lowering isolation reduces locking but changes correctness guarantees (Read Committed allows non-repeatable reads; dropping below Serializable can reintroduce write skew). Only lower it where the access pattern is provably safe — don't trade a deadlock for a silent data anomaly.
Output
A short report with four parts:
- The reconstructed cycle — quoted from the engine's deadlock report:
Txn A holds <lock on R1>, wants <lock on R2>/Txn B holds <lock on R2>, wants <lock on R1>, with the exact resources, lock modes, and the two offending statements. - The root cause — the specific inconsistent lock-acquisition order (or over-long lock scope / over-strict isolation) behind the cycle, naming the two code paths and the resources they lock in conflicting order.
- The fix — one concrete change: the consistent ordering to enforce (with the exact
ORDER BY/ lock sequence), or the shortened-transaction change (what to move outsideBEGIN), or the isolation-level / locking-strategy change — not a menu. - The retry safeguard — the specific deadlock SQLSTATE/error code to catch and the backoff retry of the whole transaction, framed explicitly as the net for residual deadlocks, not the primary fix.
Frequently asked questions
- If I just retry the deadlock victim, isn't the problem solved?
- No — retries are a safety net, not the fix. Deadlocks can never be 100% eliminated, so every app SHOULD retry the victim with backoff because it's a transient, safe-to-retry error. But retries layered on top of inconsistent lock ordering only convert deadlocks into a storm of rollbacks and re-runs under load: throughput collapses, latency spikes, and you've hidden the real bug instead of fixing it. Fix the lock-acquisition order first; keep the retry as the last-resort guard for the deadlocks you genuinely can't design away.
- Why do two transactions deadlock when each touches the same rows?
- Touching the same rows is fine; touching them in different ORDERS is what deadlocks. If transaction A updates row 1 then row 2, and transaction B updates row 2 then row 1, they can each grab their first lock and then block forever waiting on the other's — a cycle. The same two transactions that always acquire rows in the same order (e.g. ascending by primary key) simply queue: one waits, the other commits, then the first proceeds. Consistent ordering turns a deadlock into a harmless short wait.
Related
- Query Plan AnalyzerRead a slow query's execution plan and turn it into a concrete fix — the exact index to add, the rewrite, or the ANALYZE to run — by getting the REAL plan with EXPLAIN ANALYZE (actual rows + timing, not estimates), finding the offending node, and confirming the fix removes it. Use when one specific query is slow and you need to know WHY, not just that it is.
- Connection Pool TunerSize and tune a database connection pool from the real constraint — the database's shared max_connections and its core count — so total connections (per-instance pool × instance count) stay safely under the cap and a too-large pool stops adding latency. Use when the app throws 'too many connections' or pool-acquire timeouts, when the DB is saturated by connection count, or when deploying to serverless.
- Migration WriterWrite a safe, reversible, zero-downtime database migration using expand-contract — add the new shape, backfill in batches, switch reads/writes, then drop the old — so every deploy stays compatible with the running app version. Use when adding or changing schema on a live system, renaming/dropping a column, adding NOT NULL or a foreign key on a large table, or when a migration risks locks, table rewrites, or an unrevertable step.