Query Plan Analyzer
Read a slow query's execution plan and turn it into a concrete fix — the exact index to add, the rewrite, or the ANALYZE to run — by getting the REAL plan with EXPLAIN ANALYZE (actual rows + timing, not estimates), finding the offending node, and confirming the fix removes it. Use when one specific query is slow and you need to know WHY, not just that it is.
npx agentscamp add skills/query-plan-analyzerInstall to ~/.claude/skills/query-plan-analyzer/SKILL.md
A slow query is rarely slow for the reason you assume. This skill reads the REAL plan (EXPLAIN ANALYZE + BUFFERS, not bare EXPLAIN guesses), works from the most expensive node outward, names the real culprit — Seq Scan, stale-stats row gap, Nested Loop blowup, disk Sort, an index defeated by a cast — gives the specific fix, and re-runs to prove the bad node is gone.
A slow query is almost never slow for the reason you'd guess from reading the SQL. The plan is the ground truth: it shows the database actually chose a Seq Scan over the 40-million-row table, actually fed 500,000 rows into a Nested Loop that estimated 5, actually sorted on disk because no index could supply the order. This skill pulls the real plan — EXPLAIN ANALYZE with BUFFERS, not bare EXPLAIN — reads it from the most expensive node outward, names the one node that's costing the time and why, and turns that into a specific fix: the index to add (with the right column order), the rewrite that makes the predicate sargable, or the ANALYZE that fixes the estimate. Then it re-runs the plan to prove the bad node is gone instead of declaring victory from theory.
When to use this skill
- One specific query (an endpoint, a report, a dashboard panel) is slow and you need the cause, not a vague "add some indexes."
- A query that was fast got slow after a data-volume change, a deploy, or a schema/index change.
- The planner is doing something surprising — a Seq Scan despite an index existing, or ignoring the index you just added.
- p99 latency on one query is high while the table and load look unremarkable, and you suspect the plan rather than the hardware.
- Before shipping a new query or a
migration-writerindex change, to verify the plan is what you intended.
Instructions
- Get the table shape and existing indexes before touching the plan. Read the schema for the queried tables: column types, the existing indexes and their column order, row counts (
SELECT reltuples FROM pg_class, or\d+), and whether stats are fresh (pg_stat_user_tables.last_analyze/n_mod_since_analyze). Grep the codebase for where the query is built so you tune the real SQL (including how parameters bind), not a hand-typed approximation. - Run the REAL plan with actual rows, timing, and I/O — never bare EXPLAIN. Use
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)in Postgres (ANALYZE FORMAT=TREE/EXPLAIN ANALYZEin MySQL 8+).ANALYZEexecutes the query and reports actual rows + per-nodeactual time;BUFFERSshows shared/local hits vs. reads (heavyread=means I/O, not CPU, is the cost). Run it 2–3 times so a cold-cache first run doesn't masquerade as a planning problem. For a write query, wrap it in a transaction andROLLBACKsoANALYZEdoesn't mutate data. - Read from the most expensive node outward — find where the time actually is. In the text plan,
actual time=start..endis cumulative and inclusive of children; the time a node adds is its end-time minus its children's. Find the deepest/innermost node whoseactual timeandloops × rowsdominate the total. That node — not the top of the plan — is what you fix. Note itsactual rows,loops, and theRows Removed by Filterline. - Check the estimate-vs-actual gap FIRST — a wide gap means stale stats, and that's the real bug. Compare each node's estimated rows (
rows=) toactual rows. A gap of more than ~10x (e.g. plans for 5 rows, processes 50,000) means the planner is choosing strategy on bad information — usually stale statistics. Fix this before adding any index: runANALYZE <table>;(orANALYZEthe whole DB) and re-pull the plan. Often the plan corrects itself once estimates are right, and an index you'd have added would have been the wrong one. - Match the symptom to the culprit, then to the fix:
- Seq Scan on a large table with a selective predicate → the predicate filters to few rows but there's no usable index. Add a b-tree on the filtered column(s). (A Seq Scan returning most of the table is correct — don't index it.)
- Nested Loop with high
loopsover many outer rows → the join is iterating per-row when it should batch. The cause is usually a bad row estimate (see step 4) or a missing join-key index; a corrected estimate or an index on the inner join column lets the planner pick a Hash/Merge Join. - Sort (especially
Sort Method: external merge Disk:) → the query sorts at runtime and spills to disk. A b-tree index in theORDER BYorder can supply rows pre-sorted, removing the Sort node entirely (and poweringLIMITearly-exit). - High
Rows Removed by Filter→ the database fetched far more rows than it kept; the filter ran after the scan instead of being pushed into an index. Move the discriminating column into the index so it's a condition, not a post-filter. - Heavy
Buffers: ... read=→ the working set isn't cached; a smaller/covering index reduces pages touched, or the data genuinely doesn't fit memory.
- Check index sargability — an index the predicate can't use is no fix at all. A b-tree is defeated by a function or cast on the column (
lower(email) = ?,date(created_at) = ?,col::text = ?), by a leading-wildcardLIKE '%x', and by anORacross different columns. The fix is a matching expression index (CREATE INDEX ... ON t (lower(email))), a rewrite to a range (created_at >= d AND created_at < d+1), orUNION-ing theORbranches — not a plain index on the raw column. - Order multi-column index columns for the predicate, then the sort. Put equality-predicate columns first (leftmost), then the range/inequality column, then
ORDER BYcolumns — so one index serves both the filter and the ordering. A column used only for a range can't have an equality column usefully placed after it. State the exactCREATE INDEXDDL, includingINCLUDEd columns if a covering index would turn an Index Scan into an Index-Only Scan. - Re-run
EXPLAIN ANALYZEafter the fix and confirm the bad node is gone. Apply the fix (in Postgres, build the indexCONCURRENTLYto avoid a write lock;migration-writercan wrap the DDL). Re-pull the plan and verify the offending node changed type (Seq Scan → Index Scan, Nested Loop → Hash Join, Sort → no Sort) and that totalactual timedropped. If the planner ignores the new index, runANALYZEand re-check sargability before concluding the index is wrong.
WARNING
Bare EXPLAIN shows the planner's guess, not reality — it never runs the query, so it can't reveal a Nested Loop that estimated 5 rows and processed half a million, or which node actually burned the time. Diagnose with EXPLAIN ANALYZE every time; tuning from estimates is how you add the wrong index.
WARNING
A wide estimated-vs-actual row gap (>10x) means stale statistics, and that is the root cause — fix it with ANALYZE before adding indexes. An index chosen to compensate for a bad estimate is often useless or harmful once the estimate is corrected, and you'll have shipped a write-amplifying index that the planner ignores.
NOTE
EXPLAIN ANALYZE executes the statement. For INSERT/UPDATE/DELETE, run it inside BEGIN; ... ROLLBACK; so diagnosis doesn't change data — and be aware it still fires triggers and acquires locks during the run.
Output
A short report with three parts:
- Annotated plan — the offending node quoted from the
EXPLAIN ANALYZEoutput, with itsactual rowsvs. estimate,loops,Rows Removed by Filter, andBuffers, plus a one-line statement of why it's the bottleneck (Seq Scan / stale-stats row gap / Nested Loop blowup / disk Sort / non-sargable predicate). - The specific fix — exact
CREATE INDEX ... CONCURRENTLYDDL with the column order justified, or the SQL rewrite, or theANALYZE <table>command. One concrete action, not a menu. - Before/after proof — total
actual timeand the changed node type from the re-run plan (e.g.Seq Scan 1240 ms → Index Scan 3 ms), confirming the bad node is gone rather than asserting it should be.
Frequently asked questions
- Why isn't bare EXPLAIN enough to diagnose a slow query?
- Bare EXPLAIN prints the planner's cost model GUESS — estimated rows and cost — but never runs the query, so it can't show where time actually went or whether the estimates were wildly wrong. EXPLAIN ANALYZE executes the query and reports actual rows and per-node timing, which is the only way to see a Nested Loop that estimated 5 rows but processed 500,000, or a node that the planner thought was cheap but dominated runtime. Always diagnose with ANALYZE; never tune from estimates alone.
- I added the index but the planner still does a Seq Scan — why?
- Either the index can't serve the predicate (a function/cast on the column like lower(email) or date(created_at), or a leading-wildcard LIKE '%x', is not sargable so a plain b-tree on the raw column is useless), the statistics are stale so the planner mis-estimates and thinks the scan is cheaper, or the query genuinely returns a large fraction of the table where a Seq Scan IS correct. Check sargability first, run ANALYZE, then re-EXPLAIN ANALYZE before assuming the index is wrong.
Related
- Postgres Index StrategistRecommend the right Postgres index for a query or workload — choosing B-Tree vs. GIN vs. BRIN vs. partial/covering/expression, checking for redundant or unused indexes, and verifying the choice against the query plan. Use when a query needs an index, when deciding an index type for jsonb/array/full-text/time-series data, or when auditing an over-indexed table.
- Connection Pool TunerSize and tune a database connection pool from the real constraint — the database's shared max_connections and its core count — so total connections (per-instance pool × instance count) stay safely under the cap and a too-large pool stops adding latency. Use when the app throws 'too many connections' or pool-acquire timeouts, when the DB is saturated by connection count, or when deploying to serverless.
- Migration WriterWrite a safe, reversible, zero-downtime database migration using expand-contract — add the new shape, backfill in batches, switch reads/writes, then drop the old — so every deploy stays compatible with the running app version. Use when adding or changing schema on a live system, renaming/dropping a column, adding NOT NULL or a foreign key on a large table, or when a migration risks locks, table rewrites, or an unrevertable step.
- Deadlock DiagnoserDiagnose a database deadlock from the engine's own deadlock report, reconstruct the lock cycle (A holds 1 wants 2, B holds 2 wants 1), name the root cause — almost always two code paths locking the same rows in different orders — and fix it with consistent lock ordering, shorter transactions, and a retry-the-victim safeguard. Use when the DB logs deadlock errors, when transactions intermittently fail under load, or when queries mysteriously block each other.