Connection Pool Tuner
Size and tune a database connection pool from the real constraint — the database's shared max_connections and its core count — so total connections (per-instance pool × instance count) stay safely under the cap and a too-large pool stops adding latency. Use when the app throws 'too many connections' or pool-acquire timeouts, when the DB is saturated by connection count, or when deploying to serverless.
npx agentscamp add skills/connection-pool-tunerInstall to ~/.claude/skills/connection-pool-tuner/SKILL.md
Connection pools fail two ways: too large (every instance's pool sums past the DB's max_connections and exhausts it) or naively large (a pool bigger than the DB has cores just queues, adding latency). This skill sizes the per-instance pool from concurrency and core count, does the instances-times-pool math with headroom, sets the timeouts, and routes serverless through a pooler.
Connection pools fail in two opposite ways, and a "nice round number" like 100 walks into both. Too large and every app instance's pool sums past the database's shared max_connections, so the next deploy or traffic spike exhausts the server and every instance starts throwing. Naively large and the pool is bigger than the DB has cores to serve it, so the extra connections don't add parallelism — they queue inside the database and add latency. This skill sizes the per-instance pool from concurrency need and core count, does the instances × pool ≤ max_connections arithmetic with real headroom, sets the timeouts that recycle dead connections, and sends serverless through a pooler instead of multiplying pools.
When to use this skill
- The app logs
FATAL: too many connections/remaining connection slots are reserved, or pool-acquire timeouts ("timed out fetching a connection from the pool"). - The database is saturated by connection count (high
pg_stat_activityrows, memory pressure from per-connection backends) rather than by slow queries. - You scaled out app instances or autoscaling kicked in, and the DB started erroring even though per-instance load looks fine.
- You're deploying to serverless / many short-lived instances (Lambda, Vercel functions, Cloud Run) and need a connection strategy.
- Standing up a new service and picking a pool size before it hits production.
Instructions
- Find the real ceiling first. Read the database's
max_connections(PostgresSHOW max_connections, MySQLmax_connections) — this is shared across everything: every app instance, background workers, migrations, replicas, admin/psqlsessions, and the monitoring agent. Postgres also reservessuperuser_reserved_connections. Treat the usable budget as roughlymax_connections − reserved − headroom, not the raw number. - Count every connection source, not just the web app. Total connections = (per-instance pool × app instance count) + worker/cron pools + replicas + migration tooling + a margin for admin sessions and a deploy overlap (old and new instances live simultaneously during rolling deploys — pools effectively double for that window). Enumerate each source by grepping for pool config (
max,pool_size,maximumPoolSize,DATABASE_URL,?connection_limit=). - Size the per-instance pool from concurrency, capped by cores — not by a big round number. A connection only does work when the DB has a free core to run its query. The starting heuristic for a CPU-bound OLTP workload is near the DB's core count for the whole fleet, so per-instance pool ≈
(useful_DB_concurrency) / instance_count, often a small single-digit number. Going higher doesn't buy parallelism — it buys a queue. For I/O-bound queries (lots of waiting) you can go somewhat above core count, but measure rather than assume. - Do the exhaustion arithmetic explicitly and leave headroom. Compute
instances × pool + other_sourcesand confirm it stays under the usable budget at max autoscale, not at average instance count. Size against the ceiling the autoscaler can reach, then keep ~20–30% ofmax_connectionsfree for migrations, admin, replication, and deploy overlap. If the math doesn't fit, shrink the pool before raisingmax_connections(each Postgres backend costs real memory). - Set the four timeouts deliberately — defaults leak or stall.
- Acquire / pool timeout — how long a request waits for a free connection before failing fast (e.g. a few seconds). Without it, a saturated pool turns into unbounded queueing and looks like a hang.
- Idle timeout — return idle connections so the pool shrinks under low load and you're not holding slots the DB could give elsewhere.
- Max lifetime — recycle each connection after a bounded age (e.g. 30 min) so a load balancer / DNS failover / DB restart doesn't leave stale half-dead connections in the pool.
- Min / idle floor — keep a small warm minimum to avoid connect latency on the first request, but not so high that idle instances hoard the budget.
- Handle serverless and many-instances specially — route through a pooler. When instance count is large or unbounded (one pool per function invocation), per-instance pools multiply faster than any safe per-instance number can absorb. Don't fix it by shrinking the per-function pool to 1 alone — put a pooler between the app and the DB: PgBouncer in transaction mode, RDS Proxy, Supabase's pooler, or a provider serverless/HTTP driver. The pooler multiplexes hundreds of client connections onto a small set of real DB connections; keep the per-function pool at 1–2 behind it.
WARNING
Scaling out app instances silently multiplies total connections. A pool of 20 that's fine on 3 instances (60) exhausts a 100-connection DB the moment the autoscaler reaches 5 instances — and it fails everywhere at once, not gracefully. Always size against max autoscale × pool, plus the deploy-overlap doubling, never average instance count.
WARNING
A bigger pool is frequently slower, not faster. Past the DB's effective core count, added connections don't run in parallel — they queue inside the database and add context-switching overhead, raising p99 latency while throughput stays flat. If the pool is large and the DB is CPU-bound, the fix for latency is usually to shrink the pool.
NOTE
Transaction-mode poolers (PgBouncer) break features that hold state across statements on one connection: session-level SET, advisory locks, LISTEN/NOTIFY, and some prepared-statement modes. Use session mode (or a dedicated direct connection) for those paths, and run migrations against the DB directly, not through the transaction pooler.
Output
A pool-sizing recommendation, concretely:
- The math — usable budget (
max_connections − reserved − headroom), andinstances_at_max_autoscale × per_instance_pool + other_sourcesshown to land under it with the headroom stated. - Recommended per-instance pool size with the rationale (concurrency need vs. DB core count, and which workload type it is), plus separate sizes for worker/cron pools.
- Timeout/lifetime settings — acquire timeout, idle timeout, max lifetime, and min/idle floor, with the value and why each is set.
- Serverless recommendation if applicable — the specific pooler (PgBouncer transaction mode / RDS Proxy / serverless driver), the per-function pool size behind it, and any session-mode caveats for stateful paths.
Frequently asked questions
- Why is a bigger connection pool often slower?
- A connection only does work when the database has a free core to run its query. Once the pool exceeds roughly the DB's core count, extra connections don't run in parallel — they queue inside the database and add context-switching overhead, so p99 latency rises while throughput stays flat or drops. A smaller pool that matches core capacity is frequently faster.
- How do I size pools for serverless where instances scale to hundreds?
- Don't give each function a large pool — N instances × pool size exhausts max_connections instantly. Put a pooler (PgBouncer in transaction mode, RDS Proxy, or a provider's serverless/HTTP driver) between the functions and the DB, keep the per-function pool at 1–2, and let the pooler multiplex many client connections onto a small set of real DB connections.
Related
- Postgres Index StrategistRecommend the right Postgres index for a query or workload — choosing B-Tree vs. GIN vs. BRIN vs. partial/covering/expression, checking for redundant or unused indexes, and verifying the choice against the query plan. Use when a query needs an index, when deciding an index type for jsonb/array/full-text/time-series data, or when auditing an over-indexed table.
- Migration WriterWrite a safe, reversible, zero-downtime database migration using expand-contract — add the new shape, backfill in batches, switch reads/writes, then drop the old — so every deploy stays compatible with the running app version. Use when adding or changing schema on a live system, renaming/dropping a column, adding NOT NULL or a foreign key on a large table, or when a migration risks locks, table rewrites, or an unrevertable step.
- Structured Logging DesignerDesign a structured (JSON) logging strategy with a stable field schema, correlation-ID propagation, and a disciplined level policy — then migrate ad-hoc string logs toward it. Use when logs are unsearchable plain text, when debugging a request across services means grepping multiple log streams by hand, or when standing up logging for a new service.
- Deadlock DiagnoserDiagnose a database deadlock from the engine's own deadlock report, reconstruct the lock cycle (A holds 1 wants 2, B holds 2 wants 1), name the root cause — almost always two code paths locking the same rows in different orders — and fix it with consistent lock ordering, shorter transactions, and a retry-the-victim safeguard. Use when the DB logs deadlock errors, when transactions intermittently fail under load, or when queries mysteriously block each other.
- Query Plan AnalyzerRead a slow query's execution plan and turn it into a concrete fix — the exact index to add, the rewrite, or the ANALYZE to run — by getting the REAL plan with EXPLAIN ANALYZE (actual rows + timing, not estimates), finding the offending node, and confirming the fix removes it. Use when one specific query is slow and you need to know WHY, not just that it is.
- Cold Start OptimizerCut cold-start latency for serverless functions and slow-booting apps by measuring the init breakdown, then attacking the dominant phase — artifact size, eager imports, eager connections, or under-provisioned memory — instead of reflexively buying provisioned concurrency. Use when serverless p99 spikes on the first request, when a function times out during init, or when scale-to-zero is hurting user-facing latency.
- Memory Leak HunterFind and fix a memory leak in a running app: confirm it's a real leak under steady load, diff two heap snapshots to name the growing object and its retention path, cut the root reference that blocks collection, and re-run to confirm memory plateaus. Use when RSS climbs until OOM/restart, heap grows unbounded across a steady workload, or GC pauses worsen the longer the process runs.
- Add CachingAdd a caching layer to one expensive function or endpoint correctly — confirm it's cacheable, design the cache key/TTL/layer/invalidation, handle stampedes, wrap the call in one place, and report the design.