Rate Limiter Designer
Design and implement API rate limiting that actually holds under load — pick the algorithm (token bucket vs sliding-window-counter vs fixed window) and justify it, choose the limiting key and per-tier limits, use cross-instance atomic storage, and return standard 429 signals. Use when protecting an API from abuse or scrapers, enforcing per-tier quotas, or replacing an in-memory limiter that breaks behind multiple replicas.
npx agentscamp add skills/rate-limiter-designer
Install to ~/.claude/skills/rate-limiter-designer/SKILL.md
Most rate limiters fail in two ways: they live in process memory (so each replica enforces its own private quota) or they read-then-write without atomicity (so concurrent requests slip past the limit). This skill picks the right algorithm for the traffic shape, a key and limits per tier, cross-instance atomic storage, and standard 429 + RateLimit-* headers — then sketches the handler.
A rate limiter is only as correct as its storage and its atomicity. An in-memory counter behind three replicas allows up to 3x the configured limit; a GET then SET without an atomic increment lets a burst of concurrent requests all read the same pre-increment value and pass. This skill makes the decisions explicit (algorithm, key, limits, storage, failure mode) and produces an implementation sketch that survives horizontal scaling and concurrency.
When to use this skill
- You're protecting an API (public, partner, or internal) from abuse, scrapers, credential stuffing, or runaway clients.
- You need per-tier quotas (free vs pro vs enterprise) or per-endpoint limits (cheap reads vs expensive writes/exports).
- You have an existing in-memory limiter and the service now runs more than one instance, so the effective limit drifts with replica count.
- A downstream dependency (a paid API, a database, an LLM provider) needs protecting from your own traffic spikes.
Instructions
- Pick the algorithm from the traffic shape, and justify it. Three viable choices:
  - Token bucket: refills at a steady rate, allows configurable bursts up to bucket capacity. Use for interactive/bursty clients (a user clicking fast, batch jobs) where occasional bursts are legitimate. Default choice for most APIs.
  - Sliding-window counter: approximates a true sliding window by weighting the previous and current fixed windows. Use when you need smooth enforcement without burst spikes (protecting a fragile downstream). Cheap: two counters per key.
  - Fixed window: one counter per key per interval. Use only when simplicity outweighs correctness; it permits up to 2x the limit across a window boundary (full quota at the end of one window plus full quota at the start of the next). Never use it to protect something that genuinely caps at the configured limit.

  State which you chose and the burst/smoothness tradeoff that drove it.
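To make the boundary problem concrete, here is a minimal sketch of the sliding-window-counter estimate next to a plain fixed-window check. The function names and the convention that counts exclude the request being evaluated are assumptions for illustration, and in a real deployment the two counters would live in the shared store rather than be passed in.

```python
def sliding_window_estimate(prev_count: int, curr_count: int,
                            now: float, window_seconds: float) -> float:
    """Estimate requests in the trailing window from two fixed-window counters."""
    # Fraction of the current fixed window that has already elapsed.
    elapsed_fraction = (now % window_seconds) / window_seconds
    # The previous window only counts for the part still inside the trailing window.
    return prev_count * (1.0 - elapsed_fraction) + curr_count


def sliding_window_allows(prev_count: int, curr_count: int, now: float,
                          window_seconds: float, limit: int) -> bool:
    """Admit the request only while the weighted estimate is under the limit."""
    return sliding_window_estimate(prev_count, curr_count, now, window_seconds) < limit


def fixed_window_allows(curr_count: int, limit: int) -> bool:
    """A fixed window ignores the previous window entirely."""
    return curr_count < limit
```

With a limit of 100 per 60 seconds, a client that used its full quota late in the previous window and arrives 1 second into the next one gets a fresh quota of 100 from `fixed_window_allows(0, 100)`, while `sliding_window_estimate(100, 0, 61.0, 60.0)` still reports roughly 98 requests as spent.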
- Choose the limiting key, and prefer a composite. Options and their failure modes:
  - IP: defeated by NAT (one office shares an IP, causing collateral throttling) and by rotating proxies. Use only for unauthenticated traffic.
  - API key / authenticated user: the right granularity for quotas; ties the limit to identity, not network. Requires the limiter to run after auth.
  - Composite (e.g. user + endpoint, or apiKey + route-class): lets expensive endpoints have tighter limits than cheap ones under the same identity.

  Pick the key per route class. Unauthenticated routes fall back to IP; authenticated routes key on identity.
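One way to express the key choice is a small builder that prefers identity and falls back to IP. The `rl:` prefix, the key layout, and the `limiter_key` name are hypothetical; the only point carried over from the text is that the route class is always part of the key and IP is used only when no identity is available.

```python
from typing import Optional


def limiter_key(route_class: str, identity: Optional[str], client_ip: str) -> str:
    """Build a composite limiter key for one request.

    identity is the API key or user id resolved by auth, or None for
    unauthenticated traffic. The key layout is an illustrative assumption.
    """
    if identity is not None:
        # Authenticated: tie the limit to who is calling, not where from.
        return f"rl:id:{identity}:{route_class}"
    # Unauthenticated: IP is the only handle available.
    return f"rl:ip:{client_ip}:{route_class}"
```

Because the route class is in the key, the same user gets independent counters for, say, a `default` class and an `export` class, which is what allows the tighter limit on the expensive one.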
- Set limits per tier, written down as a table. Define explicit numbers, e.g. free = 60 req/min, pro = 600 req/min, enterprise = custom; expensive endpoints (export, search, LLM-backed) get their own lower limit. Don't invent one global number: the whole point is differentiation.
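The table can live in code as plain data. In this sketch the free and pro default numbers come from the example above; the `expensive` route class numbers, the `Limit` type, and the per-customer override mechanism for enterprise are assumptions chosen only to show the shape.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Limit:
    requests: int
    window_seconds: int


# (tier, route_class) -> limit. The "expensive" rows are made-up example values.
LIMITS: Dict[Tuple[str, str], Limit] = {
    ("free", "default"): Limit(60, 60),
    ("pro", "default"): Limit(600, 60),
    ("free", "expensive"): Limit(5, 60),
    ("pro", "expensive"): Limit(30, 60),
}


def resolve_limit(tier: str, route_class: str,
                  overrides: Optional[Dict[str, Limit]] = None) -> Limit:
    """Look up the limit; custom (e.g. enterprise) limits arrive via overrides."""
    if overrides and route_class in overrides:
        return overrides[route_class]
    try:
        return LIMITS[(tier, route_class)]
    except KeyError:
        # Refuse to guess a number for an unconfigured combination.
        raise LookupError(f"no limit configured for tier={tier!r}, route_class={route_class!r}")
```

Raising on a missing entry is a deliberate choice in this sketch: a silently applied global default is exactly the single number the step warns against.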
- Use storage that is shared and atomic. The counter must live in a store all instances reach, such as Redis (or equivalent), and the increment-and-check must be atomic. With Redis, use INCR + EXPIRE on the same key, setting the expiry only when INCR returns 1 so that later requests don't keep extending the window (or use a single Lua script for token bucket, so read-refill-decrement is one atomic operation). A GET then SET from application code is a race: concurrent requests read the same value and all pass. In-memory storage (a Map, an LRU) is correct only for a single-process service and is otherwise a silent bug: each replica keeps its own private quota.
- Return standard signals. On limit exceeded, respond 429 Too Many Requests with:
  - Retry-After: <seconds>, telling the client when it may retry.
  - RateLimit-Limit, RateLimit-Remaining, RateLimit-Reset: emit these on every response (not just 429s) so well-behaved clients self-throttle before hitting the wall. Reset is seconds-until-reset (or a Unix timestamp; be consistent and document which).
- Decide fail-open vs fail-closed when the store is down. This is a deliberate choice, not a default:
  - Fail-open (allow when Redis is unreachable): preserves availability; correct for limiters that protect against abuse where a brief gap is acceptable.
  - Fail-closed (reject): correct when the limit guards a hard resource cap (a paid downstream, a quota you're contractually bound to).

  Wrap the store call in a short timeout so a slow store doesn't hang every request; on timeout, apply the chosen policy.
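The timeout-plus-policy wrapper might look like the sketch below. `allow_with_policy` and `store_check` are hypothetical names; `store_check` stands for whatever call asks the shared store whether the request is within its limit. A thread pool is used here only to bound the wait in a self-contained way; with a real client, its own socket or command timeout would usually be the first choice.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as StoreTimeout
from typing import Callable

# Small shared pool so a slow store call can be abandoned by the caller.
_executor = ThreadPoolExecutor(max_workers=8)


def allow_with_policy(store_check: Callable[[], bool],
                      timeout_seconds: float,
                      fail_open: bool) -> bool:
    """Run the limiter check with a bounded wait, then apply the failure policy."""
    future = _executor.submit(store_check)
    try:
        return future.result(timeout=timeout_seconds)
    except StoreTimeout:
        # Store is too slow: stop waiting and apply the policy.
        # The abandoned call still runs to completion in its worker thread.
        return fail_open
    except Exception:
        # Store unreachable or erroring: same policy decision.
        return fail_open
```

Passing `fail_open` per call keeps the decision visible at each route: an abuse limiter on a public read endpoint can pass `True`, while a limiter guarding a paid downstream passes `False`.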
- Handle clock skew and bursts. Compute windows from the store's clock (e.g. Redis TIME) or a single source, not each instance's wall clock; skewed instances otherwise disagree on window boundaries. For token bucket, set capacity = the largest legitimate burst and refill rate = the sustained limit; document both.
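The relationship between capacity, refill rate, and the clock can be shown with a small in-memory token bucket. This is only an illustration of the arithmetic: the `Bucket` type and `take_token` name are assumptions, `now` is passed in explicitly to stand for a timestamp from a single clock source, and in a multi-instance deployment the same read-refill-decrement steps would need to run atomically inside the shared store.

```python
from dataclasses import dataclass


@dataclass
class Bucket:
    tokens: float       # tokens currently available
    updated_at: float   # timestamp of the last refill, from the single clock source


def take_token(bucket: Bucket, now: float,
               capacity: float, refill_per_second: float) -> bool:
    """Refill for the elapsed time, then try to spend one token."""
    # Ignore backwards clock movement instead of refilling a negative amount.
    elapsed = max(0.0, now - bucket.updated_at)
    # Refill at the sustained rate, never beyond the burst capacity.
    bucket.tokens = min(capacity, bucket.tokens + elapsed * refill_per_second)
    bucket.updated_at = max(bucket.updated_at, now)
    if bucket.tokens >= 1.0:
        bucket.tokens -= 1.0
        return True
    return False
```

With `capacity=10` and `refill_per_second=1.0`, a full bucket admits a burst of 10 immediately and then one request per second, so the two parameters map directly onto "largest legitimate burst" and "sustained limit".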
WARNING
Per-instance in-memory limiting in a horizontally-scaled deploy is the most common rate-limiter bug: with N replicas and a round-robin load balancer, the effective limit is roughly N x the configured value, and it changes silently when you autoscale. If the service has more than one replica, the limiter state MUST be in shared storage.
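A toy simulation shows the multiplication. Everything here is hypothetical scaffolding: `InMemoryLimiter` stands for a per-process counter within a single window, and `allowed_through` plays the role of a round-robin load balancer.

```python
from itertools import cycle


class InMemoryLimiter:
    """Per-process counter for one window; each replica has its own instance."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.count = 0

    def allow(self) -> bool:
        if self.count >= self.limit:
            return False
        self.count += 1
        return True


def allowed_through(replicas: int, limit: int, requests: int) -> int:
    """Send requests round-robin across replicas and count how many pass."""
    limiters = [InMemoryLimiter(limit) for _ in range(replicas)]
    balancer = cycle(limiters)
    return sum(1 for _ in range(requests) if next(balancer).allow())
```

Under these assumptions `allowed_through(1, 100, 1000)` admits 100 requests, while `allowed_through(3, 100, 1000)` admits 300 for the same configured limit.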
WARNING
Read-then-write without atomicity defeats the limiter under exactly the load it exists to stop. Concurrent requests all read the pre-increment count and all pass. Use an atomic INCR (fixed/sliding window) or a single Lua script (token bucket) — never GET then conditional SET from app code.
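A minimal fixed-window check built on an atomic increment could look like this. The `store` argument is assumed to expose `incr(key)` and `expire(key, seconds)` in the style of the redis-py client; `FakeAtomicStore` is a hypothetical in-process stand-in so the sketch runs without Redis, and it is not a substitute for shared storage.

```python
import threading
from typing import Dict


class FakeAtomicStore:
    """In-process stand-in for a shared store with an atomic increment."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._counts: Dict[str, int] = {}
        self.ttls: Dict[str, int] = {}

    def incr(self, key: str) -> int:
        # Increment and read back under one lock, like a single INCR command.
        with self._lock:
            self._counts[key] = self._counts.get(key, 0) + 1
            return self._counts[key]

    def expire(self, key: str, seconds: int) -> bool:
        self.ttls[key] = seconds
        return True


def allow_fixed_window(store, key: str, limit: int,
                       window_seconds: int, now: float) -> bool:
    """Atomically count this request and compare the new count to the limit."""
    # now should come from one clock source so instances agree on the window.
    window_key = f"{key}:{int(now // window_seconds)}"
    count = store.incr(window_key)  # the only read of the counter is the increment itself
    if count == 1:
        # First hit in this window: set the expiry once so stale keys clean up.
        store.expire(window_key, window_seconds * 2)
    return count <= limit
```

The decision uses the value returned by the increment, so no two concurrent requests can observe the same count. One gap remains in this two-command form: if the process dies between the increment and the expiry, the key is left without a TTL, which a single script or transaction in the store would close.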
NOTE
Don't rate-limit at the app when an upstream layer does it better. A CDN/WAF or API gateway (Vercel Firewall, Cloudflare, Kong) can enforce coarse IP limits at the edge before traffic reaches your origin; reserve app-level limiting for identity- and tier-aware quotas that need request context.
Output
A short design block stating: the chosen algorithm + rationale, the key per route class, a per-tier limits table, the storage mechanism (and why it's atomic + cross-instance), and the fail-open/closed policy with timeout. Followed by a concrete middleware/handler sketch that performs the atomic increment-and-check against the store, sets RateLimit-* headers on every response, returns 429 + Retry-After on breach, and applies the chosen failure policy when the store is unreachable.
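As an illustration of what the handler sketch might contain, the following framework-agnostic function turns one atomic count into a decision plus headers. The `Decision` type, the `increment(key, ttl_seconds)` callable (assumed to atomically increment in the shared store and return the new count), the fixed-window choice, and the 503 status for fail-closed are all assumptions made for this example; Reset is expressed as seconds until reset.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class Decision:
    allowed: bool
    status: Optional[int]  # None means: continue to the real handler
    headers: Dict[str, str] = field(default_factory=dict)


def check_request(increment: Callable[[str, int], int], key: str, limit: int,
                  window_seconds: int, now: float, fail_open: bool) -> Decision:
    """Count the request atomically and build the rate-limit response signals."""
    window_start = int(now // window_seconds) * window_seconds
    reset_in = max(1, math.ceil(window_start + window_seconds - now))
    try:
        count = increment(f"{key}:{window_start}", window_seconds)
    except Exception:
        # Store unreachable or timed out: apply the chosen failure policy.
        if fail_open:
            return Decision(True, None)
        return Decision(False, 503, {"Retry-After": "1"})
    headers = {
        "RateLimit-Limit": str(limit),
        "RateLimit-Remaining": str(max(0, limit - count)),
        "RateLimit-Reset": str(reset_in),
    }
    if count > limit:
        headers["Retry-After"] = str(reset_in)
        return Decision(False, 429, headers)
    return Decision(True, None, headers)
```

A middleware would call this before the route handler, copy `headers` onto the response in every case, and short-circuit with `status` when `allowed` is false.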
Related
- Webhook Handler Scaffolder: Scaffold a robust inbound webhook handler that verifies the signature on the raw body first, dedupes on the provider's event id, acknowledges fast, and processes asynchronously (the four things naive handlers get wrong). Use when wiring up events from a third party (Stripe, GitHub, Shopify, Slack, Twilio), when a provider keeps retrying because your endpoint times out or 500s, or when duplicate events are double-charging or double-creating records.
- Auth Flow Reviewer: Read-only review of authentication AND authorization flows (session/token model, cookie flags, CSRF, token rotation, password-reset/email-verification, OAuth redirect/state, and per-route object-level access checks) for exploitable gaps. Use before shipping login/session/token code, when adding a protected route or sharing-by-URL feature, or during a security pass. Reports findings by severity with location, impact, and the concrete fix; never edits code.
- Provider Fallback Wrapper: Wrap LLM calls so a provider outage, rate limit, or timeout degrades gracefully, with multi-provider fallback, bounded retries with backoff, and timeouts. Use when an app depends on a single model/provider and needs production resilience.
- GraphQL Schema Designer: Design a clean, evolvable GraphQL schema (SDL) that won't paint you into a corner. Model the graph around domain types and their relationships rather than as RPC-over-GraphQL, set nullability deliberately, standardize lists with Relay connections, plan DataLoader batching for per-parent fields, and evolve by adding + @deprecated instead of versioning. Use when designing a new GraphQL API, reviewing an SDL, or migrating REST endpoints to a graph.
- Idempotency Designer: Make unsafe, retryable API operations idempotent so a client retry or a network hiccup can't double-charge, double-create, or double-send. Design a client-supplied idempotency key, an atomic store-and-check (unique constraint or conditional write), in-flight conflict handling, and a retention policy. Use when a POST/mutation can be retried (payments, order creation, sends, webhooks), or when duplicate side effects have already shown up in production.
- Pagination Designer: Design correct, scalable pagination (plus the filtering and sorting that ride with it) for a list endpoint. Pick cursor (keyset) vs offset and justify it, define an opaque cursor with a unique tiebreaker so no row is skipped or repeated, return a consistent envelope, bound page size, and name the indexes the sort actually needs. Use when adding a list endpoint, when OFFSET pagination crawls on a large table, or when clients see duplicate or missing rows while paging.
- SLO Definer: Turn a vague reliability goal into concrete SLIs, SLOs, an error budget, and burn-rate alerts, with service-level indicators measured at the user-facing boundary, targets over a rolling window, and a written policy for what happens when the budget runs out. Use when a service has no defined reliability target, when on-call is noisy and alert-fatigued, or before you commit to an SLA you can't measure.