SLO Definer
Turn a vague reliability goal into concrete SLIs, SLOs, an error budget, and burn-rate alerts — service-level indicators measured at the user-facing boundary, targets over a rolling window, and a written policy for what happens when the budget runs out. Use when a service has no defined reliability target, when on-call is noisy and alert-fatigued, or before you commit to an SLA you can't measure.
npx agentscamp add skills/slo-definer
Install to ~/.claude/skills/slo-definer/SKILL.md
"Make it reliable" is not a target you can measure or alert on. This skill turns it into SLIs (good-event ratios at the user boundary), SLOs (a target over a rolling window), an error budget with a written spend policy, and burn-rate alerts that page when the budget is at risk — not when a host metric blips.
"Make it reliable" can't be measured, can't be alerted on, and can't tell you when to stop shipping. This skill converts a reliability intention into four artifacts that can: SLIs that measure what users actually experience, SLOs that set a target over a window, an error budget with a written policy for spending and exhausting it, and burn-rate alerts that page when the budget is genuinely at risk. The output is a spec, not a dashboard — a contract the team and on-call can both point at.
When to use this skill
- A service is "important" but has no defined reliability target, so nobody can say whether last week was good or bad.
- On-call is drowning in pages that don't correspond to user pain — alert fatigue from threshold blips on CPU, memory, or a single 5xx.
- You're about to sign an SLA and need an internal SLO (tighter, measurable) to back it before you promise anything externally.
- You have dashboards full of metrics but can't answer "are users having a good time right now, and how much room do we have left to break things?"
Instructions
- Identify the user and the boundary first. An SLI measures the experience of a consumer (end user, calling service) at a specific boundary — the load balancer, the API gateway, the client SDK. Measure as close to the user as you can: a 200 at the app server while the CDN returns 502s is a lie. Name the boundary explicitly before picking metrics.
- Pick the few SLIs that reflect that experience. Choose from the request/response SLI families: availability (good-event ratio: non-5xx, non-timeout responses ÷ total valid requests), latency (fraction of requests served faster than a threshold; a 99% target on that ratio is the same as bounding p99), and for data systems freshness (fraction of reads no older than N seconds) or correctness/coverage. Two or three SLIs per service is plenty; more dilutes the signal.
- Write each SLI as an explicit good-event criterion. Spell out what counts as a good event, what's in the denominator, and what's excluded. Example: latency SLI = (requests with TTFB < 300ms) / (all non-4xx requests at the gateway). Exclude client errors (4xx) and load-test traffic from the denominator, since they aren't the service failing, but say so in writing.
- Set the SLO as a target over a rolling window grounded in user need. Format: "X% of [good events] over [rolling window]", e.g. "99.9% of requests succeed over 28 days". Use a rolling window (28 days is common because it always covers four full weeks) rather than calendar months, so the budget doesn't reset on the 1st and the weekday mix stays constant. Pick the lowest target users genuinely won't notice; if you can't justify the extra nine from user impact, don't pay for it.
- Derive the error budget and write its spend policy. The budget is 1 − SLO over the window: a 99.9% SLO allows 0.1% bad events. Over 28 days (40,320 minutes) that's about 40 minutes of total unavailability, or 0.1% of requests. State who may spend it (experiments, risky migrations, planned maintenance all draw down the same budget) and the exhaustion rule in writing: when the budget is gone, risky changes freeze and reliability work takes priority until the window recovers. A budget with no consequence is just a number.
- Tie alerts to burn rate, not to thresholds. Alert on how fast the budget is being consumed relative to the window; a burn rate of 1× spends exactly the whole budget by the end of the window. Run two: a fast-burn alert (e.g. 14.4× burn over 1 hour = ~2% of a 28-day budget gone in an hour → page now) and a slow-burn alert (e.g. ~3× burn over 6 hours → ticket, not a page). This makes a page mean "the budget is at risk," with high precision and low noise, instead of "5xx crossed 5 for 30 seconds."
- Sanity-check against history before committing. Read recent latency/error data (logs, metrics exports) and confirm the proposed SLO is currently achievable and meaningful — not already breached every week (unattainable, so it'll be ignored) and not trivially met with 100× headroom (no signal). Adjust the target to the real distribution.
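The arithmetic behind these steps can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `Request` record, its `load_test` flag, and every function name below are hypothetical, and the thresholds are simply the example values from the steps above.

```python
from dataclasses import dataclass

WINDOW_DAYS = 28  # rolling window used in the examples above


@dataclass(frozen=True)
class Request:
    status: int              # HTTP status observed at the gateway
    ttfb_ms: float           # time to first byte, in milliseconds
    load_test: bool = False  # hypothetical flag marking synthetic traffic


def latency_sli(requests, threshold_ms=300.0):
    """Good-event ratio: valid requests with TTFB under the threshold."""
    # The denominator excludes client errors (4xx) and load-test traffic.
    valid = [r for r in requests
             if not (400 <= r.status < 500) and not r.load_test]
    if not valid:
        return None  # no valid traffic, so the SLI is undefined
    good = sum(1 for r in valid if r.ttfb_ms < threshold_ms)
    return good / len(valid)


def error_budget_minutes(slo, window_days=WINDOW_DAYS):
    """Full-outage minutes the SLO allows: (1 - SLO) * window length."""
    return (1.0 - slo) * window_days * 24 * 60


def burn_rate(bad_events, total_events, slo):
    """Observed error rate divided by the error rate the SLO allows."""
    if total_events == 0:
        return 0.0
    return (bad_events / total_events) / (1.0 - slo)


def budget_consumed(rate, lookback_hours, window_days=WINDOW_DAYS):
    """Fraction of the window's budget spent at `rate` over the look-back."""
    return rate * lookback_hours / (window_days * 24)


def classify(one_hour_rate, six_hour_rate, fast_limit=14.4, slow_limit=3.0):
    """Map the two burn rates to an action: page, ticket, or ok."""
    if one_hour_rate >= fast_limit:
        return "page"
    if six_hour_rate >= slow_limit:
        return "ticket"
    return "ok"
```

For a 99.9% SLO, `error_budget_minutes(0.999)` comes out at roughly 40.3 minutes, and `budget_consumed(14.4, 1)` at roughly 0.021, which is the "about 2% in an hour" figure. Floating-point rounding means a burn rate computed from raw counts can land a hair under a limit, so a production rule would likely compare with a small tolerance.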
WARNING
A 100% SLO is a trap: it leaves zero error budget, so every deploy is a potential breach and the only "safe" move is to never change the system. The gap below 100% is precisely the room you have to ship, experiment, and do maintenance — design it in deliberately.
WARNING
Averages hide the tail. A 200ms average latency is consistent with 2% of users waiting 4 seconds (the other 98% only need to average about 120ms), and the tail is where users churn. Always state latency SLIs as a percentile (p95/p99 served under a threshold), never as a mean.
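A small synthetic sample makes the point concrete. The latency values below are invented for illustration, and `percentile_nearest_rank` and `fraction_under` are hypothetical helpers rather than functions from any monitoring library.

```python
import math
import statistics

# Synthetic sample: 98 fast requests and 2 very slow ones (milliseconds).
latencies_ms = [120.0] * 98 + [4000.0] * 2


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile: smallest value with pct% of samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def fraction_under(values, threshold_ms):
    """Share of requests served faster than the threshold."""
    return sum(1 for v in values if v < threshold_ms) / len(values)


mean_ms = statistics.fmean(latencies_ms)             # 197.6
p95_ms = percentile_nearest_rank(latencies_ms, 95)   # 120.0
p99_ms = percentile_nearest_rank(latencies_ms, 99)   # 4000.0
good_ratio = fraction_under(latencies_ms, 300.0)     # 0.98
```

The mean stays under 200ms and even p95 looks healthy; only p99 and the under-threshold ratio show that one request in fifty takes four seconds. Which percentile to state depends on how much of the tail the SLO is meant to protect.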
NOTE
System metrics are not SLIs. CPU, memory, disk, and queue depth are causes, useful for debugging, but a user never files a ticket about your CPU. SLIs live at the user-facing boundary; keep host metrics on the diagnosis dashboard, out of the SLO spec.
Output
A reliability spec containing: (1) SLI definitions — for each, what's measured, the boundary it's measured at, and the exact good-event criterion (numerator/denominator + exclusions); (2) SLO targets — the percentage and rolling window per SLI, with the user-impact rationale; (3) the error budget — 1 − SLO translated into concrete allowance (minutes and/or request count over the window) plus the written spend-and-exhaustion policy; and (4) the burn-rate alert thresholds — fast-burn (page) and slow-burn (ticket) multipliers and look-back windows. Reproducible: the same spec can be re-derived and re-checked against fresh data each quarter.
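One way to keep such a spec re-checkable is to hold it as data next to a small review function. The sketch below is an assumption about shape only: `SLOSpec`, its field names, and `review` are hypothetical, and the event counts passed in at the end are made-up numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLOSpec:
    sli_name: str      # which SLI the target applies to
    boundary: str      # where the SLI is measured
    good_event: str    # written good-event criterion
    exclusions: tuple  # what is left out of the denominator
    target: float      # SLO as a fraction, e.g. 0.999
    window_days: int   # rolling window length
    fast_burn: tuple   # (multiplier, look-back hours) that pages
    slow_burn: tuple   # (multiplier, look-back hours) that opens a ticket

    def __post_init__(self):
        # A 100% target leaves no error budget, so reject it up front.
        if not 0.0 < self.target < 1.0:
            raise ValueError("target must be strictly between 0 and 1")

    def error_budget(self):
        """Fraction of events allowed to be bad: 1 - SLO."""
        return 1.0 - self.target

    def review(self, good_events, total_events):
        """Re-check the spec against event counts from one full window."""
        if total_events <= 0:
            raise ValueError("need at least one event to review")
        bad = total_events - good_events
        allowed_bad = self.error_budget() * total_events
        return {
            "achieved": good_events / total_events,
            "met": bad <= allowed_bad,
            "budget_remaining": 1.0 - bad / allowed_bad,
        }


spec = SLOSpec(
    sli_name="latency",
    boundary="API gateway",
    good_event="TTFB < 300ms",
    exclusions=("4xx responses", "load-test traffic"),
    target=0.999,
    window_days=28,
    fast_burn=(14.4, 1),
    slow_burn=(3.0, 6),
)

# Illustrative counts: 600 bad events out of one million valid requests.
report = spec.review(good_events=999_400, total_events=1_000_000)
```

With these counts the target is met and about 40% of the budget remains. Rerunning `review` on fresh counts each quarter is the re-check described above; a negative `budget_remaining` means the window is overspent.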
Frequently asked questions
- What's the difference between an SLI, an SLO, and an SLA?
- An SLI is the measurement (e.g. fraction of requests served < 300ms). An SLO is the internal target for that SLI over a window (e.g. 99.9% over 28 days). An SLA is an external, contractual promise — usually looser than the SLO, with penalties — that you should only sign after the SLO has held in production.
- Why not just aim for 100% reliability?
- A 100% SLO leaves zero error budget, so every deploy, migration, or experiment is a potential breach and the rational move is to stop changing the system. Pick the lowest target users won't notice (often 99.9%); the gap below 100% is the budget that buys you the freedom to ship.
Related
- Structured Logging Designer: Design a structured (JSON) logging strategy with a stable field schema, correlation-ID propagation, and a disciplined level policy, then migrate ad-hoc string logs toward it. Use when logs are unsearchable plain text, when debugging a request across services means grepping multiple log streams by hand, or when standing up logging for a new service.
- Load Test Designer: Design a defensible load test (a realistic workload model, a deliberate test type, and SLO-tied pass/fail thresholds) instead of a meaningless tight-loop script that hammers one endpoint. Use when validating capacity or SLOs before a launch or scaling event, when sizing infrastructure, or when an existing load test reports averages that nobody trusts.
- Rate Limiter Designer: Design and implement API rate limiting that actually holds under load. Pick the algorithm (token bucket vs sliding-window-counter vs fixed window) and justify it, choose the limiting key and per-tier limits, use cross-instance atomic storage, and return standard 429 signals. Use when protecting an API from abuse or scrapers, enforcing per-tier quotas, or replacing an in-memory limiter that breaks behind multiple replicas.
- Runbook Writer: Write an operational runbook a half-asleep on-call engineer can execute at 3am, scoped to ONE alert, leading with how to confirm the problem, the copy-pasteable mitigation that stops user pain, then diagnosis, escalation, and verification. Use when an alert has no documented response, after an incident exposed a missing procedure, or when standing up on-call for a service.
- Alerting Rules Tuner: Cut alert noise and make every page mean something. Rewrite alerting rules to fire on user-felt symptoms (error rate, latency SLO burn, failed requests) instead of causes (high CPU, full disk), with duration windows and severity routing so only urgent, actionable conditions reach a human. Use when on-call is fatigued by low-value pages, when real incidents get missed in the noise, or when alerts fire on causes rather than impact.
- Dashboard Designer: Design a service dashboard that answers one question at a glance (is the service healthy, and if not, where's the problem?) by structuring panels around RED/USE instead of dumping every metric. Use when a service has no dashboard, when the existing one is an unreadable metric wall, or during incident-readiness prep.
- Distributed Tracing Instrumenter: Instrument a service (or a chain of services) with OpenTelemetry so a single request can be followed end-to-end: context propagated across every hop including async/queue boundaries, spans at the boundaries that matter, deliberate trace-wide sampling, and trace_id stamped on log lines. Use when latency or failures span multiple services, when you have logs but can't reconstruct a request's full path, or when adopting OpenTelemetry.
- Canary Release Planner: Design a canary / progressive rollout so a bad release reaches 1% of users instead of 100%: staged traffic with bake times, gating metrics compared against the concurrently-running stable baseline, and automated promote-or-rollback. Use when shipping a risky change, when you want automatic rollback on regression, or when moving off all-at-once deploys.