Set Perf Budget
Define and enforce a cost and latency budget for an LLM feature or endpoint — set p95/p99 latency and cost-per-request ceilings, instrument to measure them against real traffic, and wire a check that fails when the budget is breached.
/set-perf-budget <the LLM endpoint/feature to budget, plus any target numbers (e.g. 'chat API, p95 < 2s, < $0.02/req')>
Install to ~/.claude/commands/set-perf-budget.md
Scope
Treat $ARGUMENTS as the LLM feature or endpoint to put a budget around — and any target numbers the user gave. The job is to turn "it should be fast and cheap" into explicit, measured ceilings that a build or monitor can enforce, so cost and latency can't regress silently. A budget nobody checks is a wish; this command produces one that fails loudly.
NOTE
This command sets and enforces the budget. To find and cut whatever is over budget afterward, hand off to the llm-cost-optimizer agent; for the techniques behind the targets, see LLM Cost and Latency Engineering.
Step 1 — Pin the budget numbers
Settle the ceilings before measuring anything:
- Latency — p50/p95/p99 targets (budget the tail, p95/p99, not the average — users feel the tail). Distinguish total time from time-to-first-token for streamed responses.
- Cost — a cost-per-request ceiling, and/or a daily/monthly spend cap for the feature.
- Scope — which endpoint/feature/model this budget covers, since different routes warrant different budgets.
If the user didn't give numbers, propose defaults based on the feature's UX (interactive vs. batch) and the current measured baseline, and state them explicitly.
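One way to make the pinned numbers explicit is to hold them in a small, validated record that the later steps can read. The sketch below is illustrative only: the `PerfBudget` class and all of its field names are hypothetical, the p95 and cost values come from the example in the argument hint, and the p99 value is an assumed placeholder.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PerfBudget:
    """Ceilings for one endpoint. Class and field names are illustrative."""
    scope: str                                   # endpoint/feature/model covered
    p95_latency_s: float                         # total-time ceiling at p95
    p99_latency_s: float                         # total-time ceiling at p99
    max_cost_per_request_usd: float              # per-request cost ceiling
    p95_ttft_s: Optional[float] = None           # time-to-first-token, streamed routes
    daily_spend_cap_usd: Optional[float] = None  # optional spend cap for the feature

    def __post_init__(self) -> None:
        # Reject budgets that cannot be meaningfully enforced.
        if self.p95_latency_s <= 0 or self.p99_latency_s <= 0:
            raise ValueError("latency ceilings must be positive")
        if self.p99_latency_s < self.p95_latency_s:
            raise ValueError("p99 ceiling cannot be tighter than p95")
        if self.max_cost_per_request_usd <= 0:
            raise ValueError("cost ceiling must be positive")


# p95 and cost mirror the 'chat API, p95 < 2s, < $0.02/req' example;
# the p99 ceiling is an assumed value for illustration.
chat_budget = PerfBudget(
    scope="chat API",
    p95_latency_s=2.0,
    p99_latency_s=4.0,
    max_cost_per_request_usd=0.02,
)
```

Keeping one record per route matches the point that different routes warrant different budgets.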
Step 2 — Establish the baseline
Measure current cost and latency against representative traffic — real prompt/response sizes and concurrency, not a single warm request. Pull from existing observability/traces (Helicone, Portkey, or your logs) where available. Report p50/p95/p99 and cost-per-request as they stand, so the budget is grounded in reality and you know the gap.
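As a rough sketch of the baseline report, the function below computes p50/p95/p99 latency and cost-per-request from a list of recorded requests. It assumes the samples have already been exported from traces or logs as `(latency_s, cost_usd)` pairs; that shape, and the function names, are hypothetical. Percentiles use the nearest-rank method, which is one of several common definitions.

```python
import math
from statistics import mean
from typing import Dict, List, Sequence, Tuple


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def baseline(samples: List[Tuple[float, float]]) -> Dict[str, float]:
    """Summarize (latency_s, cost_usd) samples taken from representative traffic."""
    latencies = [latency for latency, _ in samples]
    costs = [cost for _, cost in samples]
    return {
        "p50_s": percentile(latencies, 50),
        "p95_s": percentile(latencies, 95),
        "p99_s": percentile(latencies, 99),
        "mean_cost_usd": mean(costs),
        "p95_cost_usd": percentile(costs, 95),
    }
```

Comparing each value with the ceiling from Step 1 gives the gap the budget has to close.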
Step 3 — Instrument the metrics
Ensure the numbers are actually captured per request: latency (and time-to-first-token), input/output tokens, and computed cost. If instrumentation is missing, add the minimal measurement needed — you can't enforce a budget you don't record.
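A minimal sketch of per-request capture for a streamed response follows. It is not tied to any provider SDK: `chunks` stands in for whatever iterator of text chunks your client returns, `count_tokens` is a hypothetical tokenizer callback, and the per-million-token prices are parameters you would supply yourself. Where the provider reports token usage directly, those counts are likely more accurate than a local estimate.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Optional, Tuple


@dataclass
class RequestMetrics:
    latency_s: float
    ttft_s: Optional[float]  # None if the stream produced no chunks
    input_tokens: int
    output_tokens: int
    cost_usd: float


def request_cost(input_tokens: int, output_tokens: int,
                 usd_per_mtok_in: float, usd_per_mtok_out: float) -> float:
    """Cost from token counts and per-million-token prices."""
    return (input_tokens * usd_per_mtok_in
            + output_tokens * usd_per_mtok_out) / 1_000_000


def measure_stream(chunks: Iterable[str], input_tokens: int,
                   count_tokens: Callable[[str], int],
                   usd_per_mtok_in: float, usd_per_mtok_out: float,
                   clock: Callable[[], float] = time.perf_counter,
                   ) -> Tuple[str, RequestMetrics]:
    """Consume a chunk stream, recording total latency and time-to-first-token."""
    start = clock()
    first_chunk_at: Optional[float] = None
    parts = []
    for chunk in chunks:
        if first_chunk_at is None:
            first_chunk_at = clock()  # the moment the first chunk arrived
        parts.append(chunk)
    end = clock()
    text = "".join(parts)
    output_tokens = count_tokens(text)
    metrics = RequestMetrics(
        latency_s=end - start,
        ttft_s=None if first_chunk_at is None else first_chunk_at - start,
        input_tokens=input_tokens,
        output_tokens=output_tokens,
        cost_usd=request_cost(input_tokens, output_tokens,
                              usd_per_mtok_in, usd_per_mtok_out),
    )
    return text, metrics
```

The injectable `clock` exists only so the timing logic can be tested deterministically; in production the default `time.perf_counter` is used.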
Step 4 — Wire the enforcement
Make the budget fail loudly when breached, at the right gate:
- CI / pre-merge — a latency/cost regression test over a representative sample that fails the build when p95 or cost-per-request exceeds the ceiling.
- Runtime — alerts or guardrails on p95/p99 and on the daily/monthly spend cap (gateway budgets and rate limits can hard-stop runaway cost).
Pick the gate that matches the risk: regression-prone code → CI; runaway-spend risk → runtime caps.
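The CI gate can be sketched as a check over a recorded sample that yields a non-zero exit code on any breach. The function names here are hypothetical, and one interpretation is assumed: "cost-per-request" is enforced against the most expensive request in the sample, so a single outlier fails the build. A p95 cost check would be a looser alternative.

```python
import math
from typing import List, Sequence


def nearest_rank(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile over a non-empty sample."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def check_budget(latencies_s: Sequence[float], costs_usd: Sequence[float],
                 p95_ceiling_s: float, cost_ceiling_usd: float) -> List[str]:
    """Return one message per breached ceiling; an empty list means within budget."""
    if not latencies_s or not costs_usd:
        # An unmeasured budget should not pass silently.
        return ["no samples: refusing to pass an unmeasured budget"]
    breaches = []
    observed_p95 = nearest_rank(latencies_s, 95)
    if observed_p95 > p95_ceiling_s:
        breaches.append(
            f"p95 latency {observed_p95:.2f}s exceeds ceiling {p95_ceiling_s:.2f}s")
    worst_cost = max(costs_usd)  # assumption: enforce against the worst request
    if worst_cost > cost_ceiling_usd:
        breaches.append(
            f"cost per request ${worst_cost:.4f} exceeds ceiling ${cost_ceiling_usd:.4f}")
    return breaches


def exit_code(breaches: List[str]) -> int:
    """Print each breach and return a process exit code (1 fails the build)."""
    for message in breaches:
        print(f"BUDGET BREACH: {message}")
    return 1 if breaches else 0
```

In a pipeline, a script would typically end with `sys.exit(exit_code(check_budget(...)))`, or a test would assert that `check_budget(...)` returns an empty list.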
Step 5 — Document the budget
Record the ceilings, where they're enforced, the current baseline vs. target, and what to do on a breach (hand off to the llm-cost-optimizer agent). A budget that lives only in someone's head isn't enforced.
WARNING
Budget the tail, not the mean. An average latency under target hides the p99 requests that make users churn — and an average cost hides the expensive outlier prompts that dominate the bill. Set and enforce p95/p99 and per-request ceilings, not just the average.
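A small synthetic example shows how the mean can pass while the tail fails. The numbers are made up for illustration: 94 requests at 0.8s and 6 at 12s, checked against an assumed 2s target.

```python
import math
from statistics import mean

TARGET_S = 2.0  # illustrative latency target

# Synthetic sample: 94 fast requests and 6 slow ones.
latencies_s = [0.8] * 94 + [12.0] * 6


def nearest_rank(values, pct):
    """Nearest-rank percentile over a non-empty sample."""
    ordered = sorted(values)
    return ordered[max(1, math.ceil(pct * len(ordered) / 100)) - 1]


mean_s = mean(latencies_s)             # about 1.47s, under the target
p95_s = nearest_rank(latencies_s, 95)  # 12.0s, far over the target
p99_s = nearest_rank(latencies_s, 99)  # 12.0s

mean_ok = mean_s <= TARGET_S  # True: the average looks healthy
tail_ok = p95_s <= TARGET_S   # False: the tail breaches the budget
```

A check written against `mean_s` would pass this sample even though six requests in every hundred take twelve seconds.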
Related
- LLM Cost and Latency Engineering: Caching, Right-Sizing, and p95 Budgets
  A practical playbook for cutting LLM cost and tail latency (caching, model right-sizing, prompt trimming, and enforced p95 budgets) without losing quality.
- LLM Cost Optimizer
  Use this agent to cut the cost and latency of an application's LLM API usage without losing quality: audit where the tokens and dollars go, then apply caching, model right-sizing, prompt trimming, batching, and budgets, proven against an eval bar. Examples: "our OpenAI bill tripled, find where the spend is and cut it", "this endpoint's p95 is 8s, bring it down", "right-size models per task and add prompt caching to our chat feature".
- Prompt Cache Optimizer
  Restructure an LLM call to maximize prompt-cache hit rate and add response/semantic caching: move the stable prefix (system prompt, instructions, few-shot, context) to the front and variable input to the end, set cache breakpoints, and measure the hit rate and savings. Use when repeated calls share large common context and token cost or latency is too high.
- LLM Gateways Compared: Portkey vs Helicone vs LiteLLM for Caching & Cost Control
  How Portkey, Helicone, and LiteLLM compare for caching, cost control, and observability: each one's 2026 status and which fits self-hosted vs. hosted.