LLM Cost and Latency Engineering: Caching, Right-Sizing, and p95 Budgets
A practical playbook for cutting LLM cost and tail latency — caching, model right-sizing, prompt trimming, and enforced p95 budgets — without losing quality.
LLM cost and latency are usually concentrated in a few prompts, routes, and model choices — so measure first, then cut where it pays. The levers: prompt caching for stable prefixes, response/semantic caching for repeated queries, per-task model right-sizing, token trimming, streaming for perceived speed, and enforced p95 and cost budgets — always against an eval bar so cheaper never means worse.
Steps at a glance
- Measure and attribute the spend. Before changing anything, attribute cost and latency to specific calls, prompts, and routes: input vs. output tokens, calls per feature, p50/p95/p99, and dollars per request. Use observability (Helicone, Portkey, or your traces). Optimization without measurement is guessing.
- Cache the repeats. Turn on provider prompt caching for stable prefixes (system prompt, instructions, few-shot, long context), and add response or semantic caching for repeated and near-duplicate queries. Caching is usually the single biggest cost-and-latency win when calls share context.
- Right-size the model per task. Most requests don't need the biggest model. Route easy or structured tasks to a smaller, cheaper, faster model and reserve the frontier model for the hard slice — a cascade or router — checking each task against its eval bar before switching it down.
- Trim the tokens. Shorten verbose system prompts, prune low-value few-shot examples, and stop sending context the model doesn't use. Input tokens are billed on every call, so a trimmed prompt pays back on every request. On the output side, cap max_tokens to what the task needs.
- Cut the latency users feel. Stream tokens so output renders progressively instead of after a long blocking wait, parallelize independent calls, and set timeouts. Separate wall-clock latency from perceived latency: streaming fixes the feel, caching and routing fix the clock.
- Set and enforce budgets. Define p95/p99 latency and cost-per-request ceilings and wire a check that fails when they're breached — a CI regression test, a runtime alert, or gateway budgets and rate limits that hard-stop runaway spend. A budget nobody enforces is a wish.
- Verify quality held. Re-run your eval set after each change and report the cost and latency delta together with the quality delta. A 60%-cheaper system that's less accurate isn't an optimization — only ship the cut if the answers are still right.
Key takeaways
- Cost and latency are concentrated — a few prompts, routes, and model choices dominate. Measure and attribute first; optimize the biggest line items, not the easiest.
- Caching is the highest-leverage lever: provider prompt caching for stable prefixes, plus response/semantic caching for repeated or near-duplicate queries.
- Right-size the model per task — route the easy, structured majority to a smaller, cheaper, faster model and reserve the frontier model for the hard slice.
- Budget the tail, not the mean: enforce p95/p99 latency and cost-per-request ceilings, because averages hide the requests that actually hurt.
- Input tokens are billed on every call, so trimming bloated system prompts, few-shot, and unused context compounds across all your traffic.
- Re-check quality against an eval set after every cut — a cheaper, faster system that's less accurate is a regression, not a win.
LLM cost and tail latency feel like vague, ever-growing problems, but they almost never are: they're concentrated. A handful of prompts, a couple of routes, and one or two model choices usually account for most of the bill and most of the slow requests. So the discipline is the same as any performance work — measure and attribute first, then cut where it pays, and prove you didn't break quality doing it. This is the playbook.
Measure before you cut
You can't optimize what you can't see. Attribute cost and latency to specific calls: input vs. output tokens, calls per feature, p50/p95/p99, and dollars per request. Observability tooling (Helicone, Portkey, or your own traces) turns "the bill is too high" into "these three prompts are 70% of spend." Without that, every change is a guess.
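As a rough illustration of that attribution step, the sketch below groups logged calls by route and reports spend share, dollars per request, and latency percentiles. The `CallRecord` shape, the route names, and the prices in `PRICE_PER_MTOK` are all assumptions made for the example; real numbers would come from your traces and your provider's price sheet.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class CallRecord:
    route: str            # feature or endpoint that made the call (hypothetical field)
    input_tokens: int
    output_tokens: int
    latency_ms: float


# Hypothetical prices in dollars per million tokens; substitute your provider's rates.
PRICE_PER_MTOK = {"input": 3.00, "output": 15.00}


def call_cost(rec):
    """Dollar cost of one call under the assumed prices."""
    return (rec.input_tokens * PRICE_PER_MTOK["input"]
            + rec.output_tokens * PRICE_PER_MTOK["output"]) / 1_000_000


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def attribute(records):
    """Group calls by route; return per-route stats sorted by total cost, highest first."""
    by_route = defaultdict(list)
    for rec in records:
        by_route[rec.route].append(rec)
    total = sum(call_cost(r) for r in records)
    report = {}
    for route, recs in by_route.items():
        cost = sum(call_cost(r) for r in recs)
        latencies = [r.latency_ms for r in recs]
        report[route] = {
            "calls": len(recs),
            "cost_usd": cost,
            "cost_share": cost / total if total else 0.0,
            "usd_per_request": cost / len(recs),
            "p50_ms": percentile(latencies, 50),
            "p95_ms": percentile(latencies, 95),
            "p99_ms": percentile(latencies, 99),
        }
    return dict(sorted(report.items(), key=lambda kv: kv[1]["cost_usd"], reverse=True))
```

Sorting by total cost puts the biggest line items first, which is the order to work through them.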
The levers, in order of leverage
Caching — usually the biggest win
When calls share context, caching beats everything else. Two kinds:
- Provider prompt caching discounts a repeated prefix — the stable system prompt, instructions, few-shot, or long context that's identical across calls. Order the prompt static-first so the cacheable prefix is as long as possible (the prompt-cache-optimizer skill does exactly this).
- Response / semantic caching serves a stored answer for an exact-repeat (or, semantically, a near-duplicate) query, skipping the model entirely. Scope the key and TTL carefully — a cache that serves a stale or wrong answer is a correctness bug.
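To make the advice on key scope and TTL concrete, here is a minimal sketch of an exact-match response cache. The class name, the key fields (`model`, `prompt_version`, `tenant_id`), and the whitespace normalization are assumptions chosen for illustration, not any gateway's actual API; a semantic cache would replace the hash lookup with an embedding similarity search, which this sketch does not cover.

```python
import hashlib
import json
import time


class ResponseCache:
    """Exact-match response cache with a scoped key and a TTL (in-memory sketch)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock      # injectable so expiry can be tested without sleeping
        self._store = {}        # key -> (expires_at, response)
        self.hits = 0
        self.misses = 0

    @staticmethod
    def make_key(model, prompt_version, tenant_id, query):
        # Everything that can change the correct answer belongs in the key.
        # Collapsing whitespace is an assumed normalization; drop it if spacing matters.
        normalized = " ".join(query.split())
        payload = json.dumps([model, prompt_version, tenant_id, normalized])
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_call(self, key, call_model):
        """Return a fresh cached response, or call the model and store the result."""
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            self.hits += 1
            return entry[1]
        self.misses += 1
        response = call_model()     # the model is only paid for on a miss
        self._store[key] = (now + self.ttl_seconds, response)
        return response
```

Including the prompt version and tenant in the key is one way to keep a prompt change or another customer's data from surfacing a stale or wrong answer; the hit and miss counters give the hit rate you would report alongside savings.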
Right-sizing — stop overpaying per request
Most requests don't need the frontier model. Route the easy, structured, or high-volume majority to a smaller, cheaper, faster model and reserve the strongest model for the hard slice — a cascade or router. Validate each downshift against an eval set; "cheaper" that drops accuracy isn't cheaper once you count the retries and bad outputs. See Choosing the Right Model.
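One possible shape for a cascade is sketched below: try the small model, run a cheap acceptance check, and escalate only on failure. The model callables and the JSON-object check are placeholders assumed for the example; in practice the check would be whatever signal your eval work shows is reliable for the task.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass
class CascadeResult:
    output: str
    model_used: str
    escalated: bool


def is_valid_json_object(text: str) -> bool:
    """Example acceptance check: the output must parse as a JSON object."""
    try:
        return isinstance(json.loads(text), dict)
    except json.JSONDecodeError:
        return False


def cascade(prompt: str,
            small_model: Callable[[str], str],
            large_model: Callable[[str], str],
            is_acceptable: Callable[[str], bool]) -> CascadeResult:
    """Try the cheap model first; escalate only when its answer fails the check."""
    draft = small_model(prompt)
    if is_acceptable(draft):
        return CascadeResult(draft, "small", escalated=False)
    return CascadeResult(large_model(prompt), "large", escalated=True)


def escalation_rate(results) -> float:
    """Fraction of requests that needed the large model."""
    return sum(r.escalated for r in results) / len(results)
```

The escalation rate is worth tracking: every escalated request pays for both models, so a cascade only saves money when most traffic is accepted at the first step.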
Token trimming — pay less on every call
Input tokens are billed on every single call, so a bloated prompt is a recurring tax. Shorten verbose system prompts, prune low-value few-shot examples, and stop shipping context the task never uses. On the output side, cap max_tokens to what the task needs: it limits generated tokens, which are billed too and also drive generation time. Small per-call savings compound hard across real traffic.
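A small sketch of pruning few-shot examples to a token budget, plus the arithmetic for how a per-call saving compounds. The four-characters-per-token estimate is a crude assumption (use your provider's tokenizer for real counts), and the per-example value scores are hypothetical, for instance from ablating each example against your eval set.

```python
def estimate_tokens(text: str) -> int:
    # Crude heuristic (about 4 characters per token); an assumption, not a tokenizer.
    return max(1, len(text) // 4)


def trim_few_shot(examples, budget_tokens):
    """Keep the highest-value examples that fit the budget.

    examples: list of (value_score, text) pairs.
    Returns (kept_texts_in_original_order, tokens_used).
    """
    ranked = sorted(enumerate(examples), key=lambda item: item[1][0], reverse=True)
    kept_indexes, used = [], 0
    for index, (_score, text) in ranked:
        cost = estimate_tokens(text)
        if used + cost <= budget_tokens:
            kept_indexes.append(index)
            used += cost
    return [examples[i][1] for i in sorted(kept_indexes)], used


def monthly_input_savings(tokens_saved_per_call, calls_per_month, usd_per_mtok):
    """Dollars saved per month by sending fewer input tokens on every call."""
    return tokens_saved_per_call * calls_per_month * usd_per_mtok / 1_000_000
```

For example, under an assumed price of $3 per million input tokens, trimming 100 tokens from a prompt that runs a million times a month gives `monthly_input_savings(100, 1_000_000, 3.0)`, which is 300.0 dollars.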
Perceived latency — fix what the user feels
Not all latency is equal. Stream tokens so output renders progressively instead of after a long blocking wait, parallelize independent calls, and set timeouts. Streaming doesn't make the request finish sooner — it makes it feel fast, which is often what actually matters.
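The sketch below illustrates both halves with simulated calls: independent requests run concurrently under per-call timeouts, and a streamed response is timed to its first token as well as its last. `fake_llm_call` and `fake_stream` are stand-ins that only sleep; they are assumptions for the demo, not a real client.

```python
import asyncio
import time


async def fake_llm_call(name: str, seconds: float) -> str:
    # Stand-in for a real async client call; the sleep simulates model latency.
    await asyncio.sleep(seconds)
    return f"{name}: done"


async def call_with_timeout(coro, timeout_s: float, fallback: str) -> str:
    """Await a call, returning a fallback instead of hanging past the timeout."""
    try:
        return await asyncio.wait_for(coro, timeout=timeout_s)
    except asyncio.TimeoutError:
        return fallback


async def run_parallel():
    """Run three independent calls concurrently; total time tracks the slowest one."""
    start = time.perf_counter()
    results = await asyncio.gather(
        call_with_timeout(fake_llm_call("summary", 0.2), 1.0, "summary: timed out"),
        call_with_timeout(fake_llm_call("tags", 0.2), 1.0, "tags: timed out"),
        call_with_timeout(fake_llm_call("slow", 5.0), 0.2, "slow: timed out"),
    )
    return results, time.perf_counter() - start


async def fake_stream(tokens, delay_s):
    # Stand-in for a streaming response: one token per delay interval.
    for token in tokens:
        await asyncio.sleep(delay_s)
        yield token


async def measure_stream(tokens, delay_s):
    """Return (full_text, time_to_first_token_s, total_time_s) for a streamed reply."""
    start = time.perf_counter()
    first_token_at = None
    parts = []
    async for token in fake_stream(tokens, delay_s):
        if first_token_at is None:
            first_token_at = time.perf_counter() - start
        parts.append(token)
    return "".join(parts), first_token_at, time.perf_counter() - start
```

Run with `asyncio.run(run_parallel())`, the three calls finish in roughly the time of the slowest allowed one rather than the sum. In `measure_stream`, total time is unchanged by streaming while time to first token is a fraction of it, which is the gap between perceived and wall-clock latency.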
Budget the tail, then enforce it
WARNING
Budget p95/p99 and cost-per-request, not the average. An average latency under target hides the slow requests that make users churn, and an average cost hides the outlier prompts that dominate the bill. Set explicit ceilings and make them fail loudly — a CI regression test, a runtime alert, or gateway budgets and rate limits that hard-stop runaway spend. The set-perf-budget command scaffolds this.
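As one way to make a budget fail loudly, here is a minimal sketch of a check that could run in CI against a sample of recorded traffic. The function names and the example ceilings are assumptions for illustration; this is not what the set-perf-budget command generates.

```python
import math


class BudgetExceeded(AssertionError):
    """Raised when measured traffic breaches a latency or cost ceiling."""


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def check_budget(latencies_ms, costs_usd, *, p95_ms, p99_ms, max_cost_usd):
    """Compare measured tail latency and worst-case cost against ceilings.

    Returns the measured values when within budget; raises BudgetExceeded otherwise.
    """
    measured = {
        "p95_ms": percentile(latencies_ms, 95),
        "p99_ms": percentile(latencies_ms, 99),
        "max_cost_usd": max(costs_usd),
    }
    limits = {"p95_ms": p95_ms, "p99_ms": p99_ms, "max_cost_usd": max_cost_usd}
    breaches = [
        f"{name}={measured[name]} exceeds {limits[name]}"
        for name in limits
        if measured[name] > limits[name]
    ]
    if breaches:
        raise BudgetExceeded("; ".join(breaches))
    return measured
```

A sample of 90 requests at 400 ms and 10 at 4000 ms has a mean of 760 ms, comfortably under a 1000 ms target, yet its p95 is 4000 ms. `check_budget` raises on that sample where an average-based check would pass.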
Never trade cost for quality blind
Every cut — a smaller model, a shorter prompt, an aggressive cache TTL — is a hypothesis about quality. Re-run your eval set after each change and report the cost and latency delta together with the quality delta. A system that's 60% cheaper and quietly less accurate is a regression you shipped on a spreadsheet.
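A sketch of reporting the three deltas together and gating the ship decision on quality might look like the following. The `RunMetrics` fields and the one-point accuracy tolerance are assumptions for the example; the right tolerance depends on your eval set's size and noise.

```python
from dataclasses import dataclass


@dataclass
class RunMetrics:
    accuracy: float           # fraction of eval cases passed
    cost_per_request: float   # dollars
    p95_latency_ms: float


def pct_change(old: float, new: float) -> float:
    """Percentage change from old to new (negative means a reduction)."""
    return 100 * (new - old) / old


def compare(baseline: RunMetrics, candidate: RunMetrics, max_accuracy_drop: float = 0.01):
    """Report cost, latency, and quality deltas together; ship only if quality held."""
    accuracy_delta = candidate.accuracy - baseline.accuracy
    return {
        "accuracy_delta": accuracy_delta,
        "cost_delta_pct": pct_change(baseline.cost_per_request, candidate.cost_per_request),
        "p95_delta_pct": pct_change(baseline.p95_latency_ms, candidate.p95_latency_ms),
        "ship": accuracy_delta >= -max_accuracy_drop,
    }
```

With a baseline of `RunMetrics(0.92, 0.010, 2000)` and a candidate of `RunMetrics(0.85, 0.004, 1200)`, the report shows cost down about 60% and p95 down about 40%, but `ship` is False because accuracy fell seven points.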
Putting it together
Measure and attribute → cache the repeats → right-size per task → trim tokens → fix perceived latency → set and enforce budgets → re-verify quality. The llm-cost-optimizer agent runs this loop end-to-end. The single biggest structural decision is where these levers live: doing them per-app is fine at small scale, but a gateway centralizes caching, routing, and budgets across all your traffic — compare the options in LLM Gateways Compared, and see Calling Any Model for the unified-access layer underneath.
Frequently asked questions
- How do I reduce LLM API costs?
- Measure where the money goes first by attributing cost to specific prompts, routes, and models. Then attack the biggest line items in this order: cache repeated calls (provider prompt caching for stable prefixes, response/semantic caching for repeated queries), right-size the model per task (send the easy majority to a cheaper, smaller model), and trim tokens (shorter system prompts and fewer few-shot examples on the input side, a capped max_tokens on the output side). Each change should be re-checked against an eval set so you don't trade cost for quality. Cost is almost always concentrated, so a few targeted fixes usually recover most of the spend.
- What is prompt caching and how much does it save?
- Prompt caching lets a provider reuse the computation for a repeated prefix of your prompt — the stable system prompt, instructions, few-shot examples, or long context that's identical across calls. When the prefix hits the cache, those input tokens are billed at a steep discount and the response starts faster. Savings depend on how much of your prompt is stable and reused within the cache window, but for workloads with a large fixed preamble (a long system prompt or a shared document) it can cut input-token cost substantially and lower time-to-first-token. The key is ordering the prompt static-first so the cacheable prefix is as long as possible.
- How do I lower LLM latency, especially p95?
- Attack it on two fronts. For perceived latency, stream tokens so output appears immediately and parallelize independent calls. For actual latency, cache (a cache hit is far faster than a model call), right-size to a faster model where quality allows, cut output length, and reduce the input you send. Then budget and monitor the tail (p95/p99), not the average — the slow requests are what users feel and what averages hide. Set a p95 ceiling and enforce it so latency can't regress silently.
- Should I always use the cheapest or smallest model?
- No — match the model to the task. The cheapest model that clears your eval bar for a given task is the right one for that task, but a model too weak for the hard cases will cost you in retries, bad outputs, and downstream errors that dwarf the token savings. The robust pattern is right-sizing per task: route the easy, structured, or high-volume requests to a small fast model and reserve the frontier model for the genuinely hard slice, verifying each routing decision against evals rather than assuming.
- Why measure p95 latency instead of the average?
- Because users experience the tail, not the mean. An average latency comfortably under target can coexist with a p95 or p99 of several seconds: roughly 5% of requests sit at or above the p95 and 1% at or above the p99, and at real traffic volumes those are the requests that make users abandon. The same logic applies to cost: an average cost-per-request hides the expensive outlier prompts that dominate the bill. Budgeting and enforcing p95/p99 and per-request ceilings targets the requests that actually matter, instead of an average that can look fine while the experience is bad.
Related
- LLM Gateways Compared: Portkey vs Helicone vs LiteLLM for Caching & Cost Control
  How Portkey, Helicone, and LiteLLM compare for caching, cost control, and observability, with each one's 2026 status and which fits self-hosted vs. hosted.
- Prompt Cache Optimizer
  Restructure an LLM call to maximize prompt-cache hit rate and add response/semantic caching: move the stable prefix (system prompt, instructions, few-shot, context) to the front and variable input to the end, set cache breakpoints, and measure the hit rate and savings. Use when repeated calls share large common context and token cost or latency is too high.
- LLM Cost Optimizer
  Use this agent to cut the cost and latency of an application's LLM API usage without losing quality: audit where the tokens and dollars go, then apply caching, model right-sizing, prompt trimming, batching, and budgets, proven against an eval bar. Examples: "our OpenAI bill tripled, find where the spend is and cut it", "this endpoint's p95 is 8s, bring it down", "right-size models per task and add prompt caching to our chat feature".
- Set Perf Budget
  Define and enforce a cost and latency budget for an LLM feature or endpoint: set p95/p99 latency and cost-per-request ceilings, instrument to measure them against real traffic, and wire a check that fails when the budget is breached.
- Portkey
  An AI gateway and LLMOps platform: route to many LLMs through one API with caching, retries, fallbacks, load balancing, guardrails, and full observability.
- Helicone
  Open-source LLM observability and AI gateway with one-line integration: logging, tracing, caching, and cost/latency tracking across providers.
- Calling Any Model: Unified LLM Gateways & SDKs in 2026
  Why teams put a unified layer in front of LLM providers, and how LiteLLM, OpenRouter, and the Vercel AI SDK compare for fallback and cost control.
- Choosing the Right Model: Haiku vs Sonnet vs Opus
  How to pick the right Claude model tier for an agent or task.
- Voice Agent Engineer
  Use this agent to build or fix a real-time voice agent: the streaming STT → LLM → TTS pipeline, conversational (mouth-to-ear) latency, turn-taking, barge-in/interruptions, and per-stage provider selection. Examples: "our voice bot feels laggy and talks over people, fix the turn-taking and latency", "build a phone agent that transcribes, answers with our LLM, and speaks back", "get our voice agent's response time under a second".
- How to Build a Voice Agent: The STT → LLM → TTS Pipeline
  How to build a real-time voice agent: the STT → LLM → TTS pipeline, the latency budget that makes or breaks it, and how to wire each stage.