# LLM Cost and Latency Engineering: Caching, Right-Sizing, and p95 Budgets

> A practical playbook for cutting LLM cost and tail latency — caching, model right-sizing, prompt trimming, and enforced p95 budgets — without losing quality.

LLM cost and latency are usually concentrated in a few prompts, routes, and model choices — so measure first, then cut where it pays. The levers: prompt caching for stable prefixes, response/semantic caching for repeated queries, per-task model right-sizing, token trimming, streaming for perceived speed, and enforced p95 and cost budgets — always against an eval bar so cheaper never means worse.

LLM cost and tail latency feel like vague, ever-growing problems, but they almost never are: they're **concentrated**. A handful of prompts, a couple of routes, and one or two model choices usually account for most of the bill and most of the slow requests. So the discipline is the same as any performance work — measure and attribute first, then cut where it pays, and prove you didn't break quality doing it. This is the playbook.

## Measure before you cut

You can't optimize what you can't see. Attribute cost and latency to specific calls: input vs. output tokens, calls per feature, p50/p95/p99, and dollars per request. Observability tooling ([Helicone](/tools/helicone), [Portkey](/tools/portkey), or your own traces) turns "the bill is too high" into "these three prompts are 70% of spend." Without that, every change is a guess.
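
As a rough illustration of that attribution step, the sketch below rolls raw call records up into per-feature spend share, cost per request, and latency percentiles. The `CallRecord` shape, the feature names, and the prices in `PRICE_PER_MTOK` are all hypothetical; substitute whatever your traces and your provider's price sheet actually contain.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class CallRecord:
    feature: str        # which product feature or route made the call
    input_tokens: int
    output_tokens: int
    latency_ms: float


# Hypothetical prices in dollars per million tokens (not any real provider's rates).
PRICE_PER_MTOK = {"input": 3.00, "output": 15.00}


def call_cost(rec: CallRecord) -> float:
    """Dollar cost of one call, split by input and output token price."""
    return (
        rec.input_tokens * PRICE_PER_MTOK["input"]
        + rec.output_tokens * PRICE_PER_MTOK["output"]
    ) / 1_000_000


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def attribute(records):
    """Group calls by feature and report spend share and tail latency."""
    by_feature = defaultdict(list)
    for rec in records:
        by_feature[rec.feature].append(rec)

    total = sum(call_cost(r) for r in records)
    report = {}
    for feature, recs in by_feature.items():
        cost = sum(call_cost(r) for r in recs)
        latencies = [r.latency_ms for r in recs]
        report[feature] = {
            "calls": len(recs),
            "cost_usd": cost,
            "share_of_spend": cost / total if total else 0.0,
            "cost_per_request_usd": cost / len(recs),
            "p50_ms": percentile(latencies, 50),
            "p95_ms": percentile(latencies, 95),
            "p99_ms": percentile(latencies, 99),
        }
    return report
```

Sorting the report by `share_of_spend` is usually enough to show which few features deserve attention first.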

## The levers, in order of leverage

### Caching — usually the biggest win

When calls share context, caching beats everything else. Two kinds:

- **Provider prompt caching** discounts a repeated *prefix* — the stable system prompt, instructions, few-shot, or long context that's identical across calls. Order the prompt **static-first** so the cacheable prefix is as long as possible (the [prompt-cache-optimizer](/skills/performance/prompt-cache-optimizer) skill does exactly this).
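- **Response / semantic caching** serves a stored answer for an exact-repeat (or, semantically, a near-duplicate) query, skipping the model entirely. Scope the key to everything that changes the answer (model, prompt version, and any per-user or per-tenant context), and set the TTL to match how quickly the underlying data goes stale. A cache that serves a stale or wrong answer is a correctness bug.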
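
A minimal sketch of the response-caching half, assuming an in-memory store and exact matching after light normalization (not semantic matching, which would need an embedding lookup). The class and function names, the key fields (`model`, `prompt_version`, `tenant_id`), and the `call_model` callable are illustrative assumptions, not any particular gateway's API.

```python
import hashlib
import json
import time


class ResponseCache:
    """In-memory exact-match cache with a per-entry TTL."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock      # injectable so expiry can be tested
        self._store = {}        # key -> (expires_at, value)

    @staticmethod
    def make_key(model, prompt_version, tenant_id, query):
        # Scope the key: a different model, prompt version, or tenant must
        # never share an entry, even when the query text is identical.
        normalized = " ".join(query.lower().split())
        payload = json.dumps([model, prompt_version, tenant_id, normalized])
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._store[key]    # expired: treat as a miss
            return None
        return value

    def put(self, key, value):
        self._store[key] = (self.clock() + self.ttl, value)


def cached_answer(cache, call_model, *, model, prompt_version, tenant_id, query):
    """Return (answer, was_cache_hit); call the model only on a miss."""
    key = cache.make_key(model, prompt_version, tenant_id, query)
    hit = cache.get(key)
    if hit is not None:
        return hit, True
    answer = call_model(query)
    cache.put(key, answer)
    return answer, False
```

Putting the prompt version in the key means a prompt change invalidates old answers automatically rather than relying on the TTL to flush them.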

### Right-sizing — stop overpaying per request

Most requests don't need the frontier model. Route the easy, structured, or high-volume majority to a smaller, cheaper, faster model and reserve the strongest model for the hard slice — a **cascade** or **router**. Validate each downshift against an eval set; "cheaper" that drops accuracy isn't cheaper once you count the retries and bad outputs. See [Choosing the Right Model](/guides/getting-started/choosing-the-right-model).
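
One way to picture a cascade is the sketch below: try tiers cheapest-first and escalate only when a check rejects the answer. The `Tier` type, the per-call costs, and the `accept` callback are assumptions made for illustration; in practice `accept` might be a schema validator, a confidence score, or a cheap grader, and its threshold should come from your eval set.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Tier:
    name: str
    call: Callable[[str], str]   # stand-in for a real model client
    cost_usd: float              # hypothetical flat per-call cost


def cascade(prompt, tiers, accept):
    """Try tiers cheapest-first; escalate only when `accept` rejects the answer.

    The last tier's answer is always returned, since there is nothing
    stronger left to escalate to.
    """
    if not tiers:
        raise ValueError("cascade needs at least one tier")
    spent = 0.0
    for index, tier in enumerate(tiers):
        answer = tier.call(prompt)
        spent += tier.cost_usd       # rejected attempts still cost money
        is_last = index == len(tiers) - 1
        if is_last or accept(prompt, answer):
            return {"answer": answer, "tier": tier.name, "cost_usd": spent}
```

Note that an escalated request pays for both attempts, which is why the cascade only saves money when the cheap tier's acceptance rate is high.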

### Token trimming — pay less on every call

Input tokens are billed on every call, so a bloated prompt is a recurring tax. Shorten verbose system prompts, prune low-value few-shot examples, and stop shipping context the task never uses. On the output side, cap `max_tokens` so a runaway completion can't inflate cost or latency. Small per-call savings compound quickly across real traffic.
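
Pruning few-shot examples can be made mechanical. The sketch below keeps the highest-value examples that fit an input-token budget. The four-characters-per-token estimate is a crude assumption (use your provider's tokenizer for real counts), and the per-example value scores are hypothetical; they would normally come from an ablation on your eval set.

```python
def estimate_tokens(text: str) -> int:
    """Very rough token estimate (about 4 characters per token); an assumption."""
    return max(1, len(text) // 4)


def build_prompt(system, examples, user_input, input_budget):
    """Assemble a prompt that fits `input_budget` estimated tokens.

    `examples` is a list of (value_score, text) pairs. The system prompt and
    user input are always kept; examples are added highest-value first and
    skipped when they no longer fit.
    """
    remaining = input_budget - estimate_tokens(system) - estimate_tokens(user_input)
    if remaining < 0:
        raise ValueError("system prompt and user input alone exceed the budget")

    kept = []
    for _score, text in sorted(examples, key=lambda e: e[0], reverse=True):
        cost = estimate_tokens(text)
        if cost <= remaining:
            kept.append(text)
            remaining -= cost
    # Static parts first, variable user input last.
    return "\n\n".join([system, *kept, user_input])
```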

### Perceived latency — fix what the user feels

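What users feel is mostly the wait before anything appears, not the total completion time. **Stream** tokens so output renders progressively instead of after a long blocking wait, **parallelize** independent calls so they overlap rather than queue, and set **timeouts** so a hung call fails fast instead of stalling the whole request. Streaming doesn't make the request finish sooner; it makes it *feel* fast, which is often what actually matters.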
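
For the parallelize-and-timeout part, a minimal `asyncio` sketch might look like this. `coro_fn` stands in for whatever async model client you use, and returning a `fallback` value on timeout is one possible policy among several (retrying or raising are equally valid).

```python
import asyncio


async def call_with_timeout(coro_fn, arg, timeout_s, fallback=None):
    """Await one call, returning `fallback` if it exceeds `timeout_s`."""
    try:
        return await asyncio.wait_for(coro_fn(arg), timeout=timeout_s)
    except asyncio.TimeoutError:
        return fallback


async def fan_out(coro_fn, args, timeout_s):
    """Run independent calls concurrently; results keep the order of `args`."""
    tasks = (call_with_timeout(coro_fn, a, timeout_s) for a in args)
    return await asyncio.gather(*tasks)
```

With this shape, total wall time is bounded by the slowest call or the timeout, whichever is smaller, instead of the sum of all calls.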

## Budget the tail, then enforce it

> [!WARNING]
> Budget p95/p99 and cost-per-request, **not the average**. An average latency under target hides the slow requests that make users churn, and an average cost hides the outlier prompts that dominate the bill. Set explicit ceilings and make them fail loudly — a CI regression test, a runtime alert, or gateway budgets and rate limits that hard-stop runaway spend. The [set-perf-budget](/commands/perf/set-perf-budget) command scaffolds this.
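
A budget check of this kind can be as small as the sketch below, which could run as a CI test over a recorded load sample. The function names and thresholds are illustrative assumptions and are not what the linked command generates. The cost ceiling here is applied to the single most expensive request, which is one possible choice.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def check_budgets(latencies_ms, costs_usd, *, p95_ms, p99_ms, max_cost_usd):
    """Return a list of human-readable budget violations (empty means pass)."""
    violations = []
    observed_p95 = percentile(latencies_ms, 95)
    observed_p99 = percentile(latencies_ms, 99)
    worst_cost = max(costs_usd)
    if observed_p95 > p95_ms:
        violations.append(f"p95 {observed_p95:.0f} ms exceeds budget {p95_ms:.0f} ms")
    if observed_p99 > p99_ms:
        violations.append(f"p99 {observed_p99:.0f} ms exceeds budget {p99_ms:.0f} ms")
    if worst_cost > max_cost_usd:
        violations.append(
            f"worst request ${worst_cost:.4f} exceeds ceiling ${max_cost_usd:.4f}"
        )
    return violations


def enforce(latencies_ms, costs_usd, **budgets):
    """Fail loudly: raise if any budget is violated."""
    violations = check_budgets(latencies_ms, costs_usd, **budgets)
    if violations:
        raise AssertionError("; ".join(violations))
```

A sample with ninety requests at 100 ms and ten at 3000 ms averages 390 ms, which looks healthy against a 500 ms target, yet its p95 is 3000 ms. That gap is exactly what an average-based budget hides.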

## Never trade cost for quality blind

Every cut — a smaller model, a shorter prompt, an aggressive cache TTL — is a hypothesis about quality. Re-run your eval set after each change and report the **cost and latency delta together with the quality delta**. A system that's 60% cheaper and quietly less accurate is a regression you shipped on a spreadsheet.
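
Reporting the deltas together can be sketched as a small comparison that refuses to call a change a win unless quality holds. The `RunStats` fields and the 0.01 accuracy tolerance are assumptions for illustration; your eval metric and acceptable drop will differ.

```python
from dataclasses import dataclass


@dataclass
class RunStats:
    accuracy: float               # eval-set score between 0 and 1
    cost_per_request_usd: float
    p95_ms: float


def pct_change(before: float, after: float) -> float:
    """Relative change in percent; negative means the candidate is lower."""
    return (after - before) / before * 100


def compare(baseline: RunStats, candidate: RunStats, max_accuracy_drop: float = 0.01):
    """Report cost, latency, and quality deltas side by side, with a ship gate."""
    accuracy_delta = candidate.accuracy - baseline.accuracy
    return {
        "cost_delta_pct": pct_change(
            baseline.cost_per_request_usd, candidate.cost_per_request_usd
        ),
        "p95_delta_pct": pct_change(baseline.p95_ms, candidate.p95_ms),
        "accuracy_delta": accuracy_delta,
        # A cheaper candidate only ships if quality stays within tolerance.
        "ship": accuracy_delta >= -max_accuracy_drop,
    }
```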

## Putting it together

Measure and attribute → cache the repeats → right-size per task → trim tokens → fix perceived latency → set and enforce budgets → re-verify quality. The [llm-cost-optimizer](/agents/data-ai/llm-cost-optimizer) agent runs this loop end-to-end. The single biggest structural decision is *where* these levers live: doing them per-app is fine at small scale, but a **gateway** centralizes caching, routing, and budgets across all your traffic — compare the options in [LLM Gateways Compared](/guides/advanced/llm-gateways-compared), and see [Calling Any Model](/guides/concepts/calling-any-model-gateways) for the unified-access layer underneath.

---

_Source: https://agentscamp.com/guides/advanced/llm-cost-latency-engineering — Guide on AgentsCamp._
