Prompt Caching
Prompt caching reuses the computed state of a repeated prompt prefix across requests — dramatically cutting cost and time-to-first-token for stable context.
Prompt caching is the reuse of computation for a repeated prompt prefix across API requests: the provider stores the model's processed state for the stable beginning of your prompt, so subsequent requests pay full price only for what's new.
It exploits how inference works: processing a prompt builds an internal KV cache, and if the next request begins with an identical prefix, that state can be reused. Providers expose this as discounted pricing on cached input tokens and a shorter time-to-first-token. For applications with heavy stable context (long system prompts, tool schemas, agent scaffolding, documents queried repeatedly), it is often the single biggest cost lever available, which is why agentic tools like Claude Code lean on it constantly.
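To make the discount concrete, here is a minimal arithmetic sketch. The function name, token counts, base price, and cached-token multiplier are all hypothetical placeholders rather than any provider's actual rates, and the sketch ignores any surcharge a provider may apply when a cache entry is first written.

```python
def input_cost(prefix_tokens: int, new_tokens: int,
               price_per_mtok: float, cached_multiplier: float,
               cache_hit: bool) -> float:
    """Input-token cost of one request, in the currency of price_per_mtok."""
    # On a cache hit, only the stable prefix is billed at the discounted rate.
    prefix_rate = price_per_mtok * (cached_multiplier if cache_hit else 1.0)
    return (prefix_tokens * prefix_rate + new_tokens * price_per_mtok) / 1_000_000


# Hypothetical numbers: a 20,000-token stable prefix, 500 new tokens per request,
# 3.00 per million input tokens, cached tokens billed at 10% of the base rate.
cold = input_cost(20_000, 500, 3.00, 0.10, cache_hit=False)
warm = input_cost(20_000, 500, 3.00, 0.10, cache_hit=True)
print(f"cold={cold:.4f} warm={warm:.4f} savings={1 - warm / cold:.0%}")
```

Under these assumed numbers the saving comes almost entirely from the prefix, so the larger the stable share of the prompt, the more a cache hit is worth.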
The engineering is all prefix discipline: stable content first, volatile content last, byte-exact consistency (no timestamps, no reordered JSON keys upstream of the cache point), and awareness of the cache's time-to-live (TTL) so steady traffic keeps caches warm. Restructuring a call for maximum hit rate is precisely what the prompt-cache-optimizer skill does, inside the broader cost and latency playbook.
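The prefix discipline described above can be sketched as follows. This is an illustrative, string-level example only: `build_prompt` and `prefix_fingerprint` are hypothetical helpers, and real provider APIs take structured messages rather than one concatenated string. The point it shows is that canonical JSON serialization keeps the prefix byte-identical even when upstream code reorders keys.

```python
import hashlib
import json


def build_prompt(system_prompt: str, tools: list[dict], user_message: str) -> tuple[str, str]:
    """Return (stable_prefix, full_prompt) with volatile content placed last."""
    # Canonical JSON: sorted keys and fixed separators give identical bytes
    # regardless of the key order the caller happened to use.
    tools_json = json.dumps(tools, sort_keys=True, separators=(",", ":"))
    stable_prefix = f"{system_prompt}\n{tools_json}\n"
    return stable_prefix, stable_prefix + user_message


def prefix_fingerprint(prefix: str) -> str:
    """Hash of the prefix bytes; logging it helps spot accidental prefix drift."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()


tools_a = [{"name": "search", "description": "Search the docs"}]
tools_b = [{"description": "Search the docs", "name": "search"}]  # same data, reordered keys

prefix_a, _ = build_prompt("You are a support agent.", tools_a, "How do I reset my password?")
prefix_b, _ = build_prompt("You are a support agent.", tools_b, "What is the refund policy?")
print(prefix_fingerprint(prefix_a) == prefix_fingerprint(prefix_b))
```

Logging the fingerprint alongside each request is one possible way to notice when a deploy or a config change has silently altered the prefix.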
Frequently asked questions
- What actually gets cached in prompt caching?
  The model's internal computed state (the KV cache) for a prefix of your prompt, not the text and not the response. When the next request starts with the exact same prefix, the provider skips recomputing it and starts where the cache ends. Cached input tokens are billed at a steep discount and skip most of the prompt-processing time.
- Why does prompt structure matter for cache hits?
  Because caching is prefix-based and exact: everything before the first changed byte is cacheable, and everything after it is not. Put the stable parts first (system prompt, tool definitions, reference docs) and the variable parts (user question, latest messages) last. One timestamp or shuffled field at the top invalidates everything beneath it.
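The prefix rule from the FAQ can be checked with a small sketch that measures how many leading bytes two prompts share. `shared_prefix_len` and the prompt strings are hypothetical; real providers match on tokens, often in fixed-size blocks or at explicit breakpoints, so the byte count here is best read as an upper bound on what could be reused.

```python
def shared_prefix_len(a: bytes, b: bytes) -> int:
    """Number of leading bytes that two prompts have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


system = "You are a helpful assistant. Answer from the reference docs.\n"

# Volatile timestamp first: the prompts diverge inside the first line.
bad_1 = f"Now: 2026-01-01T10:00:00\n{system}Question: A".encode()
bad_2 = f"Now: 2026-01-01T10:05:00\n{system}Question: B".encode()

# Stable content first: the whole system prompt is shared.
good_1 = f"{system}Now: 2026-01-01T10:00:00\nQuestion: A".encode()
good_2 = f"{system}Now: 2026-01-01T10:05:00\nQuestion: B".encode()

print(shared_prefix_len(bad_1, bad_2), shared_prefix_len(good_1, good_2))
```

With the timestamp on top, the shared prefix ends before the system prompt even begins; with the timestamp moved below it, the entire system prompt remains reusable.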
Related
- Prompt Cache Optimizer
  Restructure an LLM call to maximize prompt-cache hit rate and add response/semantic caching: move the stable prefix (system prompt, instructions, few-shot, context) to the front and variable input to the end, set cache breakpoints, and measure the hit rate and savings. Use when repeated calls share large common context and token cost or latency is too high.
- LLM Cost and Latency Engineering: Caching, Right-Sizing, and p95 Budgets
  A practical playbook for cutting LLM cost and tail latency (caching, model right-sizing, prompt trimming, and enforced p95 budgets) without losing quality.
- KV Cache
  The KV cache stores each token's attention keys and values so an LLM doesn't recompute the whole context per new token; it is the memory that makes generation fast.
- Context Window
  The context window is the maximum text, measured in tokens, an LLM can consider at once: prompt, conversation, documents, and its own output combined.
- System Prompt
  The system prompt is the standing instruction layer an LLM receives before user input, defining its role, rules, tools, and tone for the whole conversation.
- LLM API Pricing in 2026: Every Major Model Compared
  Per-million-token prices for Claude, GPT, Gemini, DeepSeek, Mistral, and Grok, plus caching and batch discounts, verified against vendor pricing pages.
- LLM Context Windows Compared (2026)
  Context windows and max output tokens across Claude, GPT, Gemini, DeepSeek, and Grok: the million-token era, what it costs, and what fits in practice.
- RAG vs Long Context: Do Million-Token Windows Kill Retrieval?
  Million-token context windows promised the end of RAG. The honest 2026 answer: long context changed where retrieval starts paying, not whether it does.
- Batch Inference
  Batch inference processes many LLM requests asynchronously instead of one at a time interactively, typically at a ~50% discount via provider batch APIs.
- Token (LLM)
  A token is the unit LLMs read and write: a word fragment of roughly 3–4 characters in English. Models are priced, limited, and measured in tokens, not words.