KV Cache
The KV cache stores each token's attention keys and values so an LLM doesn't recompute the whole context per new token — the memory that makes generation fast.
The KV cache is the stored attention state — the key and value vectors for every processed token — that lets a transformer generate each new token by attending over cached history instead of recomputing the entire context from scratch.
Without it, producing token N would mean re-running the model over all N−1 prior tokens, so the total work for a generation would grow quadratically with its length. With it, inference splits cleanly into two phases: prefill computes the prompt's keys and values once, then each decode step computes attention for one new token against the cache. The trade is memory: KV state grows linearly with context length and with the number of concurrent sequences, and it scales with the model's layer count and key/value width. That is why long-context serving tends to run out of VRAM before it runs out of compute, why serving engines like vLLM build their architecture around KV memory management, and why KV quantization and eviction schemes are an active frontier.
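To make the prefill/decode split concrete, here is a minimal sketch of a single attention head in pure Python. It is a toy, not how any real engine is implemented: the dimension `D`, the random projection matrices, the `KVCache` class, and the helper names are all assumptions made for illustration. It shows that attending over cached keys and values gives the same output as recomputing them for the whole context at every step.

```python
import math
import random

D = 4  # toy head dimension (illustrative assumption)

def matvec(W, x):
    """Multiply a D x D matrix (list of rows) by a length-D vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def attend(q, keys, values):
    """Scaled dot-product attention of one query over the given keys/values."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(D) for k in keys]
    m = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(D)]

rng = random.Random(0)
# Hypothetical query/key/value projections and toy token embeddings.
Wq, Wk, Wv = ([[rng.uniform(-1, 1) for _ in range(D)] for _ in range(D)] for _ in range(3))
tokens = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(6)]

def output_without_cache(n):
    """Recompute K and V for tokens 0..n from scratch, then attend from token n."""
    keys = [matvec(Wk, x) for x in tokens[: n + 1]]
    values = [matvec(Wv, x) for x in tokens[: n + 1]]
    return attend(matvec(Wq, tokens[n]), keys, values)

class KVCache:
    """Stores one key and one value vector per processed token."""
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, x):
        """Project only the new token, append its K/V, attend over the cache."""
        self.keys.append(matvec(Wk, x))
        self.values.append(matvec(Wv, x))
        return attend(matvec(Wq, x), self.keys, self.values)

cache = KVCache()
cached_outputs = [cache.step(x) for x in tokens]
recomputed_outputs = [output_without_cache(n) for n in range(len(tokens))]

# Both paths agree; the cached path projected each token exactly once.
for a, b in zip(cached_outputs, recomputed_outputs):
    assert all(abs(x - y) < 1e-9 for x, y in zip(a, b))
```

In this sketch the uncached path performs 1 + 2 + ... + 6 key/value projections across the six steps, while the cached path performs six. That gap is the quadratic-versus-linear difference described above.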
Two product-level features sit on top of it: prompt caching persists prefix KV state across requests for cost savings, and speculative decoding exploits cheap cache-backed verification to accept multiple drafted tokens per step.
Frequently asked questions
- Why does the KV cache matter for serving costs?
- Because it's the VRAM your context occupies. Every token in flight holds its keys and values in GPU memory, growing linearly with context length and concurrent requests — long-context serving is usually KV-memory-bound before it's compute-bound. Engines like vLLM exist largely to manage this memory well (PagedAttention), and tricks like KV quantization stretch it.
- Is prompt caching the same as the KV cache?
- Prompt caching is built on it. The KV cache is the in-memory structure during a single generation; provider-level prompt caching persists the KV state of a stable prompt prefix between requests, so repeated context skips recomputation entirely. One is mechanism, the other is product.
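For the serving-cost question, a back-of-the-envelope estimate shows how quickly KV memory adds up. The sketch below assumes a standard layout with one key tensor and one value tensor per layer. The function name `kv_cache_bytes` and the model shape (32 layers, 8 KV heads, head dimension 128, 16-bit values) are hypothetical numbers chosen for illustration, not the specification of any particular model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_value=2):
    """Estimate KV cache size in bytes for a standard transformer layout.

    The leading factor of 2 counts one key tensor plus one value tensor
    per layer. bytes_per_value=2 assumes 16-bit storage (fp16/bf16).
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)

# Hypothetical model shape, for illustration only.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

per_token = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, seq_len=1, batch_size=1)
total = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, seq_len=8192, batch_size=16)

print(f"per token: {per_token / 1024:.0f} KiB")          # per token: 128 KiB
print(f"16 x 8192 tokens: {total / 1024**3:.0f} GiB")    # 16 x 8192 tokens: 16 GiB
```

Under these assumptions, 16 concurrent requests at 8,192 tokens each hold 16 GiB of KV state on top of the model weights. Halving `bytes_per_value`, as 8-bit KV quantization does, halves that figure.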
Related
- Inference: Inference is running a trained model to produce output; for LLMs, that means generating tokens one at a time. Its cost and latency define the economics of AI products.
- Prompt Caching: Prompt caching reuses the computed state of a repeated prompt prefix across requests, dramatically cutting cost and time-to-first-token for stable context.
- Context Window: The context window is the maximum text, measured in tokens, that an LLM can consider at once: prompt, conversation, documents, and its own output combined.
- vLLM: A high-throughput, memory-efficient inference and serving engine for LLMs, with PagedAttention, continuous batching, and an OpenAI-compatible API server.
- Speculative Decoding: Speculative decoding speeds up generation. A small draft model proposes tokens and the large model verifies them in one parallel pass, giving the same output in fewer steps.
- vLLM vs Ollama: Local Convenience or Serving Throughput? (2026): vLLM vs Ollama compared, a developer-friendly local runtime against a high-throughput production inference engine. Concurrency, hardware, and when to graduate.