# KV Cache

> The KV cache stores each token's attention keys and values so an LLM doesn't recompute the whole context per new token — the memory that makes generation fast.

**The KV cache is the stored attention state — the key and value vectors for every processed token — that lets a transformer generate each new token by attending over cached history instead of recomputing the entire context from scratch.**

Without it, producing token N would mean reprocessing all N−1 prior tokens — quadratic waste. With it, [inference](/glossary/inference) splits cleanly: prefill computes the prompt's KV once, then each decode step computes one new token's attention against the cache. The trade is memory: KV state grows with context length × batch size, which is why long-[context](/glossary/context-window) serving exhausts VRAM before compute, why serving engines like [vLLM](/tools/vllm) build their architecture around KV memory management, and why KV quantization and eviction schemes are an active frontier.

Two product-level features sit on top of it: [prompt caching](/glossary/prompt-caching) persists prefix KV state across requests for cost savings, and [speculative decoding](/glossary/speculative-decoding) exploits cheap cache-backed verification to accept multiple drafted tokens per step.

---

_Source: https://agentscamp.com/glossary/kv-cache — Term on AgentsCamp._
