# Speculative Decoding

> Speculative decoding speeds up generation: a small draft model proposes tokens and the large model verifies them in one parallel pass. Same output distribution, fewer large-model steps.

**Speculative decoding accelerates generation by pairing two models: a small, fast draft model proposes a run of tokens, and the large target model verifies them all in a single parallel pass, keeping the longest prefix it agrees with and replacing the first token it rejects with its own.**
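The draft-then-verify round described above can be sketched for the simplest case, greedy decoding. This is a minimal illustration rather than any engine's implementation: `speculative_step`, `generate`, `toy_target`, and `toy_draft` are hypothetical names, and the "models" are plain functions that map a token prefix to the next token. A real engine would score all proposed positions in one batched forward pass; the verification loop here only imitates the result of that pass.

```python
from typing import Callable, List, Tuple

Token = int
# Hypothetical interface: a "model" maps a token prefix to its greedy next token.
NextToken = Callable[[List[Token]], Token]


def speculative_step(prefix: List[Token], draft: NextToken,
                     target: NextToken, k: int) -> List[Token]:
    """Run one draft-then-verify round and return the tokens it commits."""
    # 1. Draft k tokens autoregressively with the cheap model.
    proposed: List[Token] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # 2. Verify: keep proposals while they match the target's own choice.
    committed: List[Token] = []
    for tok in proposed:
        expected = target(prefix + committed)
        if tok != expected:
            committed.append(expected)  # replace the first mismatch and stop
            return committed
        committed.append(tok)

    # 3. Every proposal matched: the verification pass also yields a bonus token.
    committed.append(target(prefix + committed))
    return committed


def generate(prefix: List[Token], draft: NextToken, target: NextToken,
             k: int, n_new: int) -> Tuple[List[Token], int]:
    """Generate n_new tokens; also count how many verify rounds were needed."""
    out = list(prefix)
    rounds = 0
    while len(out) - len(prefix) < n_new:
        out.extend(speculative_step(out, draft, target, k))
        rounds += 1
    return out[:len(prefix) + n_new], rounds


# Toy models (assumptions for the demo): the target counts upward mod 10,
# and the draft agrees with it except right after the token 3.
def toy_target(seq: List[Token]) -> Token:
    return (seq[-1] + 1) % 10


def toy_draft(seq: List[Token]) -> Token:
    return 0 if seq[-1] == 3 else (seq[-1] + 1) % 10


if __name__ == "__main__":
    tokens, rounds = generate([0], toy_draft, toy_target, k=4, n_new=12)
    print(tokens, rounds)
```

In this toy run the 12 new tokens match what the target alone would produce, but they arrive in 3 verify rounds instead of 12 sequential target steps.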

Speculative decoding attacks the core bottleneck of [inference](/glossary/inference): decode is sequential, with one expensive forward pass per token. Verification, by contrast, parallelizes: scoring K proposed tokens costs about as much as one ordinary large-model step. So if the draft model guesses well (and on predictable text like code it often does), each expensive step yields several tokens instead of one. The output distribution is **provably identical** to the target model's: under greedy decoding a rejected guess is replaced by the token the big model wanted anyway, and under sampling it is replaced by a draw from a corrected distribution.
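For the sampling case, the identical-distribution guarantee comes from a rejection-sampling rule: accept a drafted token with probability min(1, p/q), and otherwise resample from the normalized positive part of p − q, where p and q are the target and draft probabilities at that position. The sketch below shows that rule for a single position, plus a back-of-the-envelope estimate of tokens gained per target step. The function names are hypothetical, and the estimate assumes each drafted token is accepted independently with the same probability `alpha`, which real text only approximates.

```python
import random
from typing import Sequence, Tuple


def accept_or_resample(token: int, p: Sequence[float], q: Sequence[float],
                       rng: random.Random) -> Tuple[int, bool]:
    """Verify one drafted token.

    p: target-model probabilities at this position.
    q: draft-model probabilities at this position (token was sampled from q).
    Returns (committed_token, was_accepted).
    """
    # Accept with probability min(1, p[token] / q[token]).
    if rng.random() < min(1.0, p[token] / q[token]):
        return token, True
    # Rejected: resample from the residual max(0, p - q), renormalized.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    return rng.choices(range(len(p)), weights=residual, k=1)[0], False


def sample_one(p: Sequence[float], q: Sequence[float],
               rng: random.Random) -> int:
    """Draft one token from q, then verify it against p."""
    drafted = rng.choices(range(len(q)), weights=q, k=1)[0]
    token, _ = accept_or_resample(drafted, p, q, rng)
    return token


def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens committed per target step with k drafted tokens.

    Assumes (for illustration) an independent per-token acceptance
    probability alpha: 1 + alpha + ... + alpha**k.
    """
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


if __name__ == "__main__":
    rng = random.Random(0)
    p = [0.6, 0.3, 0.1]  # toy target distribution
    q = [0.2, 0.5, 0.3]  # toy draft distribution, deliberately different
    n = 100_000
    counts = [0] * len(p)
    for _ in range(n):
        counts[sample_one(p, q, rng)] += 1
    print([c / n for c in counts])  # empirical frequencies, close to p
    print(expected_tokens_per_step(0.8, 4))
```

Even though the toy draft distribution is far from the target's, the committed tokens follow p: a poor draft model costs speed, not correctness. Under the independence assumption, an acceptance rate of 0.8 with 4 drafted tokens works out to about 3.36 tokens per target step.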

It's one of a family of lossless or near-lossless serving accelerations, alongside [KV-cache](/glossary/kv-cache) management and [quantization](/glossary/quantization), that engines like [vLLM](/tools/vllm) and the major API providers run beneath the surface. Variants such as self-speculation and multi-token prediction heads (Medusa- and EAGLE-style approaches) build drafting into the target model itself, so there is no separate draft model to run and maintain. If you're serving models yourself, it's a standard tool on the [inference engineer's](/agents/data-ai/llm-inference-engineer) bench.

---

_Source: https://agentscamp.com/glossary/speculative-decoding — Term on AgentsCamp._
