Speculative Decoding
Speculative decoding speeds up generation: a small draft model proposes tokens, the large model verifies them in one parallel pass — same output, fewer steps.
Speculative decoding accelerates generation by pairing two models: a small, fast draft model proposes a run of tokens, and the large target model checks them all in a single parallel pass, keeping the longest accepted prefix and replacing the first rejected token with one of its own.
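It attacks the core bottleneck of inference: decoding is sequential, one expensive large-model step per token. Verification, though, is parallelizable: scoring K proposed tokens takes a single forward pass of the target model, which costs about the same as one ordinary decode step because decoding is typically limited by memory bandwidth rather than compute. So if the draft model guesses well (and on predictable text like code it often does), you bank several tokens per expensive step. With the standard accept/reject rule the output distribution is provably identical to the target model's: a rejected guess is replaced by a token resampled from the target's own distribution, corrected for what the draft proposed.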
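The accept/reject rule can be sketched in a few lines. This is a minimal illustration over a toy vocabulary, assuming the standard speculative-sampling rule (accept a drafted token with probability min(1, p/q), otherwise resample from the renormalized residual max(0, p - q)). The function name `verify_draft` and its argument layout are hypothetical, and real engines work on batched logits rather than Python lists.

```python
import random

def verify_draft(draft_tokens, draft_probs, target_probs, rng):
    """One verification step of speculative sampling (illustrative only).

    draft_tokens: K token ids proposed by the draft model.
    draft_probs:  K distributions (lists over the vocab) the draft sampled from.
    target_probs: K + 1 distributions from the target's single parallel pass;
                  the extra one follows the last drafted token.
    Returns the tokens emitted this step (between 1 and K + 1 of them).
    """
    out = []
    for i, tok in enumerate(draft_tokens):
        p, q = target_probs[i], draft_probs[i]
        # Accept the drafted token with probability min(1, p/q).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)
            continue
        # Rejected: resample from the residual max(0, p - q), then stop,
        # because later drafts were conditioned on the rejected token.
        residual = [max(0.0, pv - qv) for pv, qv in zip(p, q)]
        out.append(rng.choices(range(len(p)), weights=residual)[0])
        return out
    # Every draft accepted: take a bonus token from the target's last distribution.
    last = target_probs[-1]
    out.append(rng.choices(range(len(last)), weights=last)[0])
    return out
```

Each call emits at least one token, so a step is never worse than plain decoding in tokens produced; the gain comes from the steps where several drafts survive.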
It's one of a family of lossless or near-lossless serving accelerations — alongside KV-cache management and quantization — that engines like vLLM and the major API providers run beneath the surface; variants (self-speculation, multi-token prediction heads like Medusa/EAGLE-style approaches) trade draft-model overhead for built-in drafting. If you're serving models yourself, it's a standard tool on the inference engineer's bench.
Frequently asked questions
- Does speculative decoding change the model's output?
- No, and that's its defining property. The large model checks every drafted token, and any token it rejects is replaced by one drawn from its own distribution. With greedy decoding the output is the same sequence the large model would have picked; with sampling, the final sequence is distributed exactly as if the large model had generated alone. What you pay is the draft model's overhead plus the verification work spent on rejected tokens.
- When does it actually speed things up?
- When the draft model agrees with the target often enough — predictable text (code, structured output, boilerplate) accepts long runs; high-entropy creative text accepts fewer. Speedups of 2–3x are common in the good cases. It's a serving-side optimization: providers and engines like vLLM apply it under the hood.
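To make "often enough" concrete, here is a back-of-the-envelope estimate. It assumes a simplified model in which each drafted token is accepted independently with the same probability `alpha`; real acceptance rates vary token to token. The names `expected_tokens_per_step`, `estimated_speedup`, and `draft_cost` are hypothetical, chosen for illustration.

```python
def expected_tokens_per_step(alpha, k):
    """Expected tokens emitted per target pass when k tokens are drafted,
    assuming each draft is accepted independently with probability alpha."""
    if alpha >= 1.0:
        return float(k + 1)  # every draft accepted, plus the bonus token
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^k
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)

def estimated_speedup(alpha, k, draft_cost):
    """Rough speedup over plain decoding. draft_cost is one draft step's cost
    as a fraction of one target step (an assumed, illustrative parameter)."""
    # Each iteration costs k draft steps plus one target verification pass.
    return expected_tokens_per_step(alpha, k) / (k * draft_cost + 1.0)
```

Under these assumptions, an 80% acceptance rate with 4 drafted tokens and a draft model costing 5% of a target step gives about 3.36 tokens per target pass and roughly a 2.8x speedup. At a 40% acceptance rate the same setup yields only about 1.65 tokens per pass, so most of the gain disappears.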
Related
- Inference: Inference is running a trained model to produce output (for LLMs, generating tokens one at a time). Its cost and latency define the economics of AI products.
- KV Cache: The KV cache stores each token's attention keys and values so an LLM doesn't recompute the whole context per new token; it is the memory that makes generation fast.
- Quantization: Quantization shrinks a model by storing weights in lower precision (8-, 4-, even 2-bit), cutting memory and speeding inference at a small accuracy cost.
- vLLM: A high-throughput, memory-efficient inference and serving engine for LLMs, with PagedAttention, continuous batching, and an OpenAI-compatible API server.
- LLM Inference Engineer: Use this agent to serve and optimize self-hosted LLM inference: sizing GPUs, configuring a serving engine like vLLM (continuous batching, PagedAttention, tensor parallelism), applying quantization, and tuning throughput and tail latency against a cost and p95 budget. Examples: "serve Llama-3-70B at p95 under 2s on our GPUs", "our self-hosted model is slow and the GPUs sit half-idle — raise throughput", "quantize this model to fit one GPU without wrecking quality".