Speculative Decoding

Speculative decoding accelerates generation by pairing two models: a small, fast draft model proposes a run of tokens, and the large target model verifies them all in a single parallel pass, accepting the prefix it agrees with and replacing the first rejected token with its own.

It attacks the core bottleneck of inference: decode is sequential, one expensive step per token. Verification, though, is parallelizable: checking K proposed tokens costs about the same as one large-model step, because at small batch sizes a decode step is typically limited by reading the model's weights from memory rather than by arithmetic. So if the draft model guesses well (and on predictable text like code it often does), you bank several tokens per expensive step, with a provably identical output distribution, since rejected guesses are replaced by what the big model wanted anyway.
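The draft-then-verify loop can be sketched in a few lines of Python. This is a minimal illustration under greedy decoding, not how any particular engine implements it: the "models" are hypothetical toy functions (toy_target, toy_draft) that map a token prefix to the next token, and the verification loop stands in for what a real engine would do in one batched forward pass.

```python
from typing import Callable, List, Tuple

# A "model" here is any function mapping a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def toy_target(prefix: List[int]) -> int:
    """Hypothetical stand-in for the large model (deterministic toy rule)."""
    return (prefix[-1] * 7 + 3) % 11


def toy_draft(prefix: List[int]) -> int:
    """Hypothetical draft model: agrees with the target except at every 4th position."""
    t = toy_target(prefix)
    return t if len(prefix) % 4 else (t + 1) % 11


def speculative_step(prefix: List[int], draft: NextToken,
                     target: NextToken, k: int) -> List[int]:
    """One draft-then-verify round under greedy decoding."""
    # 1. The draft model proposes k tokens one at a time (cheap steps).
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # 2. The target gives its own choice at each of the k + 1 positions.
    #    A real engine computes these in a single parallel pass; we loop for clarity.
    target_choice = [target(prefix + proposed[:i]) for i in range(k + 1)]

    # 3. Accept the longest prefix on which both models agree, then append the
    #    target's token at the first mismatch (or a bonus token if all k match).
    n = 0
    while n < k and proposed[n] == target_choice[n]:
        n += 1
    return proposed[:n] + [target_choice[n]]


def generate(prefix: List[int], draft: NextToken, target: NextToken,
             k: int, max_new: int) -> Tuple[List[int], int]:
    """Generate max_new tokens; also return how many verify rounds were needed."""
    out = list(prefix)
    rounds = 0
    while len(out) - len(prefix) < max_new:
        out += speculative_step(out, draft, target, k)
        rounds += 1
    return out[:len(prefix) + max_new], rounds
```

Calling generate([1], toy_draft, toy_target, k=4, max_new=12) yields the same tokens as running toy_target alone twelve times, but in fewer verify rounds, because every emitted token is either a draft token the target agreed with or the target's own choice.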

It's one of a family of lossless or near-lossless serving accelerations, alongside KV-cache management and quantization, that engines like vLLM and the major API providers run beneath the surface. Variants such as self-speculation and multi-token prediction heads (Medusa- and EAGLE-style approaches) drop the separate draft model and build drafting into the target model itself. If you're serving models yourself, it's a standard tool on the inference engineer's bench.

Frequently asked questions

Does speculative decoding change the model's output?

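No, and that's its defining property. Under greedy decoding, the large model checks every drafted token and rejects the first one it wouldn't have produced, falling back to its own choice. Under sampling, each drafted token is accepted with probability min(1, p/q), where p and q are the probabilities the target and draft models assign to it, and a rejected token is resampled from the normalized positive part of p - q. Either way, the final sequence is distributed exactly as if the large model had generated alone; the only cost is the draft model's overhead.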
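For the sampling case, the standard accept-or-resample rule can be checked numerically. The sketch below assumes toy next-token distributions given as plain lists (p for the target, q for the draft); the function names are illustrative, not taken from any library.

```python
import random
from typing import Sequence


def sample(dist: Sequence[float], rng: random.Random) -> int:
    """Draw one token index from a (possibly unnormalized) distribution."""
    return rng.choices(range(len(dist)), weights=dist, k=1)[0]


def verify_token(p: Sequence[float], q: Sequence[float],
                 rng: random.Random) -> int:
    """Draft one token from q, then accept it or resample so the result follows p."""
    x = sample(q, rng)  # the draft model's proposal
    # Accept with probability min(1, p[x] / q[x]).
    if rng.random() < min(1.0, p[x] / q[x]):
        return x
    # On rejection, resample from the positive part of (p - q);
    # rng.choices normalizes the weights for us.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    return sample(residual, rng)
```

Drawing many tokens with, say, p = [0.6, 0.3, 0.1] and q = [0.2, 0.5, 0.3] gives empirical frequencies that track p rather than q, which is the sense in which the output is unchanged even though the draft model made the first guess.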

When does it actually speed things up?

When the draft model agrees with the target often enough. Predictable text (code, structured output, boilerplate) accepts long runs; high-entropy creative text accepts fewer, and if acceptance is low enough the drafting overhead can outweigh the gain. Speedups of 2–3x are common in the good cases. It's a serving-side optimization: providers and engines like vLLM apply it under the hood.
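A back-of-the-envelope model makes the dependence on acceptance rate concrete. The sketch below rests on simplifying assumptions that are not measurements: each drafted token is accepted independently with probability alpha, verifying K drafted tokens costs one target step, and one draft step costs a fraction draft_cost of a target step. The function and parameter names are illustrative.

```python
def expected_tokens_per_round(alpha: float, k: int) -> float:
    """Expected tokens emitted per verify round: 1 + alpha + ... + alpha**k."""
    if alpha >= 1.0:
        return float(k + 1)  # every draft accepted, plus the bonus token
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost: float) -> float:
    """Tokens per round divided by the round's cost in target-step units."""
    # One round = k draft steps + one target verification pass.
    round_cost = k * draft_cost + 1.0
    return expected_tokens_per_round(alpha, k) / round_cost
```

Under these assumptions, alpha = 0.8 with K = 4 and a draft model costing 5% of a target step gives about 3.36 tokens per round and roughly a 2.8x speedup, while alpha = 0.2 with a draft costing 20% of a target step comes out below 1x, meaning speculation would slow generation down.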

