Speculative / assisted decoding
Speculative decoding (a.k.a. assisted generation) cuts latency in the decode phase by having a small, fast “draft” model propose several tokens, which the large model then verifies in a single parallel forward pass, accepting the longest prefix that matches its own predictions. It directly exploits the spoke’s root fact that decode is memory-bandwidth-bound, not compute-bound: one pass over several candidate tokens reads the weights once, so it costs about the same as one ordinary decode step. Source: Hugging Face.
Why it’s free correctness-wise
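The big model still decides which tokens are accepted, so the output is unchanged: with greedy decoding the text is token-for-token identical, and with sampling the acceptance rule has to be designed to preserve the large model’s distribution. The large model validates proposals instead of generating each token serially, turning latency “from O(n) to O(1)” in the ideal case, where n is the number of generated tokens and every drafted token is accepted.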
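The draft-then-verify loop can be sketched in a few lines of Python for the greedy case. This is a toy illustration under stated assumptions, not the Hugging Face implementation: a "model" here is any function that maps a token list to its argmax next token, and `toy_target`, `toy_draft`, `greedy_decode`, and `speculative_greedy` are hypothetical names invented for this sketch. The single batched verification pass of a real model is simulated with a loop and counted as one pass.

```python
from typing import Callable, List, Tuple

# Assumption: a "model" is any function returning the greedy next token.
NextToken = Callable[[List[int]], int]


def greedy_decode(model: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one serial model call per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(model(seq))
    return seq


def speculative_greedy(
    target: NextToken, draft: NextToken, prompt: List[int], n_new: int, k: int = 4
) -> Tuple[List[int], int]:
    """Return (tokens, number of target verification passes)."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    passes = 0
    while len(seq) < goal:
        # 1. The cheap draft model proposes up to k tokens serially.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(seq))):
            proposal.append(draft(seq + proposal))
        # 2. The target scores every proposed position. A real model does
        #    this in one batched forward pass; here it is a loop counted once.
        passes += 1
        verified = [target(seq + proposal[:i]) for i in range(len(proposal) + 1)]
        # 3. Accept the longest prefix on which draft and target agree.
        n_ok = 0
        while n_ok < len(proposal) and proposal[n_ok] == verified[n_ok]:
            n_ok += 1
        seq.extend(proposal[:n_ok])
        # 4. The target's own token at the first mismatch (or after full
        #    acceptance) is always valid, so each pass yields at least one token.
        if len(seq) < goal:
            seq.append(verified[n_ok])
    return seq, passes


def toy_target(seq: List[int]) -> int:
    """Stand-in for the large model: a deterministic function of the context."""
    return (sum(seq) * 31 + 7) % 50


def toy_draft(seq: List[int]) -> int:
    """Stand-in for the draft model: agrees with the target except at every
    fifth context length."""
    guess = toy_target(seq)
    return (guess + 1) % 50 if len(seq) % 5 == 0 else guess
```

Calling `speculative_greedy(toy_target, toy_draft, [1, 2, 3], 20)` returns the same tokens as `greedy_decode(toy_target, [1, 2, 3], 20)`, but with fewer target passes than generated tokens. A wrong draft token only shortens the accepted prefix; it never changes the output.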
Numbers & requirements
Reported speedups: “Up to 3× with INT8, ~2× otherwise”, and “up to 10×” with memory offloading. It works best when the draft model is “at least an order of magnitude smaller” than the large model. Hard constraint: the assistant must share the exact same tokenizer, since the large model verifies the draft’s token IDs directly. This is the sampling × serving interaction the open questions flagged: the draft model couples token selection (token-sampling) to the serving loop.
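To see why the draft model's size matters, a back-of-the-envelope cost model may help. The sketch below is not from the source; it uses the standard simplification from the speculative decoding literature that each drafted token is accepted independently with probability `alpha`, and that verifying `k` tokens costs the same as one target pass (the memory-bound assumption above). The names `alpha`, `k`, and `draft_cost` are illustrative, and the quoted speedup figures were not derived from this model.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per target pass: the accepted draft tokens
    plus the one token the target always contributes.

    Assumes each draft token is accepted independently with probability
    alpha, so the expectation is 1 + alpha + ... + alpha**k.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost: float) -> float:
    """Rough speedup over plain decoding.

    draft_cost is the cost of one draft step relative to one target pass
    (e.g. 0.1 for a draft roughly ten times cheaper). Each round costs
    k draft steps plus one target pass.
    """
    return expected_tokens_per_pass(alpha, k) / (k * draft_cost + 1.0)
```

Under these assumptions, with `alpha=0.8` and `k=4`, a draft costing a tenth of the target gives an estimated speedup of about 2.4×, while a draft costing half as much as the target gives only about 1.1×. That is consistent with the guidance that the draft should be much smaller.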
Related
prefill-decode-kv-cache · token-sampling · llm-inference · kv-cache · vllm