

Defined Term · mechanism · updated 2026-06-09

Speculative / assisted decoding

Speculative decoding (a.k.a. assisted generation) speeds up the latency-bound decode phase by having a small, fast “draft” model propose several tokens, which the large model then verifies in a single parallel forward pass, accepting the longest prefix of the proposal that matches what it would have produced itself. It directly exploits the spoke’s root fact: decode is memory-bandwidth-bound, not compute-bound, so verifying many tokens in one pass costs little more than generating one. Source: Hugging Face.
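The propose-then-verify loop described above can be sketched with toy models. This is a minimal greedy-only illustration, not the Hugging Face implementation: `speculative_decode`, `greedy_decode`, `toy_target`, and `toy_draft` are hypothetical names, each "model" is just a function from a token list to its next token, and the single parallel verification pass is simulated by calling the target once per position while counting it as one pass.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for a model: maps a token sequence to its greedy next token.
NextToken = Callable[[List[int]], int]


def greedy_decode(model: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one model call per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(model(seq))
    return seq


def speculative_decode(
    target: NextToken, draft: NextToken, prompt: List[int], n_new: int, k: int = 4
) -> Tuple[List[int], int]:
    """Greedy speculative decoding; returns (sequence, number of target passes)."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    target_passes = 0
    while len(seq) < goal:
        # 1. The draft model proposes up to k tokens serially (assumed cheap).
        proposal: List[int] = []
        for _ in range(min(k, goal - len(seq))):
            proposal.append(draft(seq + proposal))

        # 2. The target scores every proposed position. A real model would do
        #    this in one batched forward pass; here it is simulated call by call.
        target_passes += 1
        verified = [target(seq + proposal[:i]) for i in range(len(proposal) + 1)]

        # 3. Accept the longest prefix on which draft and target agree.
        n_ok = 0
        while n_ok < len(proposal) and proposal[n_ok] == verified[n_ok]:
            n_ok += 1
        seq.extend(proposal[:n_ok])

        # 4. The target's own token at the first mismatch (or one extra token
        #    after a fully accepted proposal) comes from the same pass.
        if len(seq) < goal:
            seq.append(verified[n_ok])
    return seq, target_passes


def toy_target(seq: List[int]) -> int:
    return (seq[-1] * 3 + 1) % 7


def toy_draft(seq: List[int]) -> int:
    # Agrees with the target except when the context length is a multiple of 3.
    guess = toy_target(seq)
    return guess if len(seq) % 3 else (guess + 1) % 7


tokens, passes = speculative_decode(toy_target, toy_draft, [1], n_new=20, k=4)
```

Under these assumptions the result matches `greedy_decode(toy_target, [1], 20)` token for token, while the target is consulted in fewer passes than the 20 that serial decoding would need.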

Why it’s free correctness-wise

The big model still decides which tokens are accepted, so the output distribution is unchanged: it validates proposals rather than generating each token serially, turning latency “from O(n) to O(1)” in the ideal case, where n is the number of tokens and the cost counted is serial large-model forward passes.
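To put rough numbers on the gap between the ideal case and a realistic one, here is a back-of-the-envelope sketch. It rests on a simplifying assumption that is not from the source: each drafted token is accepted independently with a fixed probability `alpha`. Under that assumption a pass with `k` drafted tokens yields a geometric-series expectation of accepted tokens plus the one token the large model always contributes. The function names are hypothetical.

```python
import math


def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens produced per large-model pass.

    Assumes (simplification) each of the k drafted tokens is accepted
    independently with probability alpha; the pass always yields at least
    one token from the large model itself.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        return float(k + 1)  # every draft accepted, plus one bonus token
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^k
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_passes(n_tokens: int, alpha: float, k: int) -> int:
    """Rough estimate of large-model passes needed for n_tokens."""
    return math.ceil(n_tokens / expected_tokens_per_pass(alpha, k))
```

With `alpha = 0` this degenerates to one token per pass (plain serial decoding); with `alpha = 1` every pass yields `k + 1` tokens, which is the ideal case the quote refers to.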

Numbers & requirements

Reported speedups: “up to 3× with INT8, ~2× otherwise”, and “up to 10×” with memory offloading. It works best when the draft model is “at least an order of magnitude smaller.” Hard constraint: the assistant must share the exact same tokenizer as the large model. This is the sampling × serving interaction the open questions flagged: the draft model couples token selection (token-sampling) to the serving loop.
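A hedged sketch of how the tokenizer constraint might be checked before enabling assisted generation in Hugging Face `transformers`. The helper names `same_vocab` and `assisted_generate` are hypothetical, as are the model identifiers passed in. Comparing `get_vocab()` output is one plausible proxy for "same tokenizer", not an official check. The `assistant_model` argument to `generate` is the library's entry point for assisted generation.

```python
def same_vocab(tok_a, tok_b) -> bool:
    """Proxy check (assumption): identical token-to-id mappings."""
    return tok_a.get_vocab() == tok_b.get_vocab()


def assisted_generate(
    target_id: str, draft_id: str, prompt: str, max_new_tokens: int = 32
) -> str:
    """Generate with a large model, using a smaller one as the draft/assistant.

    target_id and draft_id are caller-supplied model identifiers.
    """
    # Imported lazily so the helper above is usable without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained(target_id)
    draft_tok = AutoTokenizer.from_pretrained(draft_id)
    if not same_vocab(tok, draft_tok):
        raise ValueError("draft and target tokenizers differ; drafted ids would not line up")

    target = AutoModelForCausalLM.from_pretrained(target_id)
    draft = AutoModelForCausalLM.from_pretrained(draft_id)

    inputs = tok(prompt, return_tensors="pt")
    # Passing assistant_model switches generate() to assisted generation.
    out = target.generate(**inputs, assistant_model=draft, max_new_tokens=max_new_tokens)
    return tok.decode(out[0], skip_special_tokens=True)
```

The check fails fast because the large model verifies token ids, not text: if the two vocabularies map ids differently, the draft's proposals would be meaningless to the verifier.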

prefill-decode-kv-cache · token-sampling · llm-inference · kv-cache · vllm