From Prompt to Prediction: Prefill, Decode, and the KV Cache
A code-first walk-through (MachineLearningMastery, intermediate; assumes familiarity with attention) of the two phases of transformer LLM inference and the cache that makes the second phase cheap.
What it covers
- Prefill phase. All prompt tokens are processed in parallel — each attends to itself and prior tokens via scaled dot-product attention, building contextual representations in one pass.
- Decode phase. Output tokens are generated one at a time, autoregressively. Each new token attends to the already-computed context; the model avoids recomputing it.
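- KV cache. Stores the keys and values computed during prefill; during decode the system appends only the new token’s K/V instead of recomputing K/V for all prior tokens. For a sequence of n tokens, this cuts the attention work per decode step from O(n²) (re-running attention over the whole sequence) to O(n) (one new query scored against n cached keys).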
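The cost claim in the KV cache bullet can be checked by simple counting. The sketch below is illustrative only: the function names `step_cost_without_cache` and `step_cost_with_cache` are hypothetical, and it counts just two kinds of work for a single attention head at a decode step where the sequence has grown to n tokens (K/V projections and query-key dot products), ignoring feed-forward layers, multiple heads, and memory traffic.

```python
def step_cost_without_cache(n: int) -> dict:
    """Work for one decode step if the whole n-token sequence is reprocessed."""
    return {
        # K and V are recomputed for every token in the sequence.
        "kv_projections": n,
        # Under a causal mask, token i scores against tokens 0..i,
        # so the total is 1 + 2 + ... + n.
        "score_dot_products": n * (n + 1) // 2,
    }


def step_cost_with_cache(n: int) -> dict:
    """Work for one decode step if earlier K/V are read from a cache."""
    return {
        # Only the newly generated token needs K and V computed.
        "kv_projections": 1,
        # One new query scores against n keys (n - 1 cached plus its own).
        "score_dot_products": n,
    }


if __name__ == "__main__":
    for n in (8, 64, 512):
        print(n, step_cost_without_cache(n), step_cost_with_cache(n))
```

The uncached count grows quadratically with n while the cached count grows linearly, which is the gap the bullet describes. The trade-off is memory: the cache itself grows by one K/V pair per token.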
Concrete example
PyTorch code demonstrating causal masking, simplified attention heads with explicit selection rules, context-vector computation during prefill, and KV-cache appending during decode.
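The article's own PyTorch code is not reproduced here. As a rough stand-in, the following is a minimal single-head sketch in plain Python (lists instead of tensors, so it runs without dependencies). All names (`prefill`, `decode_step`, the `cache` dictionary layout, the random weights) are assumptions made for illustration, not the article's implementation. It shows a causal mask applied during prefill, K/V appended to a cache during decode, and a check that the cached decode step matches a full recomputation.

```python
import math
import random

NEG_INF = float("-inf")


def matvec(W, x):
    """Apply a weight matrix (list of rows) to a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # exp(-inf) is 0.0, so masked slots vanish
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sum(weights, values):
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]


def prefill(xs, Wq, Wk, Wv):
    """Process every prompt token; return context vectors and the KV cache."""
    qs = [matvec(Wq, x) for x in xs]
    ks = [matvec(Wk, x) for x in xs]
    vs = [matvec(Wv, x) for x in xs]
    scale = math.sqrt(len(ks[0]))
    n = len(xs)
    contexts = []
    for i in range(n):
        # Causal mask: position i must not see any position j > i.
        scores = [dot(qs[i], ks[j]) / scale if j <= i else NEG_INF for j in range(n)]
        contexts.append(weighted_sum(softmax(scores), vs))
    return contexts, {"k": ks, "v": vs}


def decode_step(x_new, cache, Wq, Wk, Wv):
    """Append the new token's K/V to the cache, then attend over the cache."""
    q = matvec(Wq, x_new)
    cache["k"].append(matvec(Wk, x_new))
    cache["v"].append(matvec(Wv, x_new))
    scale = math.sqrt(len(q))
    scores = [dot(q, k) / scale for k in cache["k"]]
    return weighted_sum(softmax(scores), cache["v"])


random.seed(0)
d = 4  # toy embedding size


def rand_matrix():
    return [[random.uniform(-1, 1) for _ in range(d)] for _ in range(d)]


Wq, Wk, Wv = rand_matrix(), rand_matrix(), rand_matrix()
tokens = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(5)]

# Prefill on the first four tokens, then decode the fifth using the cache.
_, cache = prefill(tokens[:4], Wq, Wk, Wv)
cached_out = decode_step(tokens[4], cache, Wq, Wk, Wv)

# Reference: recompute everything for all five tokens and take the last row.
full_out, _ = prefill(tokens, Wq, Wk, Wv)
matches = all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(cached_out, full_out[-1]))
print("cached decode matches full recomputation:", matches)
```

In the decode step no mask is needed, because the cache only ever contains the current token and earlier ones. In a tensor library the per-row loop in `prefill` would typically become one matrix product with a lower-triangular mask.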
Takeaway
“Prefill warms up the KV cache and decode updates it.” This phase split explains why LLMs ingest long prompts efficiently (parallel prefill) yet emit output token-by-token (sequential decode), and why caching is essential to practical serving. It is also the foundation that continuous batching builds on.