llm-inference-wiki

Tech Article, updated Mon Jun 01 2026 (UTC)

From Prompt to Prediction: Prefill, Decode, and the KV Cache

A code-first walk-through (MachineLearningMastery, intermediate; assumes familiarity with attention) of the two phases of transformer LLM inference, prefill and decode, and of the KV cache that makes the second phase cheap.

What it covers

Concrete example

PyTorch code demonstrating causal masking, simplified attention heads with explicit selection rules, context-vector computation during prefill, and KV-cache appending during decode.
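
To make the prefill/decode split concrete, here is a minimal single-head sketch. It is not the article's code: it uses plain Python lists instead of PyTorch so that it runs without dependencies, and every name in it (prefill, decode_step, attend, the cache dictionary, the random weights) is a hypothetical chosen for illustration. In a real implementation prefill is one batched matrix operation over all prompt positions; the loop below only mirrors the result, not the parallelism.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, keys, values):
    """Scaled dot-product attention for one query over the given keys/values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]

def prefill(prompt, Wq, Wk, Wv):
    """Process the whole prompt; return per-position contexts and the KV cache."""
    keys = [matvec(Wk, x) for x in prompt]
    values = [matvec(Wv, x) for x in prompt]
    # Causal mask: position i may only attend to positions 0..i.
    contexts = [
        attend(matvec(Wq, x), keys[: i + 1], values[: i + 1])
        for i, x in enumerate(prompt)
    ]
    return contexts, {"k": keys, "v": values}

def decode_step(x_new, cache, Wq, Wk, Wv):
    """Project only the new token, append its K/V to the cache, then attend."""
    cache["k"].append(matvec(Wk, x_new))
    cache["v"].append(matvec(Wv, x_new))
    return attend(matvec(Wq, x_new), cache["k"], cache["v"])

# Toy setup (assumed sizes and random weights, for illustration only).
random.seed(0)
D = 4

def rand_matrix():
    return [[random.gauss(0, 1) for _ in range(D)] for _ in range(D)]

Wq, Wk, Wv = rand_matrix(), rand_matrix(), rand_matrix()
prompt = [[random.gauss(0, 1) for _ in range(D)] for _ in range(3)]
new_token = [random.gauss(0, 1) for _ in range(D)]

# Prefill warms up the cache; one decode step extends it by one entry.
contexts, cache = prefill(prompt, Wq, Wk, Wv)
ctx_cached = decode_step(new_token, cache, Wq, Wk, Wv)

# Reference: recompute everything from scratch over prompt + new token.
ctx_full, _ = prefill(prompt + [new_token], Wq, Wk, Wv)
```

The last row of ctx_full and ctx_cached should agree up to floating-point error, which is the point of the cache: the decode step reuses the stored keys and values instead of re-projecting the prompt.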

Takeaway

“Prefill warms up the KV cache and decode updates it.” This phase split explains why LLMs ingest long prompts efficiently (parallel prefill) yet emit output token by token (sequential decode), and why caching is essential to practical serving. It is also the foundation that continuous batching builds on.
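
A rough way to see why caching matters is to count how many token positions go through the key/value projections, per layer, while generating a response. The sketch below is back-of-the-envelope arithmetic under stated assumptions (one forward pass per generated token, at least one generated token, a hypothetical 1000-token prompt and 100 generated tokens); it is not a measurement and the function names are illustrative.

```python
def kv_projections_without_cache(prompt_len: int, new_tokens: int) -> int:
    """Each forward pass re-projects the prompt plus all tokens generated so far."""
    return sum(prompt_len + t for t in range(new_tokens))

def kv_projections_with_cache(prompt_len: int, new_tokens: int) -> int:
    """Prefill projects the prompt once; each later pass projects one new token.

    Assumes new_tokens >= 1 (the first token comes out of the prefill pass).
    """
    return prompt_len + (new_tokens - 1)

# Hypothetical workload: 1000-token prompt, 100 generated tokens.
without = kv_projections_without_cache(1000, 100)
with_cache = kv_projections_with_cache(1000, 100)
print(without)     # 104950
print(with_cache)  # 1099
```

Without a cache the count grows with the product of prompt length and output length (plus a quadratic term in the output length); with a cache it grows linearly in their sum.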