KV Cache
The KV cache (key–value cache) is the core optimization of the decode phase of llm-inference prefill-decode-kv-cache. During attention, each token produces a key (K) and value (V); without caching, generating token n would recompute K and V for all n prior tokens every step.
How it works
- Prefill computes K/V for all prompt tokens in parallel and stores them — “prefill warms up the KV cache.”
- Decode computes K/V for only the new token and appends it to the cache — “decode updates it.” Each step then attends over the cached K/V plus the new entry.
This drops the per-step attention cost from roughly O(n²) to O(n), which is what makes autoregressive generation practical.
Cost it shifts, not removes
The cache trades compute for memory: it grows linearly with sequence length × layers × heads and must be held in GPU memory for the whole generation. That memory pressure is a key reason serving needs smart scheduling — see continuous-batching, which manages many such per-request caches at once.
The pressure isn’t only the cache’s size but the fragmentation around it: when each request gets one contiguous allocation, dynamic growth/shrink wastes memory and caps the batch size. The PagedAttention paper (Kwon et al., SOSP 2023) is the primary source on this — it borrows OS paging to store the cache in non-contiguous blocks, cutting waste to near zero and letting requests share identical KV (a shared prefix, or parallel-sampling candidates). vllm is its production embodiment.