Defined Term mechanism updated Mon Jun 01 2026 00:00:00 GMT+0000 (Coordinated Universal Time)

KV Cache

The KV cache (key–value cache) is the core optimization of the decode phase of llm-inference prefill-decode-kv-cache. During attention, each token produces a key (K) and value (V); without caching, generating token n would recompute K and V for all n prior tokens every step.

How it works

Prefill computes K/V for all prompt tokens in parallel and stores them — “prefill warms up the KV cache.”
Decode computes K/V for only the new token and appends it to the cache — “decode updates it.” Each step then attends over the cached K/V plus the new entry.

This drops the per-step attention cost from roughly O(n²) to O(n), which is what makes autoregressive generation practical.

Cost it shifts, not removes

The cache trades compute for memory: it grows linearly with sequence length × layers × heads and must be held in GPU memory for the whole generation. That memory pressure is a key reason serving needs smart scheduling — see continuous-batching, which manages many such per-request caches at once.

The pressure isn’t only the cache’s size but the fragmentation around it: when each request gets one contiguous allocation, dynamic growth/shrink wastes memory and caps the batch size. The PagedAttention paper (Kwon et al., SOSP 2023) is the primary source on this — it borrows OS paging to store the cache in non-contiguous blocks, cutting waste to near zero and letting requests share identical KV (a shared prefix, or parallel-sampling candidates). vllm is its production embodiment.

KV Cache

How it works

Cost it shifts, not removes

Linked from