Scholarly Article source ↗ source url updated Tue Jun 16 2026 00:00:00 GMT+0000 (Coordinated Universal Time)

Efficient Memory Management for LLM Serving with PagedAttention (Kwon et al., 2023)

The primary research paper behind vllm and its PagedAttention technique — Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica, SOSP 2023 (arXiv:2309.06180). The vllm page is anchored to the maintained engine repo; this page anchors the algorithm and its measured serving results to the peer-reviewed origin — and it is exactly the inference-serving controlled study the spoke’s open questions had been missing (the flash-attention-paper supplied the training-side numbers; this supplies the serving side).

The problem: KV-cache memory is the serving bottleneck

The paper’s framing is that the kv-cache “is huge and grows and shrinks dynamically. When managed inefficiently, this memory can be significantly wasted by fragmentation and redundant duplication, limiting the batch size.” Because serving throughput is gated by how many requests fit in GPU memory at once, wasted KV memory is wasted throughput — the concrete, measured form of the KV-cache memory problem the wiki states qualitatively.

The mechanism: OS paging for the KV cache

PagedAttention is “an attention algorithm inspired by the classical virtual memory and paging techniques in operating systems.” Instead of one contiguous KV allocation per request, the cache is split into fixed-size blocks (pages) that need not be contiguous, which buys two things the abstract names directly:

Near-zero waste in KV-cache memory (fragmentation is largely eliminated).
Flexible KV-cache sharing within and across requests — e.g. a shared prompt prefix, or the parallel candidates in parallel sampling / beam search, stored once instead of duplicated.

The measured result (what the open questions wanted)

2–4× throughput over the prior state of the art (FasterTransformer and Orca) at the same latency — the first peer-reviewed, serving-side controlled comparison in the spoke (vs the qualitative “23× from batching” figure the vllm page already flags as technique-broad, not vLLM).
The gain is “more pronounced with longer sequences, larger models, and more complex decoding algorithms” — i.e. exactly where KV memory pressure bites hardest, tying the throughput win back to the memory mechanism rather than a generic kernel speedup.

Why it matters for the spoke

This closes the residual half of the quantified-benchmarks open question that flash-attention-paper left open: FlashAttention benchmarked training; PagedAttention benchmarks inference serving with a same-latency throughput comparison against named baselines. It also grounds the kv-cache memory-math gap from the serving side — the cost the cache imposes is not just its size but the fragmentation around it, and paging is the fix. It is the primary source the vllm page leaned on by name, now a first-class node.

Tier

T1 — peer-reviewed primary (SOSP 2023). Distinct artifact from the vllm engine repo: paper = the algorithm + measured serving study; repo = the evolving production engine. Figures here are from the paper’s abstract; the arXiv PDF body (exact fragmentation percentages, per-model tables) was not fully extracted, so only abstract-grounded numbers are asserted.

vllm · kv-cache · continuous-batching · flash-attention-paper · prefill-decode-kv-cache · llm-inference