Software Application · updated 2026-06-17

vLLM (PagedAttention)

vLLM is “a high-throughput and memory-efficient inference and serving engine for LLMs” (UC Berkeley Sky Computing Lab, now community-governed). It is the concrete serving engine behind several mechanisms this wiki describes abstractly, notably the production form of continuous-batching and the kv-cache-management technique PagedAttention. It also doubles as a near-complete catalogue of the datacenter inference regime: nearly every lever in synthesis ships in one codebase. Sources: the official vLLM GitHub repo (primary for this refresh) and the vLLM docs.

PagedAttention — the headline

PagedAttention applies OS-style virtual-memory paging to the kv-cache: instead of one contiguous allocation per request, KV blocks are stored as non-contiguous pages, giving “near-zero fragmentation” and enabling “KV-cache sharing” across requests (e.g. shared prompt prefixes). This directly attacks the memory pressure that the KV-cache memory problem identifies as the thing that makes serving hard: with less memory wasted, more concurrent requests fit in the same VRAM. The technique’s peer-reviewed origin and measured results live on paged-attention-paper (Kwon et al., SOSP 2023).
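The paging idea can be sketched in a few lines of Python. This is a toy model under stated assumptions, not vLLM's implementation: the names BlockPool, Sequence, and fork_prefix are hypothetical, the 16-token BLOCK_SIZE is an illustrative value, and only full blocks are shared so that no copy-on-write logic is needed. Each sequence keeps a block table mapping logical block positions to physical block ids, and a reference count lets two requests point at the same physical blocks for a common prompt prefix.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV block; an illustrative value, not from the source


class BlockPool:
    """A fixed pool of physical KV blocks, tracked by reference count."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # ids of unused physical blocks
        self.refcount = {}                   # block id -> number of sequences using it

    def allocate(self):
        if not self.free:
            raise MemoryError("no free KV blocks")
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def share(self, block):
        # A second sequence points at the same physical block.
        self.refcount[block] += 1
        return block

    def release(self, block):
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            del self.refcount[block]
            self.free.append(block)


@dataclass
class Sequence:
    pool: BlockPool
    block_table: list = field(default_factory=list)  # logical index -> physical block id
    num_tokens: int = 0

    def append_tokens(self, n):
        # A new physical block is taken only when the current one is full,
        # so blocks need not be contiguous or reserved in advance.
        for _ in range(n):
            if self.num_tokens % BLOCK_SIZE == 0:
                self.block_table.append(self.pool.allocate())
            self.num_tokens += 1

    def fork_prefix(self, prefix_tokens):
        # Share only the full blocks covering the prefix (assumption: no copy-on-write).
        full = min(prefix_tokens, self.num_tokens) // BLOCK_SIZE
        shared = [self.pool.share(b) for b in self.block_table[:full]]
        return Sequence(self.pool, shared, full * BLOCK_SIZE)

    def free(self):
        for block in self.block_table:
            self.pool.release(block)
        self.block_table = []
        self.num_tokens = 0


if __name__ == "__main__":
    pool = BlockPool(num_blocks=8)
    a = Sequence(pool)
    a.append_tokens(40)        # occupies 3 blocks
    b = a.fork_prefix(32)      # reuses a's first 2 blocks
    b.append_tokens(5)         # needs 1 new block of its own
    print(a.block_table, b.block_table, "blocks in use:", 8 - len(pool.free))
```

In this sketch the only unused slots are in each sequence's last, partly filled block, so waste is bounded by BLOCK_SIZE - 1 tokens per sequence rather than by the gap between a reserved maximum length and the actual length.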

The full lever set (per the GitHub repo)

The repo lists vLLM as bundling, in one engine, essentially every optimization this wiki tracks:

Serving: continuous-batching of incoming requests, chunked prefill (splitting the prefill phase), and prefix caching (KV reuse across requests).
Speculative decoding (speculative-decoding): multiple draft strategies, including n-gram, suffix, EAGLE, and DFlash.
Quantization (quantization): a wide format matrix (FP8, MXFP8/MXFP4, NVFP4, INT8, INT4, GPTQ, AWQ, GGUF), the data-type lever cutting across weights and KV.
Kernels & compilation: FlashAttention (flash-attention), FlashInfer, TRTLLM-GEN; piecewise and full CUDA/HIP graphs; graph-level transforms via torch.compile.
Parallelism: tensor, pipeline, data, expert (for MoE), and context parallelism.

So vLLM is where the wiki’s separate mechanism pages converge into a single production system. It embodies the datacenter regime of the two serving regimes (maximize GPU utilization across many concurrent users), the counterpart to llama-cpp’s on-device regime.
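To illustrate why the continuous-batching lever listed above raises GPU utilization, the following toy simulation counts decode steps under two scheduling policies. It is a simplified sketch, not vLLM's scheduler: the function names are hypothetical, each request is reduced to a number of output tokens, and prefill cost and KV memory limits are ignored.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes; short requests idle."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """A finished request's slot is refilled from the queue at the next step."""
    waiting = deque(lengths)
    running = []  # tokens still to generate for each active request
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots before this decode step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode step: every active request emits one token;
        # requests with nothing left leave the batch.
        running = [r - 1 for r in running if r - 1 > 0]
        steps += 1
    return steps


if __name__ == "__main__":
    # Illustrative workload: alternating short and long generations.
    lengths = [2, 8, 2, 8, 2, 8, 2, 8]
    print("static:", static_batching_steps(lengths, batch_size=2))
    print("continuous:", continuous_batching_steps(lengths, batch_size=2))
```

With this made-up workload the static schedule takes 32 decode steps and the continuous one 22, because short requests no longer hold a slot idle while a long neighbour finishes. The numbers describe only this toy input and say nothing about measured vLLM throughput.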

Breadth: hardware & models

Hardware: NVIDIA & AMD GPUs, x86/ARM/PowerPC CPUs, Google TPUs, Intel Gaudi, and other accelerators — far beyond a single-vendor GPU engine.
Models: “200+ model architectures” — decoder-only LLMs, mixture-of-experts, multimodal (LLaVA, Qwen-VL), and embedding models.

Benchmark caveat — partly closed

The widely cited figure, “continuous batching enables 23× throughput … while reducing p50 latency”, describes the batching technique broadly, not vLLM alone, and the GitHub README states “state-of-the-art serving throughput” without first-party benchmark numbers. The peer-reviewed origin now supplies a defensible number: the PagedAttention paper (Kwon et al., SOSP 2023) measures 2–4× throughput over FasterTransformer and Orca at the same latency, with the gap widening “with longer sequences, larger models, and more complex decoding.” So PagedAttention’s specific contribution is grounded; a which-lever-bought-what decomposition across vLLM’s full stack (batching vs paging vs quantization vs speculative decoding) is still the spoke’s standing open question.

Accessible secondary

For a from-scratch walkthrough of the headline ideas (the OS-paging analogy, the “reserved-but-idle tokens” waste framing, prefix/beam-search sharing), see how-does-vllm-work — a T3 explainer (Amit Shekhar / Outcome School). It adds no numbers; the measured results stay on paged-attention-paper.

kv-cache · continuous-batching · speculative-decoding · quantization · prefill-decode-kv-cache · llm-inference · flash-attention · llama-cpp · how-does-vllm-work