llm-inference-wiki

Synthesis — LLM Inference

The evolving thesis of this wiki. Sits above the schema.org pages. Records the current best understanding, open questions, and explicitly flagged contradictions.

Current understanding

llm-inference — running a trained model on a prompt — factors cleanly into three layers, and the founding three sources (all MachineLearningMastery code walk-throughs, all landing 2026-06-01) each take one layer:

  1. Execution: two phases. Prefill processes the whole prompt in parallel; decode then emits tokens one at a time, autoregressively (source: prefill-decode-kv-cache). This asymmetry is the root fact of inference economics: prefill is compute-bound and parallel; decode is sequential and latency-bound, since each step must finish before the next token can start.
  2. Token selection. At each decode step the logits are divided by the temperature and passed through softmax to form a distribution, and a token is drawn via token-sampling, optionally restricted to the top-k or top-p candidates (source: logits-softmax-sampling-walkthrough). This is the cheap part, layered on top of the heavy attention compute.
  3. Serving at scale. continuous-batching keeps the GPU busy across many concurrent, variable-length requests by dynamically refilling batch slots and packing unpadded tokens with a block-diagonal mask (source: continuous-batching-serving).
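
The token-selection layer in item 2 is small enough to sketch end to end. The following is a minimal, standard-library illustration, not the code from the cited walkthrough; the function names and the `rng` parameter are assumptions made for this example.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw logits into probabilities; a lower temperature sharpens them."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_filter(probs, k):
    """Keep the k most probable tokens, zero the rest, renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = set(order[:k])
    filtered = [p if i in kept else 0.0 for i, p in enumerate(probs)]
    total = sum(filtered)
    return [p / total for p in filtered]


def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = set(), 0.0
    for i in order:
        kept.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    filtered = [q if i in kept else 0.0 for i, q in enumerate(probs)]
    total = sum(filtered)
    return [q / total for q in filtered]


def sample_token(logits, temperature=1.0, k=None, p=None, rng=random):
    """One decode-step choice: temperature, optional top-k / top-p, then a draw."""
    probs = softmax_with_temperature(logits, temperature)
    if k is not None:
        probs = top_k_filter(probs, k)
    if p is not None:
        probs = top_p_filter(probs, p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

Passing a seeded `random.Random` instance as `rng` makes the draw reproducible, which is handy when comparing settings. Real engines apply the same steps to batched tensors on the accelerator rather than to Python lists.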

The unifying thread is the kv-cache. By storing past keys and values, it makes decode cheap per step: attention work per new token drops from O(n²) to O(n) in the sequence length n, because only the new token's query is scored against the cached keys. Its memory footprint is precisely what makes serving hard, which is the problem continuous-batching exists to manage. So the three sources are not three topics but one pipeline seen at three altitudes: how a token is chosen → how the model runs to produce it → how that run is shared across users.
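
The per-step saving can be checked on a toy. The sketch below assumes single-head attention with query, key and value vectors given directly (a real model derives them by projecting hidden states), and it counts query-key score computations as a stand-in for cost; all names are hypothetical.

```python
import math


def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    width = len(values[0])
    return [sum(w / total * v[j] for w, v in zip(weights, values)) for j in range(width)]


def decode_without_cache(qs, ks, vs):
    """At every step, recompute causal attention for all positions so far."""
    score_count, last_output = 0, None
    for step in range(1, len(qs) + 1):
        for pos in range(step):  # earlier positions are redone each step
            last_output = attend(qs[pos], ks[:pos + 1], vs[:pos + 1])
            score_count += pos + 1
    return last_output, score_count


def decode_with_cache(qs, ks, vs):
    """Append each new key/value to a cache; score only the newest query."""
    k_cache, v_cache = [], []
    score_count, last_output = 0, None
    for q, k, v in zip(qs, ks, vs):
        k_cache.append(k)
        v_cache.append(v)
        last_output = attend(q, k_cache, v_cache)
        score_count += len(k_cache)
    return last_output, score_count
```

Both functions return the same output for the final position, but for four tokens the cached path computes 10 scores against 20 for the uncached one, and the gap widens with length. The price is that `k_cache` and `v_cache` grow by one entry per token and must stay resident for the life of the request.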

Open questions

  • Quantified, neutral benchmarks. Most claims here are from intermediate tutorials with toy/demo setups (e.g. the 6.5× continuous-batching figure, the O(n) KV-cache claim). Partly closed (2026-06-15): flash-attention-paper (Dao et al., NeurIPS 2022) is the first peer-reviewed primary with hardware/model-specific numbers: a 15% end-to-end speedup on BERT-large, 3× on GPT-2, 2.4× on Long-Range Arena, plus a formal IO-complexity proof. It benchmarks training, though. Serving side now grounded (2026-06-16): paged-attention-paper (Kwon et al., SOSP 2023) is the peer-reviewed inference-serving study, reporting 2–4× throughput over FasterTransformer/Orca at equal latency, attributed to near-zero KV fragmentation. What remains open is narrower: a which-lever-bought-what decomposition across vLLM’s full stack (paging vs batching vs quantization vs speculative decoding), not the existence of any controlled serving number. (A new T3 explainer, how-does-vllm-work (Amit Shekhar / Outcome School, 2026-06-17), restates the PagedAttention/continuous-batching mechanics cleanly but ships no numbers; it is a useful on-ramp to vllm, not movement on this gap.)
  • Sampling × serving interaction. The sources treat sampling and batching independently. Does aggressive batching constrain per-request sampling (shared temperature, speculative decoding)? Unaddressed.
  • Memory math. kv-cache growth (2 for keys and values × length × layers × KV heads × head dimension × bytes per element) and its cap on batch size / context length is asserted qualitatively but never quantified by the sources here. The attention side of the memory story is now grounded: flash-attention-paper proves attention memory is linear, not quadratic, in sequence length. The serving cost is grounded too: paged-attention-paper shows the binding problem is fragmentation of the per-request cache, not just its raw size, and that OS-style paging recovers the wasted memory as batch capacity. The KV-cache decode-step byte math itself remains unquantified in the sources.
  • Beyond the basics. Added 2026-06-09: vllm (PagedAttention = OS-paging for the kv-cache, near-zero fragmentation + KV sharing), flash-attention (IO-aware exact attention, the prefill/decode compute core), and speculative-decoding (draft-and-verify; the concrete sampling × serving coupling). Quantization (int4/FP8) was still absent at that point. Added 2026-06-10: quantization (the data-type lever: int8/int4/FP8, GPTQ/AWQ/bitsandbytes/GGUF; shrinks weights and the kv-cache, the fourth production lever after the three algorithmic ones) and llama-cpp (its flagship embodiment). The additions reframe the pipeline: the founding three explained what each layer does; the 06-09 trio explained how production engines make each layer cheap; quantization adds the orthogonal make-the-numbers-smaller axis that cuts across all layers.
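
On the memory-math question, the sources give no numbers, but the standard back-of-envelope accounting is easy to write down. The sketch below counts one key and one value vector per token, per layer, per KV head. The model dimensions in the example (32 layers, 32 KV heads, head dimension 128, 2-byte elements) and the 40 GiB budget are illustrative assumptions, not figures from any page in this wiki.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   bytes_per_elem=2, batch_size=1):
    """Bytes of KV cache held for batch_size requests of seq_len tokens each."""
    # The leading 2 covers one key tensor plus one value tensor per layer.
    return 2 * batch_size * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_elem


def max_batch_size(memory_budget_bytes, seq_len, **model_dims):
    """How many seq_len-token requests fit in the memory left over for the cache."""
    return memory_budget_bytes // kv_cache_bytes(seq_len, **model_dims)


# Hypothetical model dimensions, chosen only to make the arithmetic concrete.
dims = dict(n_layers=32, n_kv_heads=32, head_dim=128)

per_token = kv_cache_bytes(1, **dims)        # 524288 bytes = 512 KiB per token
per_request = kv_cache_bytes(4096, **dims)   # 2147483648 bytes = 2 GiB at 4096 tokens
fits = max_batch_size(40 * 2**30, 4096, **dims)  # 20 requests in a 40 GiB budget
```

Under these assumptions the cache, not the weights, sets the ceiling on concurrent long-context requests. Halving `bytes_per_elem` (KV quantization) or reducing `n_kv_heads` doubles the batch that fits.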

The two serving regimes (added 2026-06-10)

Inference now reads as two distinct regimes, not one. vllm + continuous-batching + flash-attention describe the datacenter regime, where the goal is maximizing GPU utilization across many concurrent users (the binding constraint is keeping an expensive GPU busy). llama-cpp + quantization describe the on-device / single-user regime, where the binding constraint is fitting the model in limited memory at all; that is solved not by batching but by shrinking precision (GGUF 2–8-bit). Georgi Gerganov’s llama.cpp (the core of Ollama/LM Studio) is the canonical instance. So the same pipeline (prefill/decode, KV cache, sampling) runs at both ends of the hardware spectrum, with batching the datacenter lever and quantization the edge lever. Quantization is also the one that appears in both (KV-cache + weight quantization help the GPU regime too).
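
To make the edge-regime constraint concrete, here is a rough sketch of how precision sets the weight footprint, plus a generic symmetric absmax quantizer. The 7-billion-parameter count is an assumed example, and the quantizer is a textbook scheme for illustration only; it is not GGUF's actual block formats, which also store per-block scales that this footprint estimate ignores.

```python
def weight_bytes(n_params, bits_per_weight):
    """Approximate weight storage, ignoring scale/zero-point metadata."""
    return n_params * bits_per_weight / 8


def quantize_absmax(weights, bits=8):
    """Map floats to signed integers using one shared scale (symmetric absmax)."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax or 1.0  # avoid a zero scale
    return [round(w / scale) for w in weights], scale


def dequantize(q_values, scale):
    """Recover approximate floats; each is off by at most about scale / 2."""
    return [q * scale for q in q_values]


# Assumed 7B-parameter model at several precisions.
for bits in (16, 8, 4, 2):
    print(f"{bits:>2}-bit: {weight_bytes(7_000_000_000, bits) / 1e9:.2f} GB")
# 16-bit: 14.00 GB
#  8-bit: 7.00 GB
#  4-bit: 3.50 GB
#  2-bit: 1.75 GB
```

The arithmetic shows why precision, not batching, is the lever on a laptop: the same assumed model that needs 14 GB at 16 bits needs 3.5 GB at 4 bits, at the cost of the rounding error that `dequantize` cannot undo.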

vLLM as the datacenter regime’s convergence point (refreshed from the GitHub repo, 2026-06-11). The vllm page, previously grounded only in the docs, is now anchored to its primary source, and the repo makes plain that vLLM is not one lever but all of them in a single engine: PagedAttention + continuous-batching + chunked prefill + prefix caching + speculative-decoding (n-gram/suffix/EAGLE/DFlash) + quantization (FP8/MXFP4/NVFP4/INT8/INT4/GPTQ/AWQ/GGUF) + flash-attention kernels + tensor/pipeline/expert/context parallelism. So the wiki’s separate mechanism pages are not a list of alternatives: in production they stack inside the same system. vLLM also reaches well beyond NVIDIA (AMD, CPU, TPU, Gaudi) and 200+ model architectures incl. MoE & multimodal, so the datacenter regime is now a portable software stack, not a GPU-vendor story. This sharpens the standing benchmark question: with every lever bundled, an honest “which lever bought what” attribution needs a controlled, hardware-specified study; the README’s “state-of-the-art throughput” stays unquantified.
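
Of the levers in that stack, the paging idea is the easiest to illustrate in isolation. The toy below compares reserving a maximum-length contiguous region per request with allocating fixed-size blocks on demand. It is a sketch of the general OS-paging idea only, not vLLM's implementation; the class and function names, the request lengths, and the 16-token block size are all assumptions for this example.

```python
import math


def contiguous_waste(actual_lens, max_len):
    """Token slots reserved but unused when every request pre-reserves max_len."""
    return sum(max_len - n for n in actual_lens)


def paged_waste(actual_lens, block_size):
    """Unused slots with on-demand blocks: only the tail of each last block."""
    return sum(math.ceil(n / block_size) * block_size - n for n in actual_lens)


class BlockAllocator:
    """Toy block table mapping each request to a list of physical block ids."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))
        self.tables = {}   # request id -> physical block ids, in logical order
        self.lengths = {}  # request id -> tokens stored so far

    def append_token(self, req):
        n = self.lengths.get(req, 0)
        if n % self.block_size == 0:  # no block yet, or the last one is full
            if not self.free:
                raise MemoryError("no free KV blocks")
            self.tables.setdefault(req, []).append(self.free.pop())
        self.lengths[req] = n + 1

    def release(self, req):
        """Return a finished request's blocks to the pool for reuse."""
        self.free.extend(self.tables.pop(req, []))
        self.lengths.pop(req, None)


lens = [100, 900, 37]                     # hypothetical request lengths
wasted_reserved = contiguous_waste(lens, 2048)  # 5107 slots
wasted_paged = paged_waste(lens, 16)            # 35 slots
```

In this made-up workload, paging leaves 35 idle slots where max-length reservation leaves 5107, and the recovered slots are what become extra batch capacity. A measured attribution across the rest of the stack would still need the controlled study the paragraph above calls for.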

Contradictions flagged

None yet — the three founding sources are complementary, not competing.

Cross-spoke adjacency

  • research-wiki holds the model-substrate thread (capability & cost of frontier models: claude-opus-4-8, Anthropic). This wiki is the mechanism layer beneath that business/capability story — how inference actually runs and what it costs in compute and memory. The hub router split these deliberately: research-wiki is tools-for-thought / agentic products; this spoke is inference internals. Watch for sources that bridge them (e.g. inference cost driving product economics).
  • webperf-wiki shares a latency/efficiency sensibility (byte budgets there, GPU cycles here) but the domains don’t overlap in subject matter.

Index — LLM Inference Wiki

Catalog of all pages, grouped by @type. The spine: synthesis (thesis), log.md (history), this file (catalog).

DefinedTerm (concepts / mechanisms)

  • llm-inference — umbrella: turning a prompt into tokens with a trained model; the three-layer pipeline. · domain
  • token-sampling — logits → softmax → temperature → top-k / top-p; the per-step token choice. · mechanism
  • kv-cache — key/value cache; the decode-phase optimization (O(n²)→O(n), at a memory cost). · mechanism
  • continuous-batching — in-flight batching; serving-layer optimization for concurrent requests. · mechanism
  • flash-attention — IO-aware exact attention; tiling/fusion avoids the N×N matrix in HBM · source · mechanism
  • speculative-decoding — draft model proposes, big model verifies in parallel; same output distribution · source · mechanism
  • quantization — lower-precision weights/activations/KV (int8/int4/FP8; GPTQ/AWQ/GGUF); the data-type lever, the fourth production optimization · source · mechanism

SoftwareApplication (engines)

  • vllm — high-throughput inference/serving engine; PagedAttention (OS-paging for the KV cache); bundles every datacenter lever in one stack (200+ models, multi-vendor HW); the datacenter regime · source
  • llama-cpp — local/on-device C/C++ engine (Gerganov); GGUF 2–8-bit quantization; the edge regime (core of Ollama/LM Studio) · source

ScholarlyArticle (source summaries, source: true)

  • flash-attention-paper — Dao et al., FlashAttention (NeurIPS 2022); IO-aware exact attention, the primary paper + quantified speedups · source · T1 · arxiv.org
  • paged-attention-paper — Kwon et al., PagedAttention/vLLM (SOSP 2023); OS-paging for the KV cache; 2–4× serving throughput over FasterTransformer/Orca — the inference-serving primary · source · T1 · arxiv.org

TechArticle (source summaries, source: true)