Log — LLM Inference Wiki

Append-only history. Each entry starts with ## [YYYY-MM-DD] <op> | <title> where <op> is ingest, query, lint, or split, so grep "^## \[" log.md | tail -5 works.
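The header convention above can also be checked programmatically. The following is a minimal sketch, not part of the wiki tooling: `HEADER_RE`, `last_entries`, and the second sample entry are hypothetical names and data introduced here for illustration. It mirrors what the grep one-liner does, and additionally splits each header into date, op, and title.

```python
import re

# Matches the entry-header convention described above:
#   ## [YYYY-MM-DD] <op> | <title>
HEADER_RE = re.compile(
    r"^## \[(\d{4}-\d{2}-\d{2})\] (ingest|query|lint|split) \| (.+)$"
)


def last_entries(log_text, n=5):
    """Return the last n (date, op, title) tuples, in file order."""
    entries = []
    for line in log_text.splitlines():
        match = HEADER_RE.match(line)
        if match:
            entries.append(match.groups())
    return entries[-n:]


# The second header below is a made-up example entry.
sample = """\
## [2026-06-01] split | llm-inference-wiki created from _inbox cluster (3 sources)
Body text between headers is ignored.
## [2026-06-02] lint | example title
"""

for date, op, title in last_entries(sample):
    print(date, op, title)
```

Lines that do not start with `## [` are skipped, so a header that loses its prefix would silently drop out of both this parser and the grep.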

## [2026-06-01] split | llm-inference-wiki created from _inbox cluster (3 sources)

Spun out by the hub router when three MachineLearningMastery walk-throughs arrived in one Telegram burst, all on LLM inference mechanics — too specific for any existing spoke (research-wiki is tools-for-thought / agentic products, not inference internals). Scaffolded from CLAUDE.template.md; domain = the mechanics of LLM inference and serving. Ingested all three (URL-only, source: true + url:):

logits-softmax-sampling-walkthrough → concept token-sampling
prefill-decode-kv-cache → concepts llm-inference, kv-cache
continuous-batching-serving → concept continuous-batching

Created 4 concept pages (DefinedTerm) + 3 source summaries (7 total). Synthesis frames the three as one pipeline at three altitudes (choose a token → run the model → share the GPU), unified by the kv-cache. Cross-spoke adjacency to research-wiki (model substrate) noted. Open questions: neutral benchmarks, sampling×serving interaction, KV-cache memory math.

## [2026-06-09] ingest | +3 “beyond the basics” (vLLM/PagedAttention, FlashAttention, speculative decoding), all-spokes cron test

Filled the synthesis “beyond the basics” open question with three authoritative sources: vllm (SoftwareApplication, src — PagedAttention = OS-paging the kv-cache, near-zero fragmentation + KV sharing; continuous-batching engine), flash-attention (DefinedTerm, src — IO-aware exact attention, tiling/fusion, 2–4× + linear memory), speculative-decoding (DefinedTerm, src — draft-and-verify, unchanged output distribution; the concrete sampling×serving coupling). Synthesis open question struck through (quantization still open); index gains a SoftwareApplication group. url-only. 7 → 10 pages.
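The "unchanged output distribution" property of draft-and-verify can be checked with a few lines of arithmetic. The sketch below assumes the standard acceptance rule from the speculative-sampling literature (accept a drafted token x with probability min(1, p(x)/q(x)), otherwise resample from the renormalized residual max(p − q, 0)); the log itself does not spell this rule out. The function name and the distributions `p` and `q` are made up for illustration.

```python
def speculative_output_distribution(p, q):
    """Exact distribution of the token emitted by one draft-and-verify step.

    p: target-model probabilities, q: draft-model probabilities (same vocab).
    Assumes the standard rule: accept with probability min(1, p/q),
    else resample from the renormalized residual max(p - q, 0).
    """
    # Probability that the draft proposes token x AND the target accepts it.
    accepted = [
        qx * min(1.0, px / qx) if qx > 0 else 0.0
        for px, qx in zip(p, q)
    ]
    reject_mass = 1.0 - sum(accepted)

    # On rejection, the emitted token comes from the residual distribution.
    residual = [max(px - qx, 0.0) for px, qx in zip(p, q)]
    total = sum(residual)
    if total == 0.0:
        # p and q coincide: nothing is ever rejected.
        return accepted
    return [a + reject_mass * r / total for a, r in zip(accepted, residual)]


# Hypothetical 3-token vocabulary; the draft model disagrees with the target.
p = [0.6, 0.3, 0.1]  # target model
q = [0.2, 0.5, 0.3]  # draft model
out = speculative_output_distribution(p, q)
print(out)
```

Under this rule the emitted-token distribution equals `p` up to floating-point error, regardless of how poor the draft `q` is. A worse draft only lowers the acceptance rate, and with it the speedup.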

## [2026-06-10] ingest | Quantization + llama.cpp: all-spokes pass (the data-type lever + the edge regime)

Two new pages closing the explicitly named “quantization still absent” gap. quantization (DefinedTerm, source, HF docs): lower-precision weights/activations/KV (int8/int4/FP8 from fp16/bf16; weight-only vs weight+activation; post-training GPTQ/AWQ/bitsandbytes/GGUF + FP8 + extreme 1–2-bit AQLM/VPTQ). The fourth production lever after PagedAttention/flash-attention/speculative-decoding, but orthogonal: it acts on the data type and also shrinks the kv-cache (relaxing the continuous-batching bottleneck). llama-cpp (SoftwareApplication, source, Wikipedia): Gerganov’s dependency-free C/C++ engine, the de facto core of Ollama/LM Studio; GGUF 2–8-bit format; runs quantized models on CPU/consumer GPU. Together they introduce two serving regimes, datacenter (vLLM + batching, keep the GPU busy) vs on-device (llama.cpp + quantization, fit the model at all), added as a new synthesis section. Folded into synthesis (open-Q resolved + “two serving regimes” section) + index (new DefinedTerm + SoftwareApplication rows). No contradictions. 10 → 12 pages.
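To make "also shrinks the kv-cache" concrete, here is a rough sizing sketch. It uses the usual count of one key and one value vector per layer, per KV head, per token; the model shape is hypothetical and not taken from any source in this log, and the calculation ignores the scale/zero-point metadata that real quantized caches carry.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size, bytes_per_value):
    """Approximate KV-cache size in bytes for a decoder-only transformer."""
    # Factor 2: one key tensor and one value tensor per layer.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


# Hypothetical model shape, chosen only to give round numbers.
shape = dict(num_layers=32, num_kv_heads=32, head_dim=128,
             seq_len=4096, batch_size=1)

fp16_bytes = kv_cache_bytes(**shape, bytes_per_value=2)  # 16-bit values
int8_bytes = kv_cache_bytes(**shape, bytes_per_value=1)  # 8-bit values

GIB = 1024 ** 3
print(f"fp16 KV cache: {fp16_bytes / GIB:.1f} GiB")
print(f"int8 KV cache: {int8_bytes / GIB:.1f} GiB")
```

For this assumed shape the cache is 2 GiB per 4096-token sequence at fp16 and 1 GiB at int8, so the same memory budget holds roughly twice as many concurrent sequences. That is the sense in which quantization relaxes the continuous-batching bottleneck.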

## [2026-06-11] ingest | vLLM GitHub repo (github.com/vllm-project/vllm)

Telegram drop, hub-routed → llm-inference-wiki (clean single match: vLLM/PagedAttention are this spoke’s founding subject). Re-seen subject: vllm already existed as a docs-URL ingest → refreshed in place, re-anchored to its primary source (the GitHub repo), with url: switched docs→repo (updated: 2026-06-11). Folded in the repo’s fuller picture. vLLM bundles every datacenter lever in one engine: PagedAttention + continuous-batching + chunked prefill + prefix caching + speculative-decoding (n-gram/suffix/EAGLE/DFlash) + quantization (FP8/MXFP4/NVFP4/INT8/INT4/GPTQ/AWQ/GGUF) + flash-attention kernels + tensor/pipeline/expert/context parallelism, spanning NVIDIA/AMD/CPU/TPU/Gaudi and 200+ model architectures (MoE, multimodal). synthesis.md: added “vLLM as the datacenter regime’s convergence point” note under the two-regimes section; index entry broadened. No new pages (idempotent refresh). Benchmark caveat preserved (no first-party numbers). Site rebuild follows.

## [2026-06-15] ingest | FlashAttention primary paper (Dao et al. 2022): T1 anchor + quantified benchmarks

Quality cycle, T1-floor raise. The flash-attention page was anchored to the implementation repo; added the primary paper as a distinct source: flash-attention-paper (ScholarlyArticle, T1) — Dao, Fu, Ermon, Rudra, Ré, FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness, NeurIPS 2022 (arXiv:2205.14135). Supplies the hardware/model-specific numbers the synthesis open questions wanted: 15% BERT-large / 3× GPT-2 / 2.4× long-range-arena training speedups, linear (not quadratic) attention memory, Path-X 61.4% / Path-256 63.1%, formal IO-complexity proof. Partly closes the quantified benchmarks + memory math open questions (training-side; serving-side still open). Concept/paper split mirrors optimization-wiki’s NFL pattern. Found via WebSearch; figures from the abstract (PDF body not fully extracted) — noted on the page. Linked from concept page, synthesis (2 open qs), index (new ScholarlyArticle section). 1 new page.

## [2026-06-16] ingest | PagedAttention paper (Kwon et al., SOSP 2023): quality-cycle floor-raise

Closed the spoke’s standing inference-serving benchmark gap. Added paged-attention-paper (ScholarlyArticle, source:true, T1, arXiv:2309.06180), the peer-reviewed primary behind vllm’s PagedAttention: OS-paging for the kv-cache, near-zero fragmentation + KV sharing, 2–4× throughput over FasterTransformer/Orca at equal latency. Refreshed vllm (Benchmark-caveat now cites the 2–4× primary; PagedAttention headline links the paper) and kv-cache (fragmentation/serving-cost grounded) in place; folded into synthesis open questions (the serving controlled-study half of “quantified benchmarks” now closed, only the full which-lever decomposition remains; “memory math” serving cost grounded). Numbers are abstract-only (PDF body not extracted); recorded on the page. Entity discovery: no new nodes; authors named inline (mirrors flash-attention-paper precedent); SOSP venue too thin to page. +1 page (→14). Highest-value action this cycle: raises the spoke’s T1 floor and answers a named gap.

## [2026-06-17] ingest | How does vLLM work? (Amit Shekhar / Outcome School): accessible secondary

Hub-routed from Telegram (outcomeschool.com/blog/how-does-vllm-work). Clean single-spoke route: vLLM inference mechanics is this spoke’s core. Dedup: subject already paged (vllm T1 repo + paged-attention-paper T1). Added how-does-vllm-work (TechArticle, source:true, T3; a single-author educational blog post, no benchmarks/first-party numbers). Gap-relevance: does NOT advance the standing open question (which-lever-bought-what benchmark decomposition); recorded as a pedagogical on-ramp only (OS-paging analogy, the “50-of-2000 tokens reserved-but-idle” fragmentation framing, prefix/beam-search sharing). Integrated: cross-linked into vllm (“Accessible secondary” section + Related) and noted in synthesis under the benchmark open-question (a no-numbers secondary, not movement on the gap). Entity discovery: no nodes minted; author Amit Shekhar / publisher Outcome School are low graph-signal for an inference-mechanics spoke (no existing entity-index match), recorded inline (same call as prior T3 content-site authors). Ran avoid-ai-writing over the new prose. +1 page (14 → 15).
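The fragmentation framing can be put in numbers. This sketch reads “50-of-2000” as a request that has produced 50 tokens while a contiguous allocator has reserved room for 2000; that reading is an assumption here, as is the block size of 16 tokens. The function names are hypothetical.

```python
import math


def contiguous_waste(tokens_used, tokens_reserved):
    """Idle KV slots when the full reservation is allocated up front."""
    return tokens_reserved - tokens_used


def paged_waste(tokens_used, block_size):
    """Idle KV slots when fixed-size blocks are allocated on demand."""
    blocks = math.ceil(tokens_used / block_size)
    # Only the tail of the last block can be idle.
    return blocks * block_size - tokens_used


# Assumed reading of the "50-of-2000" example; block size 16 is also assumed.
idle_contiguous = contiguous_waste(tokens_used=50, tokens_reserved=2000)
idle_paged = paged_waste(tokens_used=50, block_size=16)

print("contiguous idle slots:", idle_contiguous)
print("paged idle slots:", idle_paged)
```

With these assumed numbers, contiguous reservation leaves 1950 slots idle, while on-demand paging leaves 14 (four blocks of 16 hold the 50 tokens). Paged waste is bounded by one block per sequence, which is the sense of “near-zero fragmentation” in the entries above.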