

Scholarly Article · updated 2026-06-15

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (Dao et al., 2022)

The primary research paper behind flash-attention — Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré, NeurIPS 2022 (arXiv:2205.14135). The flash-attention page is anchored to the maintained implementation repo; this page anchors the algorithm and its quantified analysis to the peer-reviewed origin, and supplies the hard numbers the spoke’s open questions (memory math, quantified neutral benchmarks) had been missing.

The thesis: attention is IO-bound, not FLOP-bound

The paper’s core argument is that prior attention work optimized FLOPs while the real bottleneck is memory traffic between GPU HBM (large, slow) and on-chip SRAM (small, fast). The missing principle is making attention IO-aware, that is, counting reads and writes across the memory hierarchy. FlashAttention uses tiling (and recomputation in the backward pass) to compute exact attention without ever materializing the full N×N attention matrix in HBM. The authors prove that it makes fewer HBM accesses than standard attention and that it is optimal for a range of SRAM sizes. It is “exact,” not approximate: it computes the same function as standard attention, with outputs matching up to floating-point rounding, so the speedup comes with no quality tradeoff.
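The tiling idea can be illustrated with a minimal NumPy sketch. This is not the paper's CUDA kernel: it runs on the CPU, tiles only over key/value blocks (the real algorithm also tiles queries), and the names `standard_attention`, `tiled_attention`, and `block` are hypothetical. It only shows why processing K and V block by block, with a running row maximum and row sum, can reproduce the softmax result without holding the full N×N score matrix at once.

```python
import numpy as np

def standard_attention(q, k, v):
    """Reference version: builds the full N x N score matrix."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # stabilize the softmax
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs @ v

def tiled_attention(q, k, v, block=16):
    """Illustrative tiled version: visits K/V one block at a time and keeps
    only a running max, a running sum, and an unnormalized output per row."""
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros((n, v.shape[-1]))
    row_max = np.full((n, 1), -np.inf)
    row_sum = np.zeros((n, 1))
    for start in range(0, k.shape[0], block):
        k_blk = k[start:start + block]
        v_blk = v[start:start + block]
        s = (q @ k_blk.T) * scale                 # shape: n x (block or fewer)
        new_max = np.maximum(row_max, s.max(axis=-1, keepdims=True))
        correction = np.exp(row_max - new_max)    # rescales earlier partial results
        p = np.exp(s - new_max)
        row_sum = row_sum * correction + p.sum(axis=-1, keepdims=True)
        out = out * correction + p @ v_blk
        row_max = new_max
    return out / row_sum

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q, k, v = (rng.standard_normal((50, 8)) for _ in range(3))
    print(np.allclose(standard_attention(q, k, v), tiled_attention(q, k, v)))
```

The largest intermediate in the tiled version is an N×block slab rather than an N×N matrix, which is the property the IO-aware argument relies on. On a GPU the point is that each slab can stay in SRAM; this sketch does not model that.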

The quantified results (what the open questions wanted)

Why it matters for the spoke

This is the primary, neutrally benchmarked source that the inference thesis leaned on qualitatively: it is a large part of why long contexts became practical, and the IO-aware framing is the lens for the whole prefill compute layer (prefill-decode-kv-cache). It partially closes two open questions: the memory math (attention memory is provably linear in sequence length) and the quantified benchmarks gap (real model and hardware numbers from a peer-reviewed source, not a tutorial). The closure is only partial because the paper benchmarks training; an inference-serving controlled study (the vLLM “which lever bought what” question) is still open. The maintained kernels (FA-2/3/4) in the flash-attention repo are the engineering descendants of this result.
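As a rough illustration of the memory math, the sketch below compares the bytes needed to store one head's N×N score matrix with the bytes needed for two per-row statistics. The element sizes (2 bytes for scores, 4 bytes for statistics), the choice of two statistics per row, and the function names are assumptions made for illustration, not figures from the paper. Q, K, V, and the output are left out because they are linear in N in both cases.

```python
def score_matrix_bytes(seq_len, bytes_per_elem=2):
    """Bytes for one head's full N x N score matrix (assumed 2-byte elements)."""
    return seq_len * seq_len * bytes_per_elem

def row_stats_bytes(seq_len, bytes_per_elem=4, stats_per_row=2):
    """Bytes for per-row softmax statistics (assumed: a max and a sum, 4 bytes each)."""
    return stats_per_row * seq_len * bytes_per_elem

if __name__ == "__main__":
    for n in (1024, 4096, 16384, 65536):
        quadratic = score_matrix_bytes(n)
        linear = row_stats_bytes(n)
        print(f"N={n:>6}  N x N scores: {quadratic / 2**20:10.1f} MiB  "
              f"row stats: {linear / 2**10:7.1f} KiB")
```

Under these assumptions, N = 65,536 gives 8 GiB for a single head's score matrix against 512 KiB of row statistics, which is the gap that linear scaling removes.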

Tier

T1: peer-reviewed primary source (NeurIPS 2022) with a formal IO-complexity proof and reproducible benchmarks. It is a distinct artifact from the implementation repo that the flash-attention page cites: the paper is the algorithm and its analysis; the repo is the evolving CUDA kernels. Figures here are from the paper’s abstract; the arXiv PDF body was not fully extracted.

flash-attention · prefill-decode-kv-cache · kv-cache · vllm · llm-inference