FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (Dao et al., 2022)
The primary research paper behind flash-attention — Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré, NeurIPS 2022 (arXiv:2205.14135). The flash-attention page is anchored to the maintained implementation repo; this page anchors the algorithm and its quantified analysis to the peer-reviewed origin, and supplies the hard numbers the spoke’s open questions (memory math, quantified neutral benchmarks) had been missing.
The thesis: attention is IO-bound, not FLOP-bound
The paper’s core argument is that prior attention work optimized FLOPs while the real bottleneck is memory traffic between GPU HBM (large, slow) and on-chip SRAM (small, fast). The missing principle is making attention IO-aware, that is, counting reads and writes across the memory hierarchy. FlashAttention uses tiling (and recomputation in the backward pass) to compute exact attention without ever materializing the full N×N attention matrix in HBM, and the authors prove that it makes fewer HBM accesses than standard attention and is optimal for a range of SRAM sizes. “Exact” means it is not an approximation: it computes the same attention function as the standard implementation, with outputs matching up to floating-point rounding, so the speedup does not come at a quality cost.
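The tiling idea can be illustrated with a minimal NumPy sketch. This is not the paper's CUDA kernel: it tiles only over keys/values (the real algorithm also tiles queries and keeps each tile in SRAM), it covers the forward pass only, and the function names and the `block` parameter are hypothetical choices for illustration. What it does show is the running ("online") softmax that lets each block be processed without ever holding the full N×N score matrix.

```python
import numpy as np

def naive_attention(q, k, v):
    """Reference implementation: materializes the full N x N score matrix."""
    scores = (q @ k.T) / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(scores)
    probs = probs / probs.sum(axis=-1, keepdims=True)
    return probs @ v

def tiled_attention(q, k, v, block=64):
    """Streams over K/V in blocks, keeping a running softmax per query row.

    The largest temporary is N x block, never N x N.
    """
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros((n, v.shape[1]))      # unnormalized output accumulator
    row_max = np.full((n, 1), -np.inf)   # running max of scores per row
    row_sum = np.zeros((n, 1))           # running softmax denominator per row

    for start in range(0, k.shape[0], block):
        kb = k[start:start + block]
        vb = v[start:start + block]
        s = (q @ kb.T) * scale           # scores for this block only
        new_max = np.maximum(row_max, s.max(axis=-1, keepdims=True))
        # Rescale what was accumulated under the old max (zero on first block).
        correction = np.exp(row_max - new_max)
        p = np.exp(s - new_max)
        row_sum = row_sum * correction + p.sum(axis=-1, keepdims=True)
        out = out * correction + p @ vb
        row_max = new_max

    return out / row_sum

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q, k, v = (rng.standard_normal((200, 32)) for _ in range(3))
    print(np.allclose(naive_attention(q, k, v), tiled_attention(q, k, v)))
```

The two functions agree up to floating-point rounding, which is the sense in which the tiled computation is exact rather than approximate.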
The quantified results (what the open questions wanted)
- Memory: linear, not quadratic in sequence length — the concrete answer to the synthesis’s “memory math” gap for the attention step (the kv-cache is the decode-step analog).
- End-to-end training speedups, with model and sequence-length specifics: 15% wall-clock speedup on BERT-large (seq 512) over the MLPerf 1.1 training speed record; 3× on GPT-2 (seq 1K); 2.4× on Long-Range Arena (seq 1K–4K).
- Quality from longer context, now affordable: 0.7 better (lower) perplexity on GPT-2; 6.4 points of lift on long-document classification; first Transformers to achieve better-than-chance accuracy on Path-X (seq 16K, 61.4%) and Path-256 (seq 64K, 63.1%).
- Block-sparse FlashAttention extends the idea to an approximate variant “faster than any existing approximate attention method.”
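To make the linear-versus-quadratic memory point concrete, here is a back-of-the-envelope sketch. The figures are illustrative arithmetic, not measurements from the paper. The assumptions are: sizes are per head and per batch element, the materialized score matrix is stored in fp16 (2 bytes), and the tiled approach keeps two fp32 softmax statistics per query row. The helper names are hypothetical. Q, K, V and the output are linear in N in both cases and are left out.

```python
def score_matrix_bytes(n, bytes_per_element=2):
    """Bytes to hold one N x N attention score matrix (assumed fp16)."""
    return n * n * bytes_per_element

def softmax_stats_bytes(n, stats_per_row=2, bytes_per_element=4):
    """Bytes for per-row softmax statistics (assumed: 2 fp32 values per row)."""
    return n * stats_per_row * bytes_per_element

if __name__ == "__main__":
    for n in (1024, 16384, 65536):
        quadratic_mib = score_matrix_bytes(n) / 2**20
        linear_kib = softmax_stats_bytes(n) / 2**10
        print(f"N={n:>6}: N x N matrix {quadratic_mib:>8.1f} MiB, "
              f"row statistics {linear_kib:>6.1f} KiB")
```

Under these assumptions, the score matrix at N = 16K takes 512 MiB per head while the row statistics take 128 KiB. Doubling N quadruples the former and only doubles the latter.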
Why it matters for the spoke
This is the primary, neutrally benchmarked source the inference thesis leaned on qualitatively: it is why long contexts became practical, and the IO-aware framing is the lens for the whole prefill compute layer (prefill-decode-kv-cache). It partially closes two open questions: the memory math (attention memory is provably linear) and the quantified benchmarks gap (real model/hardware numbers from a peer-reviewed source, not a tutorial). The paper benchmarks training, however, so an inference-serving controlled study (the vLLM “which lever bought what” question) is still open. The maintained kernels (FA-2/3/4) on the flash-attention repo are the engineering descendants of this result.
Tier
T1: peer-reviewed primary (NeurIPS 2022) with a formal IO-complexity proof and reproducible benchmarks. This is a distinct artifact from the implementation repo that the flash-attention page cites: paper = the algorithm + analysis; repo = the evolving CUDA kernels. Figures here are from the paper’s abstract; the arXiv PDF body was not fully extracted.
Related
flash-attention · prefill-decode-kv-cache · kv-cache · vllm · llm-inference