FlashAttention
FlashAttention (Tri Dao et al.) is a “fast and memory-efficient exact attention” algorithm: it computes the same attention output as the standard implementation but reorganizes the computation to be IO-aware, cutting traffic to slow GPU memory (HBM). It sits at the attention-compute core of the prefill/decode pipeline. Source: official repo; the primary research paper (Dao et al., NeurIPS 2022) is paged separately as flash-attention-paper, which covers the IO-complexity proof and quantified speedups.
Core idea — IO-awareness
The bottleneck in standard attention is not FLOPs but reads and writes to GPU HBM, which is large but slow relative to on-chip SRAM. FlashAttention uses tiling and kernel fusion so that the full N×N attention matrix is never materialized in HBM:
- exact, not approximate;
- linear (not quadratic) memory in sequence length — “10× savings at 2K, 20× at 4K”;
- 2–4× speedups on attention.
This is why long contexts became practical. FlashAttention is complementary to the kv-cache: the kv-cache avoids recomputing keys and values for earlier tokens during decode, while FlashAttention speeds up the attention computation itself.
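The tiling idea can be illustrated with a minimal pure-Python sketch. It is not the FlashAttention kernel: the real implementation is a fused GPU kernel that also tiles over queries and keeps blocks in on-chip SRAM. The sketch only shows the online-softmax rescaling that lets keys and values be streamed block by block without ever holding a full row of scores, and it checks that the result matches the standard computation. The function names, `block_size`, and the toy sizes are illustrative assumptions, not names from the repo.

```python
import math
import random


def naive_attention(Q, K, V):
    """Reference: builds a full row of N scores per query before the softmax."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in K]
        m = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * v[j] for w, v in zip(weights, V)) / total
            for j in range(len(V[0]))
        ])
    return out


def tiled_attention(Q, K, V, block_size=4):
    """Streams K/V in blocks, keeping only a running max, a running sum,
    and an output accumulator per query (online softmax)."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        running_max = -math.inf
        running_sum = 0.0
        acc = [0.0] * len(V[0])
        for start in range(0, len(K), block_size):
            k_blk = K[start:start + block_size]
            v_blk = V[start:start + block_size]
            scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in k_blk]
            new_max = max(running_max, max(scores))
            # Rescale everything accumulated under the old max to the new max.
            correction = math.exp(running_max - new_max)
            weights = [math.exp(s - new_max) for s in scores]
            running_sum = running_sum * correction + sum(weights)
            acc = [
                a * correction + sum(w * v[j] for w, v in zip(weights, v_blk))
                for j, a in enumerate(acc)
            ]
            running_max = new_max
        out.append([a / running_sum for a in acc])
    return out


# Toy check: N = 10 is deliberately not a multiple of block_size.
random.seed(0)
N, D = 10, 8
Q = [[random.gauss(0, 1) for _ in range(D)] for _ in range(N)]
K = [[random.gauss(0, 1) for _ in range(D)] for _ in range(N)]
V = [[random.gauss(0, 1) for _ in range(D)] for _ in range(N)]

ref = naive_attention(Q, K, V)
tiled = tiled_attention(Q, K, V, block_size=4)
max_diff = max(abs(a - b) for r, t in zip(ref, tiled) for a, b in zip(r, t))
print("max abs difference:", max_diff)
```

The two results agree up to floating-point rounding, which is the sense in which the method is exact rather than approximate. For scale: at N = 4096 in fp16, one head's N×N score matrix is 4096 × 4096 × 2 bytes = 32 MiB, which is the intermediate that the tiled formulation avoids writing to HBM.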
Versions
FA-2 (Ampere/Ada/Hopper), FA-3 (H100, FP8 forward), and FA-4 (CuTeDSL; Hopper + Blackwell), plus ROCm/AMD backends. The implementation is a moving target that tracks new GPU generations.
Related
flash-attention-paper · prefill-decode-kv-cache · kv-cache · llm-inference · vllm