Defined Term · mechanism · updated 2026-06-09

FlashAttention

FlashAttention (Tri Dao et al.) is a “fast and memory-efficient exact attention” algorithm: it computes the same attention output as the standard implementation, but reorganizes the computation to be IO-aware, sharply reducing traffic to slow memory. It sits at the attention-compute core of the prefill/decode pipeline. Source: official repo. The primary research paper (Dao et al., NeurIPS 2022) is paged separately as flash-attention-paper, with the IO-complexity proof and quantified speedups.

Core idea — IO-awareness

The bottleneck in attention is not FLOPs but reads and writes to GPU HBM, which is slow relative to on-chip SRAM. FlashAttention uses tiling and kernel fusion so that the full N×N attention matrix is never materialized in HBM:

exact, not approximate;
linear (not quadratic) memory in sequence length, so the savings grow with sequence length: “10× savings at 2K, 20× at 4K”;
2–4× speedups on attention.

This is why long contexts became practical. FlashAttention is complementary to the kv-cache: the kv-cache optimizes the decode step by reusing stored keys and values, while FlashAttention optimizes the attention math itself.
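The tiling idea can be sketched in a few lines of plain Python. This is a minimal CPU illustration, not the repo's GPU kernels: the names `standard_attention`, `tiled_attention`, and `block` are hypothetical, and for brevity the sketch tiles only over keys and values for one query row at a time, whereas a real kernel also tiles the queries and fuses the steps so each block stays in on-chip memory. It assumes the usual definition softmax(QKᵀ/√d)·V and shows that a running max, a running denominator, and an unnormalized accumulator are enough to get the same result without ever holding a full row of the N×N score matrix.

```python
import math
import random


def standard_attention(Q, K, V):
    """Reference version: builds a full row of N scores for every query."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) / total
                    for j in range(len(V[0]))])
    return out


def tiled_attention(Q, K, V, block=4):
    """Same output, but K and V are visited one block at a time.

    Per query row we keep only: a running max m, a running softmax
    denominator l, and an unnormalized output accumulator acc.
    """
    scale = 1.0 / math.sqrt(len(Q[0]))
    dv = len(V[0])
    out = []
    for q in Q:
        m = float("-inf")
        l = 0.0
        acc = [0.0] * dv
        for start in range(0, len(K), block):
            k_blk = K[start:start + block]
            v_blk = V[start:start + block]
            # Scores for this block only: at most `block` values, never N.
            scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
            m_new = max(m, max(scores))
            # Rescale earlier partial sums to the new max (0.0 on the first block).
            correction = math.exp(m - m_new)
            p = [math.exp(s - m_new) for s in scores]
            l = l * correction + sum(p)
            acc = [a * correction + sum(pi * v[j] for pi, v in zip(p, v_blk))
                   for j, a in enumerate(acc)]
            m = m_new
        out.append([a / l for a in acc])
    return out


random.seed(0)
N, d = 10, 8  # N is deliberately not a multiple of the block size
Q = [[random.gauss(0, 1) for _ in range(d)] for _ in range(N)]
K = [[random.gauss(0, 1) for _ in range(d)] for _ in range(N)]
V = [[random.gauss(0, 1) for _ in range(d)] for _ in range(N)]

ref = standard_attention(Q, K, V)
tiled = tiled_attention(Q, K, V, block=4)
max_diff = max(abs(a - b) for r, t in zip(ref, tiled) for a, b in zip(r, t))
print("max abs difference:", max_diff)
```

The two functions should agree up to floating-point rounding for any block size, which is the sense in which the method is exact rather than approximate. The extra state per query row is a constant number of scalars plus one output vector, which is where the linear memory footprint comes from.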

Versions

FA-2 (Ampere/Ada/Hopper), FA-3 (H100, FP8 forward), FA-4 (CuTeDSL; Hopper + Blackwell), plus ROCm/AMD backends. The project is a moving target that tracks new GPU generations.

Related

flash-attention-paper · prefill-decode-kv-cache · kv-cache · llm-inference · vllm
