Quantization (for inference)
Quantization is the gap the synthesis explicitly flagged as “still absent”: storing a model’s weights (and sometimes activations) in lower precision (int8, int4, even 1–2 bit) instead of fp32/fp16/bf16, to shrink memory and speed up inference while preserving as much accuracy as possible. It is the fourth production lever alongside PagedAttention, flash-attention, and speculative-decoding, but it acts on the data type rather than the algorithm.
Why it matters for the inference pipeline
- Memory footprint. Weights at int4 are ~¼ the size of fp16 (before the small overhead of stored scales), which can be the difference between a model fitting on one GPU (or a laptop) or not.
- The kv-cache connection. Quantization also applies to the KV cache itself; since cache size caps batch size and context length (continuous-batching), a smaller-precision cache directly relaxes the serving bottleneck the wiki’s thesis centers on.
- Bandwidth/throughput. Decode is memory-bandwidth-bound (moving weights + cache per token); fewer bytes per parameter means more tokens/sec, especially on consumer hardware.
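The three points above reduce to simple byte counting, sketched below in Python. This is a back-of-the-envelope estimate, not a measurement: the 7B parameter count, the layer/head shape, and the 100 GB/s bandwidth figure are all hypothetical, and the weight sizes ignore the extra bytes real formats spend on scales and zero-points.

```python
def weight_bytes(n_params, bits_per_weight):
    """Approximate bytes to store n_params weights (ignores scale/zero-point overhead)."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bits):
    """Approximate KV-cache size: one K and one V vector per layer, KV head, and token."""
    elements = 2 * n_layers * n_kv_heads * head_dim * seq_len * batch
    return elements * bits / 8


def decode_tokens_per_sec_ceiling(bandwidth_bytes_per_s, bytes_read_per_token):
    """Rough upper bound when decode is purely memory-bandwidth-bound."""
    return bandwidth_bytes_per_s / bytes_read_per_token


GIB = 1024 ** 3

# Hypothetical 7B-parameter model; not taken from any real model config.
n_params = 7_000_000_000
for name, bits in [("fp16", 16), ("int8", 8), ("int4", 4)]:
    size = weight_bytes(n_params, bits)
    # Assumed 100 GB/s memory bandwidth, counting only the weight reads per token.
    ceiling = decode_tokens_per_sec_ceiling(100e9, size)
    print(f"{name}: weights {size / GIB:.1f} GiB, decode ceiling ~{ceiling:.0f} tok/s")

# Hypothetical shape: 32 layers, 32 KV heads of dim 128, 4096-token context, batch of 8.
for name, bits in [("fp16", 16), ("int8", 8)]:
    cache = kv_cache_bytes(32, 32, 128, 4096, 8, bits)
    print(f"{name} KV cache: {cache / GIB:.1f} GiB")
```

Halving the bits per element halves both the weight footprint and the cache footprint, and under the bandwidth-bound assumption it doubles the throughput ceiling. Real throughput also depends on dequantization cost and kernel quality, which this sketch does not model.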
How it’s done
- Weight-only (most common for inference): quantize weights, compute in higher precision — big memory win, minimal quality loss.
- Weight + activation (e.g. FP8, int8): quantize both for more speed, harder to keep accurate.
- Post-training methods (no retraining): GPTQ (2/3/4/8-bit, calibration-based), AWQ (activation-aware, 4-bit), bitsandbytes (on-the-fly 4/8-bit), GGUF/GGML (llama-cpp’s 2–8-bit format), plus FP8 paths (FBGEMM, torchao) and extreme 1–2-bit schemes (AQLM, VPTQ) that need calibration.
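To make the weight-only idea concrete, here is a minimal sketch of symmetric absmax quantization with one scale per group of weights, followed by dequantization back to floats for compute. It is a generic round-to-nearest illustration, not the algorithm used by GPTQ, AWQ, bitsandbytes, or GGUF; the function names, the group size of 32, and the Gaussian toy weights are assumptions made for the example.

```python
import random


def quantize_group(weights, bits):
    """Symmetric absmax quantization of one group to signed integer codes."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8-bit, 7 for 4-bit, 1 for 2-bit
    absmax = max(abs(w) for w in weights)
    scale = absmax / qmax if absmax > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_group(codes, scale):
    """Recover approximate float weights; compute then runs in higher precision."""
    return [c * scale for c in codes]


def roundtrip_error(weights, bits, group_size=32):
    """Mean absolute error after quantizing and dequantizing group by group."""
    total = 0.0
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        codes, scale = quantize_group(group, bits)
        restored = dequantize_group(codes, scale)
        total += sum(abs(a - b) for a, b in zip(group, restored))
    return total / len(weights)


# Toy weights drawn from a Gaussian; real weight distributions differ.
random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(4096)]
for bits in (8, 4, 2):
    print(f"{bits}-bit mean abs error: {roundtrip_error(weights, bits):.6f}")
```

Only the integer codes and one scale per group need to be stored, which is where the memory saving comes from. The round-trip error grows as the bit width shrinks, and closing that gap at very low bit widths is what the calibration-based methods listed above are designed to do.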
The tradeoff
Lower precision means smaller and faster but lossier: aggressive (1–2 bit) quantization needs careful calibration to avoid accuracy collapse, while 8-bit and weight-only 4-bit often run “out of the box” with little degradation. This is the knob that makes the founding pipeline’s economics (prefill/decode, KV-cache memory) tractable on real hardware, and it is exactly what llama-cpp exploits to run large open-weight models on a laptop.
Boundary (cross-wiki) — deliberate dual-lens, not a duplicate
This page is the inference-mechanics lens: data-type lever, KV-cache interaction, methods
(GPTQ/AWQ/bitsandbytes/GGUF), bandwidth/throughput. The market / deployability lens — quantization
as a competitive footprint axis (who can run what where; QAT vs PTQ; Gemma 4’s sub-1GB open models) —
is paged separately as quantization in llm-providers-wiki, which defers the mechanics here. Same
technique, two spoke-specific lenses by design (the hub pages concepts per-spoke; only entities are
canonicalized to one node) — cross-referenced, not merged.
Related
kv-cache · llama-cpp · vllm · continuous-batching · flash-attention · llm-inference