Defined Term · mechanism · updated Wed Jun 10 2026

Quantization (for inference)

Quantization stores a model’s weights (and sometimes its activations) in lower precision (int8, int4, even 1–2 bit) instead of fp32/fp16/bf16, to shrink memory and speed up inference while preserving as much accuracy as possible. This is the gap the synthesis explicitly flagged as “still absent”. It is the fourth production lever alongside PagedAttention, flash-attention, and speculative-decoding, but it acts on the data type rather than the algorithm.

Why it matters for the inference pipeline

Memory footprint. Weights at int4 take about ¼ the space of fp16 (slightly more once per-group scales are stored), which can be the difference between a model fitting on one GPU (or a laptop) or not.
The kv-cache connection. Quantization also applies to the KV cache itself. Since cache size caps batch size and context length (continuous-batching), a lower-precision cache directly relaxes the serving bottleneck the wiki’s thesis centers on.
Bandwidth/throughput. Decode is memory-bandwidth-bound: every generated token requires reading the weights and the cache, so fewer bytes per parameter means more tokens/sec, especially on consumer hardware.
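The three effects above reduce to simple arithmetic. The sketch below is a back-of-the-envelope estimate, not a measurement: the 7B parameter count, the layer and head shapes, and the 100 GB/s bandwidth figure are all hypothetical values chosen for illustration, and the weight estimate ignores the small overhead of quantization scales.

```python
GIB = 1024 ** 3


def weight_bytes(n_params, bits):
    """Bytes to store n_params weights at a given bit width (scale overhead ignored)."""
    return n_params * bits / 8


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bits):
    """Bytes for keys plus values (the factor 2) across all layers and sequences."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bits / 8


def decode_tokens_per_sec_bound(bytes_read_per_token, bandwidth_bytes_per_sec):
    """Upper bound on decode speed if every token must stream these bytes once."""
    return bandwidth_bytes_per_sec / bytes_read_per_token


# Hypothetical model and hardware, for illustration only.
N_PARAMS = 7_000_000_000
BANDWIDTH = 100e9  # bytes/sec, assumed

for name, bits in [("fp16", 16), ("int8", 8), ("int4", 4)]:
    w = weight_bytes(N_PARAMS, bits)
    kv = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128,
                        seq_len=4096, batch=1, bits=bits)
    bound = decode_tokens_per_sec_bound(w + kv, BANDWIDTH)
    print(f"{name}: weights {w / GIB:.2f} GiB, "
          f"KV cache {kv / GIB:.2f} GiB, "
          f"decode bound {bound:.1f} tok/s")
```

The bound treats decode as purely bandwidth-limited, which is the regime the text describes; real throughput also depends on kernels, batching, and dequantization cost.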

How it’s done

Weight-only (most common for inference): quantize weights, compute in higher precision — big memory win, minimal quality loss.
Weight + activation (e.g. FP8, int8): quantize both for more speed, but accuracy is harder to preserve.
Post-training methods (no retraining): GPTQ (2/3/4/8-bit, calibration-based), AWQ (activation-aware, 4-bit), bitsandbytes (on-the-fly 4/8-bit), GGUF/GGML (llama-cpp’s 2–8-bit format), plus FP8 paths (FBGEMM, torchao) and extreme 1–2-bit schemes (AQLM, VPTQ) that need calibration.
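To make the weight-only idea concrete, here is a minimal NumPy sketch of symmetric round-to-nearest quantization with one scale per group of weights. It is deliberately the simplest possible scheme and is not GPTQ, AWQ, or any library's actual implementation; the function names and the group size of 64 are assumptions for illustration. Real int4 kernels also pack two values per byte, whereas this sketch keeps each value in an int8 for readability.

```python
import numpy as np


def quantize_weights(w, bits=4, group_size=64):
    """Symmetric round-to-nearest quantization, one float scale per group."""
    qmax = 2 ** (bits - 1) - 1                # 7 for int4, 127 for int8
    groups = w.reshape(-1, group_size)        # needs w.size % group_size == 0
    scale = np.abs(groups).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # guard all-zero groups
    q = np.clip(np.round(groups / scale), -qmax, qmax).astype(np.int8)
    return q, scale.astype(np.float32)


def dequantize_weights(q, scale, shape):
    """Recover approximate float weights so compute runs in higher precision."""
    return (q.astype(np.float32) * scale).reshape(shape)


rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)  # stand-in weight matrix
x = rng.standard_normal((8, 256)).astype(np.float32)    # stand-in activations

q, scale = quantize_weights(w, bits=4, group_size=64)
w_hat = dequantize_weights(q, scale, w.shape)

# Weight-only: weights are stored quantized, the matmul itself stays in float.
y_ref = x @ w.T
y_q = x @ w_hat.T
rel_err = np.linalg.norm(y_q - y_ref) / np.linalg.norm(y_ref)
print(f"relative output error at 4 bits: {rel_err:.3f}")
```

Calibration-based methods improve on this baseline by choosing the rounding or scaling with reference to sample activations rather than the weights alone.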

The tradeoff

Lower precision is smaller and faster but lossier: aggressive (1–2 bit) quantization needs careful calibration to avoid accuracy collapse, while 8-bit and weight-only 4-bit often run “out of the box” with little degradation. This is the knob that makes the founding pipeline’s economics (prefill/decode, KV-cache memory) tractable on real hardware, and it is exactly what llama-cpp exploits to run frontier-class models on a laptop.
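The shape of this tradeoff can be seen with a small experiment. The sketch below applies naive round-to-nearest quantization (no calibration) to random Gaussian weights at several bit widths and reports the relative reconstruction error. The helper name, the group size, and the Gaussian stand-in weights are assumptions; real model weights and calibrated methods such as GPTQ or AQLM will behave differently, so treat the numbers as illustrative only.

```python
import numpy as np


def rtn_roundtrip(w, bits, group_size=64):
    """Quantize with symmetric round-to-nearest, then dequantize (bits >= 2)."""
    qmax = 2 ** (bits - 1) - 1
    groups = w.reshape(-1, group_size)
    scale = np.abs(groups).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # guard all-zero groups
    q = np.clip(np.round(groups / scale), -qmax, qmax)
    return (q * scale).reshape(w.shape)


rng = np.random.default_rng(0)
w = rng.standard_normal((512, 512)).astype(np.float32)  # stand-in weights

errors = {}
for bits in (8, 4, 3, 2):
    w_hat = rtn_roundtrip(w, bits)
    errors[bits] = float(np.linalg.norm(w_hat - w) / np.linalg.norm(w))
    print(f"{bits}-bit: relative weight error {errors[bits]:.4f}")
```

Error grows sharply as the bit width falls, which is why the lowest-bit schemes rely on calibration rather than plain rounding.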

Boundary (cross-wiki) — deliberate dual-lens, not a duplicate

This page is the inference-mechanics lens: data-type lever, KV-cache interaction, methods (GPTQ/AWQ/bitsandbytes/GGUF), bandwidth/throughput. The market / deployability lens — quantization as a competitive footprint axis (who can run what where; QAT vs PTQ; Gemma 4’s sub-1GB open models) — is paged separately as quantization in llm-providers-wiki, which defers the mechanics here. Same technique, two spoke-specific lenses by design (the hub pages concepts per-spoke; only entities are canonicalized to one node) — cross-referenced, not merged.

kv-cache · llama-cpp · vllm · continuous-batching · flash-attention · llm-inference
