
llm-inference-wiki

Defined Term · mechanism · updated 2026-06-10

Quantization (for inference)

The explicitly named gap the synthesis flagged as “still absent”: storing a model’s weights (and sometimes activations) in lower precision (int8, int4, even 1–2 bit) instead of fp32/fp16/bf16, to shrink memory and speed up inference while preserving as much accuracy as possible. It is the fourth production lever alongside PagedAttention, flash-attention, and speculative-decoding, but it acts on the data type rather than the algorithm.
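To make the memory claim concrete, here is a rough back-of-the-envelope sketch. The 7-billion-parameter count and the helper name `weight_bytes` are illustrative assumptions, not figures from this page, and the calculation counts raw weights only: real quantized formats also store per-group scales and zero-points, so actual files are somewhat larger.

```python
def weight_bytes(n_params: int, bits_per_weight: int) -> int:
    """Bytes for the raw weights alone (no scales, zero-points, or KV cache)."""
    return n_params * bits_per_weight // 8

# Hypothetical model size, chosen only for illustration.
N_PARAMS = 7_000_000_000

footprint = {
    name: weight_bytes(N_PARAMS, bits)
    for name, bits in [
        ("fp32", 32),
        ("fp16/bf16", 16),
        ("int8", 8),
        ("int4", 4),
        ("2-bit", 2),
    ]
}

for name, size in footprint.items():
    print(f"{name:>9}: {size / 2**30:6.2f} GiB")
```

The footprint scales linearly with bits per weight, so moving from fp16 to int4 cuts the weight memory by roughly a factor of four before format overhead.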

Why it matters for the inference pipeline

How it’s done

The tradeoff

Lower precision is smaller and faster but lossier: aggressive (1–2 bit) quantization needs careful calibration to avoid accuracy collapse, while 8-bit and weight-only 4-bit often run “out of the box” with little degradation. This is the knob that makes the founding pipeline’s economics (prefill/decode, KV-cache memory) tractable on real hardware, and it is exactly what llama-cpp exploits to run frontier-class models on a laptop.
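The loss side of the tradeoff can be illustrated with a minimal sketch of naive symmetric (absmax) round-to-nearest quantization on synthetic weights. This is an assumption-laden toy, not how GPTQ, AWQ, bitsandbytes, or GGUF actually quantize: those methods use per-group scales and calibration precisely to do better than this. The function names and the Gaussian weight distribution are hypothetical choices for the demonstration.

```python
import random

def quantize_absmax(values, bits):
    """Naive symmetric per-tensor quantization to signed integers."""
    qmax = 2 ** (bits - 1) - 1            # 127 for 8-bit, 7 for 4-bit, 1 for 2-bit
    absmax = max(abs(v) for v in values)
    scale = absmax / qmax if absmax > 0 else 1.0
    # Round to the nearest integer level and clamp to the representable range.
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Map integer levels back to approximate float values."""
    return [qi * scale for qi in q]

def mean_abs_error(values, bits):
    """Average round-trip error after quantizing and dequantizing."""
    q, scale = quantize_absmax(values, bits)
    restored = dequantize(q, scale)
    return sum(abs(a - b) for a, b in zip(values, restored)) / len(values)

# Synthetic "weights": an assumed zero-mean Gaussian, for illustration only.
random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(4096)]

errors = {bits: mean_abs_error(weights, bits) for bits in (8, 4, 2)}
for bits, err in errors.items():
    print(f"{bits}-bit mean abs error: {err:.6f}")
```

The round-trip error grows as the bit width shrinks, because fewer integer levels must cover the same value range. At 2 bits this naive scheme has only three levels, which hints at why very low bit widths need more careful treatment than simple rounding.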

Boundary (cross-wiki) — deliberate dual-lens, not a duplicate

This page is the inference-mechanics lens: data-type lever, KV-cache interaction, methods (GPTQ/AWQ/bitsandbytes/GGUF), bandwidth/throughput. The market / deployability lens, which treats quantization as a competitive footprint axis (who can run what where; QAT vs PTQ; Gemma 4’s sub-1GB open models), is paged separately as quantization in llm-providers-wiki, which defers the mechanics here. Same technique, two spoke-specific lenses by design (the hub pages concepts per-spoke; only entities are canonicalized to one node): cross-referenced, not merged.

kv-cache · llama-cpp · vllm · continuous-batching · flash-attention · llm-inference