Quantization
Storing and running a model’s weights (and sometimes activations) at lower numeric precision — int8, int4, even 2-bit — instead of 16-bit floats, to shrink its memory footprint and speed inference. In this wiki it matters as a market lever: quantization is a primary reason capable open-weight-models now run on consumer and edge hardware, making footprint a competitive axis alongside capability, cost, and context (synthesis).
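The footprint effect is simple arithmetic: weight storage scales linearly with bits per weight. A minimal sketch, assuming a hypothetical 2-billion-parameter model (the count is illustrative and not taken from any model card) and counting weights only, with no quantization scales, activations, or KV cache:

```python
def weight_footprint_gb(param_count, bits_per_weight):
    """Weights-only size in GB (1 GB = 1e9 bytes).

    Ignores per-block scales, activations, and the KV cache, so real
    files and runtime memory will be somewhat larger.
    """
    return param_count * bits_per_weight / 8 / 1e9


# Hypothetical parameter count, chosen only to make the arithmetic easy to read.
PARAMS = 2_000_000_000

for bits in (16, 8, 4, 2):
    print(f"{bits:>2}-bit: {weight_footprint_gb(PARAMS, bits):.2f} GB")
```

Halving the bit-width halves the weight footprint, which is why each step down in precision widens the set of devices that can hold the model.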
QAT vs PTQ
- PTQ (post-training quantization): quantize an already-trained model. Simple and needs no retraining, but quality degrades as bit-width drops because rounding error grows on the coarser grid.
- QAT (quantization-aware training): fold quantization into training so the model learns to tolerate low precision. google’s Gemma 4 QAT release (gemma-4-qat) claims QAT yields higher quality than PTQ baselines at the same bit-width, i.e. less quality lost per bit.
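To make the PTQ trade-off concrete, here is a minimal sketch of plain round-to-nearest symmetric quantization applied to synthetic weights. It is a generic textbook scheme used as a stand-in, not the method from gemma-4-qat and not the Q4_0 file format; the function names and the Gaussian weight distribution are assumptions made for illustration.

```python
import random


def quantize_symmetric(weights, bits):
    """Round-to-nearest symmetric quantization with one scale for all weights."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8-bit, 7 for 4-bit, 1 for 2-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Integer codes clamped to the symmetric range [-qmax, qmax].
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]


def mean_squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Synthetic stand-in for trained weights (assumption: roughly Gaussian).
random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(4096)]

errors = {}
for bits in (8, 4, 2):
    codes, scale = quantize_symmetric(weights, bits)
    errors[bits] = mean_squared_error(weights, dequantize(codes, scale))
    print(f"{bits}-bit reconstruction MSE: {errors[bits]:.3e}")
```

The reconstruction error rises sharply as the bit-width falls. QAT targets exactly this gap: by simulating the rounding during training, the model can adapt its weights to the grid instead of being rounded after the fact.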
Levels seen in the wild
- Q4_0 / int4: the standard ~4-bit workhorse (gemma-4-qat).
- Targeted 2-bit: Gemma 4’s mobile scheme quantizes token-generation layers to 2-bit while keeping reasoning layers at higher precision (mixed precision), and adds channel-wise quantization and static activation scaling (gemma-4-qat).
- Footprint payoff: a gemma-4 E2B text model fits under 1 GB, i.e. a sub-gigabyte LLM.
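Channel-wise quantization, mentioned above, gives each output channel its own scale instead of sharing one scale across the whole tensor. The sketch below compares the two on a toy matrix with signed 4-bit codes. It is a generic illustration, not Gemma 4's actual implementation; the matrix values and helper names are hypothetical.

```python
def scale_for(values, qmax):
    """Symmetric scale that maps the largest magnitude onto qmax."""
    max_abs = max(abs(v) for v in values)
    return max_abs / qmax if max_abs > 0 else 1.0


def fake_quantize(values, scale, qmax):
    """Quantize to integer codes in [-qmax, qmax], then map back to floats."""
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def squared_error(original, approx):
    return sum((x - y) ** 2 for x, y in zip(original, approx))


# Toy weight matrix: one row per output channel, with very different magnitudes.
matrix = [
    [0.010, -0.020, 0.015, -0.005],  # small-magnitude channel
    [1.000, -0.800, 0.600, -0.900],  # large-magnitude channel
]
QMAX = 7  # signed 4-bit range

# Per-tensor: one scale shared by every row, dominated by the large channel.
tensor_scale = scale_for([w for row in matrix for w in row], QMAX)
per_tensor = [fake_quantize(row, tensor_scale, QMAX) for row in matrix]

# Channel-wise: each row gets its own scale.
per_channel = [fake_quantize(row, scale_for(row, QMAX), QMAX) for row in matrix]

small_row_err_tensor = squared_error(matrix[0], per_tensor[0])
small_row_err_channel = squared_error(matrix[0], per_channel[0])
print("small channel, per-tensor error:  ", small_row_err_tensor)
print("small channel, channel-wise error:", small_row_err_channel)
```

With a shared scale, the small-magnitude channel collapses entirely to zero; with its own scale it keeps most of its information. The cost is one extra scale value per channel, which is small next to the weights themselves.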
Boundary (cross-wiki)
This page covers quantization as a deployability / market lever (who can run what, where). The deeper inference mechanics of low-precision execution (kernels, dequantization, throughput) belong to llm-inference-wiki (llm-inference), alongside the kv-cache / MoE efficiency story. Distribution formats (GGUF, compressed tensors) ride along with that inference coverage.
Related
gemma-4-qat · gemma-4 · open-weight-models · google · llm-inference