Defined Term mechanism updated Fri Jun 05 2026 00:00:00 GMT+0000 (Coordinated Universal Time)

Quantization

Storing and running a model’s weights (and sometimes activations) at lower numeric precision — int8, int4, even 2-bit — instead of 16-bit floats, to shrink its memory footprint and speed inference. In this wiki it matters as a market lever: quantization is a primary reason capable open-weight-models now run on consumer and edge hardware, making footprint a competitive axis alongside capability, cost, and context (synthesis).

QAT vs PTQ

PTQ (post-training quantization): quantize an already-trained model. Simple, but loses quality as bit-width drops.
QAT (quantization-aware training): fold quantization into training so the model learns to tolerate low precision. google‘s Gemma 4 QAT release gemma-4-qat claims QAT yields higher quality than PTQ baselines at the same bit-width — less quality lost per bit.

Levels seen in the wild

Q4_0 / int4 — the standard ~4-bit workhorse (gemma-4-qat).
Targeted 2-bit — Gemma 4’s mobile scheme quantizes token-generation layers to 2-bit while keeping reasoning layers higher-precision (mixed precision), plus channel-wise quantization and static activation scaling gemma-4-qat.
Footprint payoff: a gemma-4 E2B text model under 1 GB — sub-gigabyte LLMs.

Boundary (cross-wiki)

This page covers quantization as a deployability / market lever (who can run what, where). The deeper inference mechanics of low-precision execution — kernels, dequant, throughput — belong to llm-inference-wiki (llm-inference), alongside the kv-cache / MoE efficiency story. Distribution formats (GGUF, compressed tensors) ride along with it.

gemma-4-qat · gemma-4 · open-weight-models · google · llm-inference

Quantization

QAT vs PTQ

Levels seen in the wild

Boundary (cross-wiki)

Related

Linked from