Gemma 4 with Quantization-Aware Training (Google)
Google’s announcement of QAT-optimized checkpoints for the gemma-4 family: the next move in the family’s footprint-first strategy, pushing open weights (open-weight-models) into ever-smaller memory envelopes via quantization.
What’s announced
- QAT checkpoints released for E2B (edge 2B), E4B (edge 4B), and the 26B MoE variants of gemma-4.
- Two quantization schemes:
  - Q4_0: standard 4-bit quantization across all models.
  - Mobile-specialized format: targeted 2-bit quantization on token-generation layers while keeping reasoning components at higher precision, plus static activation scaling and channel-wise quantization.
- QAT vs PTQ: because quantization is folded into training (not applied post-hoc), Google claims it “yield[s] even higher overall quality compared to standard PTQ baselines” — i.e. less quality loss per bit than post-training quantization.
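To make the QAT vs PTQ distinction concrete, here is a minimal sketch of a symmetric 4-bit block quantizer in the spirit of Q4_0. It is a simplification, not the exact llama.cpp bit layout and not Google's training recipe: the function names and the `amax / 7` scale rule are assumptions chosen for illustration. PTQ would apply the round trip once to finished weights; QAT runs the same round trip ("fake quantization") inside the training forward pass so the model learns weights that survive it.

```python
BLOCK_SIZE = 32  # Q4_0 groups weights into blocks of 32 that share one scale


def quantize_block(block):
    """Map one block of floats to 4-bit integer codes in [-8, 7] plus one scale.

    Simplified symmetric rule (an assumption for this sketch): the largest
    magnitude in the block maps to code +/-7.
    """
    amax = max(abs(x) for x in block)
    scale = amax / 7.0 if amax > 0 else 1.0
    codes = [max(-8, min(7, round(x / scale))) for x in block]
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate floats from the codes and the shared scale."""
    return [scale * c for c in codes]


def fake_quant(weights):
    """Quantize then immediately dequantize, block by block.

    This is the rounding noise a QAT forward pass would expose the model to;
    a PTQ pipeline would run it once, after training.
    """
    out = []
    for i in range(0, len(weights), BLOCK_SIZE):
        scale, codes = quantize_block(weights[i:i + BLOCK_SIZE])
        out.extend(dequantize_block(scale, codes))
    return out
```

Within each block the round-trip error is at most half the block's scale, which is why one outlier weight degrades the precision of its 31 neighbours. In a real QAT setup the gradient is typically passed through the rounding step unchanged (a straight-through estimator); that part is omitted here.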
Numbers
- The E2B text-only model (without Per-Layer Embeddings) reportedly requires < 1 GB of memory — sub-gigabyte LLM deployment.
- (A VRAM comparison chart is referenced for the other sizes; specific figures not in the text.)
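A rough way to sanity-check footprint claims like these is weight memory ≈ parameters × bits per weight / 8. The sketch below assumes the Q4_0 storage layout of 32 four-bit codes plus one 16-bit scale per block; the helper names and the 1-billion-parameter example are hypothetical and are not Gemma figures. It counts weights only and ignores KV cache, activations, and runtime overhead.

```python
def bits_per_weight(code_bits, block_size, scale_bits=16):
    """Effective bits per weight once the per-block scale is amortized."""
    return code_bits + scale_bits / block_size


def weight_memory_gb(num_params, bpw):
    """Weight storage in GB (10^9 bytes) for num_params weights at bpw bits each."""
    return num_params * bpw / 8 / 1e9


# Q4_0-style blocks: 4-bit codes, 32 weights per block, one 16-bit scale.
q4_0_bpw = bits_per_weight(code_bits=4, block_size=32)

# Hypothetical 1B-parameter model, NOT a published Gemma parameter count.
example_params = 1_000_000_000
q4_gb = weight_memory_gb(example_params, q4_0_bpw)
bf16_gb = weight_memory_gb(example_params, 16)
```

Under these assumptions `q4_0_bpw` comes out at 4.5 bits, so the hypothetical model drops from 2.0 GB in 16-bit to about 0.56 GB, which shows why mixing in 2-bit layers matters for getting a larger model under a fixed budget.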
Tooling & availability
- On Hugging Face in GGUF and compressed-tensor formats.
- Deploys via llama.cpp, Ollama, LM Studio, vLLM, SGLang, MLX, Transformers.js, LiteRT-LM, Unsloth — a notably broad runtime list skewed to local / edge / mobile inference (llm-inference).
Why it matters
This confirms memory footprint as a competitive axis for gemma-4. The 12B made the quality case at 16 GB (gemma-4-12b-announcement); this release makes the floor case, with capable open models under 1 GB, runnable on phones. Quantization, not just architecture, is now an explicit lever in the open-weight market (synthesis). No API pricing applies, since these are self-host weights (llm-api-pricing).
Related
gemma-4 · quantization · google · open-weight-models · gemma-4-12b-announcement · llm-inference