Gemma 4
Google’s open-weight model family (open-weight-models): the locally runnable, Apache-2.0 counterpart to its proprietary Gemini line. A multi-size lineup tuned for on-device and efficient deployment.
Family (2026)
| Variant | Params | Notes |
|---|---|---|
| E4B | small | the lightweight end |
| 12B | 12B (dense) | encoder-free multimodal (vision + native audio); runs on 16 GB VRAM/unified memory; ~26B-level benchmarks at under half the memory (gemma-4-12b-announcement) |
| 26B-A4B | MoE, 25.2B total / 3.8B active | the flagship; 256K context; sparse activation for cheap inference (open-source-llms-2026) |
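
As a quick sanity check on the table's figures, the sketch below computes the share of the 26B-A4B's parameters that are active per token and the 12B's weight memory relative to it. It assumes the half-the-memory comparison is against the 26B-A4B at equal numeric precision, so the memory ratio reduces to a ratio of parameter counts; the constant names are illustrative only.

```python
# Parameter counts in billions, taken from the table above.
DENSE_12B = 12.0
MOE_TOTAL = 25.2
MOE_ACTIVE = 3.8

# Share of the MoE's parameters used in each token's forward pass.
active_fraction = MOE_ACTIVE / MOE_TOTAL

# Assumption: both models are stored at the same precision, so
# weight memory scales linearly with total parameter count.
memory_ratio = DENSE_12B / MOE_TOTAL

print(f"active fraction: {active_fraction:.1%}")   # → active fraction: 15.1%
print(f"12B vs 26B-A4B weights: {memory_ratio:.1%}")  # → 12B vs 26B-A4B weights: 47.6%
```

Under that assumption the 12B needs about 48% of the flagship's weight memory, which is consistent with the table's claim. The MoE still has to hold all 25.2B parameters in memory; sparse activation reduces per-token compute, not the resident weight size.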
Notable
- Encoder-free multimodality (12B): vision goes through a single matrix-multiplication embedding module, and audio is projected directly into the text token space, with no separate modality encoders (gemma-4-12b-announcement).
- MoE at the top end (26B-A4B): activates 3.8B of 25.2B params — the sparse-MoE pattern that dominates 2026 open weights (open-weight-models, cf. llm-inference).
- Licensing: Apache-2.0 — among the permissively-licensed open-weight leaders.
- Quantization (QAT): official quantization-aware-trained checkpoints for E2B / E4B / 26B-MoE, in Q4_0 (4-bit) plus a targeted 2-bit mobile scheme. The E2B text model fits in < 1 GB, with claimed higher quality than post-training quantization (PTQ) (gemma-4-qat). Memory footprint is treated as an explicit design lever.
- Deployment: weights on Hugging Face and Kaggle (GGUF and compressed-tensors formats); runs via llama.cpp, Ollama, LM Studio, MLX, vLLM, SGLang, Transformers.js, LiteRT-LM, Unsloth, so local/edge support is broad. Open weights mean self-hosting rather than paying for a metered API (llm-api-pricing).
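
To make the footprint point in the quantization bullet concrete, here is a rough weight-only estimate for the 26B-A4B at bf16 versus Q4_0. It assumes the GGML/llama.cpp Q4_0 block layout (32 weights at 4 bits each plus one fp16 scale per block); whether the official QAT checkpoints use exactly this layout is an assumption, and the helper names are hypothetical.

```python
# Assumed GGML Q4_0 block: 32 weights packed as 4-bit values + one fp16 scale.
BLOCK_WEIGHTS = 32
BLOCK_BYTES = BLOCK_WEIGHTS * 4 // 8 + 2  # 16 bytes of nibbles + 2-byte scale


def q4_0_bits_per_weight() -> float:
    """Effective bits per weight once the per-block scale is included."""
    return BLOCK_BYTES * 8 / BLOCK_WEIGHTS


def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weight-only size in decimal GB; excludes KV cache and activations."""
    return params_billion * bits_per_weight / 8


bpw = q4_0_bits_per_weight()  # 4.5 bits per weight, not a flat 4
for label, bits in [("bf16", 16.0), ("Q4_0", bpw)]:
    print(f"26B-A4B {label}: {weight_gb(25.2, bits):.1f} GB")
# → 26B-A4B bf16: 50.4 GB
# → 26B-A4B Q4_0: 14.2 GB
```

This is a lower bound rather than a file size: real checkpoints often keep some tensors (for example embeddings and norms) at higher precision, and runtime memory also includes the KV cache, which grows with context length.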
Place in the market
Gemma 4 is Google’s entry in the open-weight wave alongside Meta Llama 4, Alibaba Qwen3, DeepSeek, and Moonshot Kimi, competing on local-deploy efficiency and modality rather than raw frontier reasoning. The 12B’s encoder-free audio is the family’s current differentiator.
Related
google · open-weight-models · quantization · gemma-4-qat · gemma-4-12b-announcement · open-source-llms-2026 · llm-benchmarks · deepseek