Introducing Gemma 4 12B (Google)
This note covers Google’s announcement of Gemma 4 12B, a new open-weight variant in the gemma-4 family from google. It sits between the small E4B and the larger 26B-A4B MoE (open-source-llms-2026) as the dense, mid-size member aimed at local deployment on consumer hardware.
What’s announced
- A unified, encoder-free multimodal model: vision and audio inputs flow straight through the language-model backbone, with no separate modality encoders.
- Vision: a lightweight embedding module — “a single matrix multiplication, positional embedding and normalizations.”
- Audio: native input, projecting “raw audio signal into the same dimensional space as text tokens” — no audio encoder at all.
- “Advanced reasoning” — pitched for “powerful multi-step reasoning and agentic workflows.”
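The encoder-free input path described above can be illustrated with a small NumPy sketch. It is not Gemma's implementation: the patch size, model width, frame length, the choice of RMS normalization, and the function names `embed_patches` and `embed_audio` are all assumptions made for illustration. The announcement only states that vision uses "a single matrix multiplication, positional embedding and normalizations" and that raw audio is projected into the text-token space.

```python
import numpy as np


def embed_patches(image, patch, w_proj, pos_emb, eps=1e-6):
    """Hypothetical vision embedding: patchify, project, add position, normalize."""
    h, w, c = image.shape
    gh, gw = h // patch, w // patch
    # Cut the image into non-overlapping patches and flatten each into a vector.
    patches = (
        image[: gh * patch, : gw * patch]
        .reshape(gh, patch, gw, patch, c)
        .transpose(0, 2, 1, 3, 4)
        .reshape(gh * gw, patch * patch * c)
    )
    # The "single matrix multiplication": one linear map into the model width.
    tokens = patches @ w_proj
    # Positional embedding, one row per patch position.
    tokens = tokens + pos_emb[: gh * gw]
    # RMS normalization is an assumed choice; the source only says "normalizations".
    rms = np.sqrt(np.mean(tokens**2, axis=-1, keepdims=True) + eps)
    return tokens / rms


def embed_audio(waveform, frame_len, w_audio):
    """Hypothetical audio embedding: frame the raw signal, then project each frame."""
    n_frames = len(waveform) // frame_len
    frames = waveform[: n_frames * frame_len].reshape(n_frames, frame_len)
    return frames @ w_audio


rng = np.random.default_rng(0)
patch, d_model, frame_len = 16, 32, 400  # toy sizes, not Gemma's

image = rng.random((64, 64, 3))          # 4 x 4 grid of 16-pixel patches
waveform = rng.normal(size=16_000)       # one second at an assumed 16 kHz

w_proj = rng.normal(size=(patch * patch * 3, d_model)) * 0.02
pos_emb = rng.normal(size=(16, d_model)) * 0.02
w_audio = rng.normal(size=(frame_len, d_model)) * 0.02

vision_tokens = embed_patches(image, patch, w_proj, pos_emb)
audio_tokens = embed_audio(waveform, frame_len, w_audio)

# Both modalities land in the same d_model-wide space, so they can be
# concatenated into one sequence for the language-model backbone.
sequence = np.concatenate([vision_tokens, audio_tokens], axis=0)
print(vision_tokens.shape, audio_tokens.shape, sequence.shape)
```

The point of the sketch is that once every modality is a row of width `d_model`, the backbone needs no modality-specific encoder in front of it.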
Numbers
- Performance “nearing our 26B model” on standard benchmarks (cf. llm-benchmarks) at less than half the total memory footprint.
- Runs locally on 16GB of VRAM or unified memory.
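A rough back-of-the-envelope check on the 16GB figure, under stated assumptions: "12B" is read as 12 billion parameters, only weight storage is counted (no KV cache, activations, or runtime overhead), and the precisions listed are common choices rather than anything the announcement specifies.

```python
GB = 1e9  # decimal gigabytes, matching the "16GB" phrasing


def weight_memory_gb(n_params, bits_per_param):
    """Memory needed to store the weights alone, in decimal GB."""
    return n_params * bits_per_param / 8 / GB


N_PARAMS = 12e9   # assumption: nominal 12 billion parameters
BUDGET_GB = 16    # the memory budget quoted in the announcement

for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    need = weight_memory_gb(N_PARAMS, bits)
    print(f"{label}: {need:.0f} GB of weights, fits in {BUDGET_GB} GB: {need < BUDGET_GB}")
```

Under these assumptions, 16-bit weights alone (24 GB) would exceed the budget, while 8-bit (12 GB) and 4-bit (6 GB) weights would fit, which suggests the 16GB claim relies on a quantized or otherwise compressed format. The announcement does not say which.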
Licensing & availability
- Apache-2.0 (open-weight-models) — weights open on Hugging Face and Kaggle.
- Framework support: Hugging Face Transformers, llama.cpp, MLX, vLLM (llm-inference).
- No API pricing in the announcement (open weights; self-host — cf. llm-api-pricing).
Why it matters
Two firsts for this wiki’s open-weight thread: (1) multimodality, including native audio, in a locally runnable open model, and (2) an encoder-free architecture that folds audio into the token stream. The competitive lever is memory: quality claimed to be “nearing” the 26B model at less than half its memory footprint, runnable on a laptop or GPU with 16GB.
Related
gemma-4 · google · open-weight-models · open-source-llms-2026 · llm-benchmarks · llm-inference