Canary-Qwen 2.5B
NVIDIA’s open-weight speech-to-text model that tops the Open ASR Leaderboard — WER 5.63% (English). 2.5B params, CC-BY-4.0, English-only, RTFx 418× open-source-stt-models.
The architecture that defines the 2026 trend
A SALM (Speech-Augmented Language Model): a FastConformer speech encoder feeding an unmodified Qwen3-1.7B LLM decoder. So ASR is reframed as language modeling over speech features — the recognition-side instance of the “speech on an LLM backbone” convergence the wiki tracks across text-to-speech (Llama-based TTS, neural-audio-codec) and music (suno). Siblings on the same idea: IBM Granite-Speech, Alibaba Qwen3-ASR.
Significance
- Beats whisper (7.4%) and the commercial batch leaders on English WER while staying open-weight.
- Direct cross-wiki bridge: its decoder is literally a Qwen3 LLM — the gemini/open-weight-Qwen lineage in llm-providers-wiki, now doing transcription.
- Tradeoff: English-only — Whisper still wins multilingual.
Related
speech-to-text · speech-audio-ai · whisper · open-source-stt-models · neural-audio-codec