Software Application updated Fri Jun 05 2026 00:00:00 GMT+0000 (Coordinated Universal Time)

Canary-Qwen 2.5B

NVIDIA’s open-weight speech-to-text model that tops the Open ASR Leaderboard — WER 5.63% (English). 2.5B params, CC-BY-4.0, English-only, RTFx 418× open-source-stt-models.

The architecture that defines the 2026 trend

A SALM (Speech-Augmented Language Model): a FastConformer speech encoder feeding an unmodified Qwen3-1.7B LLM decoder. So ASR is reframed as language modeling over speech features — the recognition-side instance of the “speech on an LLM backbone” convergence the wiki tracks across text-to-speech (Llama-based TTS, neural-audio-codec) and music (suno). Siblings on the same idea: IBM Granite-Speech, Alibaba Qwen3-ASR.

Significance

Beats whisper (7.4%) and the commercial batch leaders on English WER while staying open-weight.
Direct cross-wiki bridge: its decoder is literally a Qwen3 LLM — the gemini/open-weight-Qwen lineage in llm-providers-wiki, now doing transcription.
Tradeoff: English-only — Whisper still wins multilingual.

speech-to-text · speech-audio-ai · whisper · open-source-stt-models · neural-audio-codec

Canary-Qwen 2.5B

The architecture that defines the 2026 trend

Significance

Related

Linked from