Best Open-Source Speech-to-Text Model in 2026 (Northflank)
Northflank’s benchmarked survey of open-source speech-to-text (STT/ASR) models — the founding STT source for this wiki, anchored on Open ASR Leaderboard WER.
Models & WER (Open ASR Leaderboard, English)
| Model | WER | Size | Notable |
|---|---|---|---|
| Canary-Qwen (NVIDIA) | 5.63% | 2.5B | SALM: FastConformer encoder + Qwen3-1.7B decoder; RTFx 418×; CC-BY-4.0; English-only |
| IBM Granite Speech 3.3 | 5.85% | ~9B | LoRA-tuned on Granite 3.3; +translation; Apache-2.0 |
| Whisper Large V3 (OpenAI) | 7.4% | 1.55B | 99+ languages; MIT; the ecosystem default |
| Whisper Large V3 Turbo | 7.75% | 809M | 4 decoder layers; RTFx 216×; ~6GB |
| Distil-Whisper Large V3 | within ~1% of Large V3 | 756M | distilled, English-only; 5–6× faster; MIT |
| Parakeet TDT 1.1B (NVIDIA) | ~8.0% | 1.1B | RNN-Transducer streaming; RTFx >2,000×; CC-BY-4.0 |
Also: wav2vec 2.0 (Meta, self-supervised; 53-lang XLSR), Qwen3-ASR (Alibaba, 52 langs, 1.7B/0.6B), Moonshine (edge, from 27M params).
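The two metrics in the table can be made concrete with a short sketch. WER is the word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words, and RTFx is seconds of audio transcribed per second of compute. The function names `wer` and `rtfx` are illustrative, and the lowercase-and-split normalization is an assumption: the leaderboard's own text normalization is not reproduced here, so scores from this sketch will not match leaderboard numbers exactly.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    # Assumption: naive normalization (lowercase + whitespace split).
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor: audio duration divided by processing time."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


if __name__ == "__main__":
    # One dropped word out of six reference words.
    print(wer("the cat sat on the mat", "the cat sat on mat"))
    # One hour of audio processed in 10 seconds.
    print(rtfx(3600.0, 10.0))
```

Read this way, an RTFx of 418× means an hour of audio takes roughly 3600 / 418 ≈ 8.6 seconds of compute on the benchmark hardware, and a 5.63% WER means about 5 to 6 word errors per 100 reference words.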
The pattern that matters
The accuracy leaders fuse ASR with an LLM: Canary-Qwen (SALM), Granite-Speech, and Qwen3-ASR. This is the same “speech on an LLM backbone” convergence the TTS side shows (text-to-speech, neural-audio-codec), and it bridges directly to llm-providers-wiki (gemini, open-weight Qwen/Llama). Whisper no longer tops WER but still dominates on ecosystem (MIT license, 99+ languages, tooling).
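The pattern can be checked against the table's own numbers. The sketch below copies the rows that carry a numeric WER, tags each with whether this page describes it as LLM-backed, and ranks them. The tuple layout and variable names are illustrative assumptions, and Parakeet's figure is the approximate ~8.0% from the table.

```python
# (model, WER %, LLM-backed per this page) - values copied from the WER table.
rows = [
    ("Canary-Qwen", 5.63, True),
    ("Granite Speech 3.3", 5.85, True),
    ("Whisper Large V3", 7.4, False),
    ("Whisper Large V3 Turbo", 7.75, False),
    ("Parakeet TDT 1.1B", 8.0, False),  # listed as ~8.0%
]

# Lower WER is better, so sort ascending.
ranked = sorted(rows, key=lambda row: row[1])

# Are the two most accurate models both LLM-backed?
llm_backed_lead = all(is_llm for _, _, is_llm in ranked[:2])

# Gap in WER points between the best non-LLM model and the second LLM-backed one.
gap = ranked[2][1] - ranked[1][1]

if __name__ == "__main__":
    for name, wer_pct, is_llm in ranked:
        tag = "LLM-backed" if is_llm else "conventional"
        print(f"{name:<24} {wer_pct:>5.2f}%  {tag}")
    print("LLM-backed models lead:", llm_backed_lead)
    print(f"gap to best non-LLM model: {gap:.2f} points")
```

On these figures, the two LLM-backed models sit about 1.5 WER points ahead of Whisper Large V3, the best conventional entry in the table.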
Verdict
“No single best STT” — English accuracy → Canary-Qwen / Granite; multilingual → Whisper; speed → Parakeet / Distil-Whisper; edge → Moonshine. (See speech-to-text for the axes; commercial APIs in stt-apis-comparison.)
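The verdict amounts to a lookup from priority to model family. A minimal sketch, assuming hypothetical priority keys (`english_accuracy`, `multilingual`, `speed`, `edge`) and a hypothetical `recommend` helper, neither of which comes from the source:

```python
# Priority -> recommended model families, as stated in the verdict.
# The key names are illustrative labels, not terms from the source.
RECOMMENDATIONS = {
    "english_accuracy": ["Canary-Qwen", "Granite Speech"],
    "multilingual": ["Whisper"],
    "speed": ["Parakeet", "Distil-Whisper"],
    "edge": ["Moonshine"],
}


def recommend(priority: str) -> list[str]:
    """Return the model families the verdict suggests for a given priority."""
    try:
        return list(RECOMMENDATIONS[priority])
    except KeyError:
        options = ", ".join(sorted(RECOMMENDATIONS))
        raise ValueError(
            f"unknown priority {priority!r}; expected one of: {options}"
        ) from None


if __name__ == "__main__":
    print(recommend("multilingual"))
    print(recommend("speed"))
```

A real selection would also weigh license (CC-BY-4.0 vs. MIT vs. Apache-2.0), language coverage, and hardware budget, which the table lists per model.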
Related
speech-to-text · speech-audio-ai · whisper · canary-qwen · stt-apis-comparison · open-weight-tts