Spokes.wiki Search Graph Growth About

speech-audio-wiki

Blog Posting source ↗ source url updated Fri Jun 05 2026 00:00:00 GMT+0000 (Coordinated Universal Time)

Best Open-Source Speech-to-Text Model in 2026 (Northflank)

Northflank’s benchmarked survey of open-source speech-to-text (STT/ASR) models — the founding STT source for this wiki, anchored on Open ASR Leaderboard WER.

Models & WER (Open ASR Leaderboard, English)

ModelWERSizeNotable
canary-qwen (NVIDIA)5.63%2.5BSALM: FastConformer encoder + Qwen3-1.7B decoder; RTFx 418×; CC-BY-4.0; English-only
IBM Granite Speech 3.35.85%~9BLoRA-tuned on Granite 3.3; +translation; Apache-2.0
whisper Large V3 (OpenAI)7.4%1.55B99+ languages; MIT; the ecosystem default
Whisper Large V3 Turbo7.75%809M4 decoder layers; RTFx 216×; ~6GB
Distil-Whisper Large V3~within 1% of V3756Mdistilled, English-only; 5–6× faster; MIT
Parakeet TDT 1.1B (NVIDIA)~8.0%1.1BRNN-Transducer streaming; RTFx >2,000×; CC-BY-4.0

Also: wav2vec 2.0 (Meta, self-supervised; 53-lang XLSR), Qwen3-ASR (Alibaba, 52 langs, 1.7B/0.6B), Moonshine (edge, from 27M params).

The pattern that matters

The accuracy leaders fuse ASR with an LLM — Canary-Qwen (SALM), Granite-Speech, Qwen3-ASR — the same “speech on an LLM backbone” convergence the TTS side shows (text-to-speech, neural-audio-codec); a direct bridge to llm-providers-wiki (gemini, open-weight Qwen/Llama). whisper no longer tops WER but dominates on ecosystem (MIT, 99+ langs, tooling).

Verdict

“No single best STT” — English accuracy → Canary-Qwen / Granite; multilingual → Whisper; speed → Parakeet / Distil-Whisper; edge → Moonshine. (See speech-to-text for the axes; commercial APIs in stt-apis-comparison.)

speech-to-text · speech-audio-ai · whisper · canary-qwen · stt-apis-comparison · open-weight-tts