Blog Posting · updated Fri Jun 05 2026

Best Open-Source Speech-to-Text Model in 2026 (Northflank)

Northflank’s benchmarked survey of open-source speech-to-text (STT/ASR) models — the founding STT source for this wiki, anchored on Open ASR Leaderboard WER.

Models & WER (Open ASR Leaderboard, English)

Model	WER	Size	Notable
Canary-Qwen (NVIDIA)	5.63%	2.5B	SALM: FastConformer encoder + Qwen3-1.7B decoder; RTFx 418×; CC-BY-4.0; English-only
IBM Granite Speech 3.3	5.85%	~9B	LoRA-tuned on Granite 3.3; +translation; Apache-2.0
Whisper Large V3 (OpenAI)	7.4%	1.55B	99+ languages; MIT; the ecosystem default
Whisper Large V3 Turbo	7.75%	809M	4 decoder layers; RTFx 216×; ~6GB
Distil-Whisper Large V3	~within 1% of V3	756M	distilled, English-only; 5–6× faster; MIT
Parakeet TDT 1.1B (NVIDIA)	~8.0%	1.1B	RNN-Transducer streaming; RTFx >2,000×; CC-BY-4.0

Also: wav2vec 2.0 (Meta, self-supervised; 53-lang XLSR), Qwen3-ASR (Alibaba, 52 langs, 1.7B/0.6B), Moonshine (edge, from 27M params).
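For readers new to the table's two metrics, here is a minimal sketch of how they are usually computed. WER is the word-level edit distance between reference and hypothesis divided by the number of reference words; RTFx is audio duration divided by processing time, so higher is faster. The function names are hypothetical, and the normalization here (lowercase, whitespace split) is an assumption that is much simpler than what a leaderboard applies before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    # Assumption: naive normalization; real evaluations also strip punctuation etc.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing from hypothesis)
                curr[j - 1] + 1,     # insertion (extra word in hypothesis)
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            )
        prev = curr
    return prev[-1] / len(ref)


def transcription_seconds(audio_seconds: float, rtfx: float) -> float:
    """Processing time implied by an RTFx figure (audio duration / processing time)."""
    if rtfx <= 0:
        raise ValueError("rtfx must be positive")
    return audio_seconds / rtfx


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) over six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
    # One hour of audio at the RTFx values quoted in the table.
    print(transcription_seconds(3600, 418), transcription_seconds(3600, 216))
```

Read this way, a 5.63% WER means roughly 5 to 6 word errors per 100 reference words, and an RTFx of 418× means an hour of audio is processed in under 10 seconds on the benchmark hardware.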

The pattern that matters

The accuracy leaders fuse ASR with an LLM: Canary-Qwen (SALM), Granite-Speech, and Qwen3-ASR. This is the same “speech on an LLM backbone” convergence seen on the TTS side (text-to-speech, neural-audio-codec), and it gives a direct bridge to llm-providers-wiki (gemini, open-weight Qwen/Llama). Whisper no longer tops WER but still dominates on ecosystem (MIT license, 99+ languages, tooling).

Verdict

“No single best STT” — English accuracy → Canary-Qwen / Granite; multilingual → Whisper; speed → Parakeet / Distil-Whisper; edge → Moonshine. (See speech-to-text for the axes; commercial APIs in stt-apis-comparison.)
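The verdict amounts to a lookup from priority to candidate models, which can be sketched as below. The key names and the `recommend` helper are hypothetical labels chosen for illustration; only the model groupings come from the verdict itself.

```python
# Mapping taken from the verdict; the keys are illustrative names, not from the source.
RECOMMENDATIONS: dict[str, list[str]] = {
    "english_accuracy": ["Canary-Qwen", "Granite"],
    "multilingual": ["Whisper"],
    "speed": ["Parakeet", "Distil-Whisper"],
    "edge": ["Moonshine"],
}


def recommend(priority: str) -> list[str]:
    """Return the candidate models for one priority axis."""
    try:
        return RECOMMENDATIONS[priority]
    except KeyError:
        valid = ", ".join(sorted(RECOMMENDATIONS))
        raise ValueError(f"unknown priority {priority!r}; expected one of: {valid}") from None


if __name__ == "__main__":
    for axis in RECOMMENDATIONS:
        print(axis, "->", " / ".join(recommend(axis)))
```

A real selection would also weigh license, language coverage, and hardware budget, which is why the verdict names axes rather than a single winner.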

Related

speech-to-text · speech-audio-ai · whisper · canary-qwen · stt-apis-comparison · open-weight-tts
