Whisper
OpenAI’s open-weight speech-to-text model family and the ecosystem default for ASR. MIT licensed; a transformer encoder-decoder. Its origin, and the technical reason for its dominance (large-scale weak supervision on 680k hours of audio, usable zero-shot with no fine-tuning), are grounded in the primary paper whisper-paper (Radford et al., 2022).
Variants
- Large V3: 1.55B parameters, 32 decoder layers, 99+ languages, WER 7.4%, ~10GB VRAM.
- Large V3 Turbo: 809M parameters, 4 decoder layers, RTFx 216×, ~6GB VRAM; near-V3 accuracy at much higher speed.
- Distil-Whisper V3: 756M parameters, distilled, English-only, 5–6× faster, within ~1% WER of Large V3 (open-source-stt-models).
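The VRAM and RTFx figures in the list can be turned into a rough sizing rule. The following is a minimal Python sketch, not part of any Whisper tooling: the function names and the "pick the largest variant that fits" policy are illustrative assumptions, and RTFx is read here as seconds of audio transcribed per second of wall-clock time.

```python
from typing import Optional

# Approximate VRAM figures (GB) copied from the variant list.
# Only variants with a listed VRAM figure are included.
VRAM_GB = {
    "Large V3": 10.0,
    "Large V3 Turbo": 6.0,
}


def wall_clock_seconds(audio_seconds: float, rtfx: float) -> float:
    """Estimated processing time, reading RTFx as audio seconds per wall-clock second."""
    if rtfx <= 0:
        raise ValueError("rtfx must be positive")
    return audio_seconds / rtfx


def largest_fitting_variant(available_vram_gb: float) -> Optional[str]:
    """Hypothetical policy: prefer the largest-footprint variant that still fits."""
    fitting = [(gb, name) for name, gb in VRAM_GB.items() if gb <= available_vram_gb]
    return max(fitting)[1] if fitting else None


if __name__ == "__main__":
    # One hour of audio at the Turbo figure of RTFx 216.
    print(f"{wall_clock_seconds(3600, 216):.1f} s")
    print(largest_fitting_variant(8.0))
```

Under these assumptions, an hour of audio at RTFx 216× takes roughly 17 seconds, ignoring model load time and I/O, and an 8GB card fits Turbo but not Large V3.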
Why it still matters despite not topping WER
By 2026 Whisper is no longer the most accurate option: canary-qwen (5.63%), IBM Granite (5.85%), and Qwen3-ASR all beat it on English WER. Its dominance comes from its ecosystem: a permissive MIT license, 99+ language coverage, and an enormous volume of community tooling and integrations. It’s the safe multilingual default; the SALM models win on raw English accuracy (speech-to-text).
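The percentages compared here are word error rates: word-level edits (substitutions, deletions, insertions) divided by the number of reference words. A minimal Python sketch of that definition follows. The lowercase-and-split normalization is a simplifying assumption; published benchmark figures typically apply a fuller text normalizer, so this will not reproduce them exactly.

```python
from typing import List


def word_edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Levenshtein distance over words, computed one row at a time."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate as a fraction; multiply by 100 for a percentage."""
    ref = reference.lower().split()  # assumed normalization: lowercase, whitespace split
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return word_edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) over six reference words.
    print(wer("the cat sat on the mat", "the cat sit on mat"))
```

On this scale, the gap between 7.4% and 5.63% is a little under two extra word errors per hundred reference words.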
Place in the wiki
The open-source STT anchor: the recognition-side counterpart to kokoro’s role on the synthesis side (the permissive, widely deployed baseline). Available self-hosted or via OpenAI’s API (stt-apis-comparison).
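The two deployment routes can be sketched side by side. This is a hedged Python example, assuming the `openai-whisper` package (plus ffmpeg) for the self-hosted route and the `openai` client with an `OPENAI_API_KEY` environment variable for the hosted route. The wrapper names `transcribe_local`, `transcribe_hosted`, and `transcribe` are hypothetical, and the model identifiers are the ones those packages accepted at the time of writing.

```python
from pathlib import Path


def transcribe_local(audio_path: str, model_name: str = "large-v3") -> str:
    """Self-hosted route: run the open weights on local hardware."""
    import whisper  # assumed installed via: pip install openai-whisper

    model = whisper.load_model(model_name)
    result = model.transcribe(audio_path)
    return result["text"].strip()


def transcribe_hosted(audio_path: str) -> str:
    """Hosted route: send the file to OpenAI's transcription endpoint."""
    from openai import OpenAI  # assumed installed via: pip install openai

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    with open(audio_path, "rb") as audio_file:
        response = client.audio.transcriptions.create(
            model="whisper-1", file=audio_file
        )
    return response.text


def transcribe(audio_path: str, hosted: bool = False) -> str:
    """Hypothetical wrapper: check the file exists, then pick a route."""
    if not Path(audio_path).is_file():
        raise FileNotFoundError(audio_path)
    return transcribe_hosted(audio_path) if hosted else transcribe_local(audio_path)
```

The imports are deferred into each function so that only the route actually used needs its dependency installed.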
Related
whisper-paper · speech-to-text · speech-audio-ai · canary-qwen · open-source-stt-models · stt-apis-comparison