Robust Speech Recognition via Large-Scale Weak Supervision (Radford et al., 2022)
The primary research paper behind whisper — Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, Ilya Sutskever (OpenAI), arXiv:2212.04356, Dec 2022 (ICML 2023). The whisper page records that Whisper is the multilingual ASR default; this page anchors why in the canonical source: the weak-supervision recipe that produced it.
The thesis: scale weak supervision, skip fine-tuning
Whisper’s central bet is that scale of weakly supervised data beats curation. The models are trained on 680,000 hours of multilingual and multitask supervision collected from the web, far more than the curated, gold-labelled corpora prior ASR relied on, and “weak” in that the labels are imperfect web transcripts rather than hand-verified ones. The payoff is zero-shot transfer: the resulting models “generalize well to standard benchmarks and are often competitive with prior fully supervised results but in a zero-shot transfer setting without the need for any fine-tuning,” and, compared with humans, “approach their accuracy and robustness.” OpenAI released the models and inference code “to serve as a foundation for further work on robust speech processing”; this is the MIT-licensed release that seeded the ecosystem the whisper page describes.
A single encoder-decoder Transformer handles transcription, translation, language identification, and timestamp prediction by emitting them as one token sequence conditioned on special task tokens. (This is well-established architecture detail; the abstract states only the weak-supervision and zero-shot result, not the full design.)
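To make the "one token sequence with special task tokens" idea concrete, here is a minimal sketch of how a decoder prefix could be assembled. The token spellings follow the open-source Whisper tokenizer, and the 20 ms timestamp step follows the paper's description of time quantization. The helper names (build_decoder_prefix, timestamp_token) are hypothetical and are not part of any Whisper API. The sketch is illustrative, not the authors' implementation.

```python
from typing import List, Optional

# Token spellings follow the open-source Whisper tokenizer; this is an
# illustrative subset, not the full special-token vocabulary.
SOT = "<|startoftranscript|>"
EOT = "<|endoftext|>"
NO_TIMESTAMPS = "<|notimestamps|>"
TASKS = ("transcribe", "translate")


def build_decoder_prefix(
    language: Optional[str],
    task: str = "transcribe",
    timestamps: bool = False,
) -> List[str]:
    """Hypothetical helper: assemble the task-conditioning prefix as strings.

    With language=None the prefix stops after the start token, so the
    model's next prediction is the language token (language identification).
    """
    if task not in TASKS:
        raise ValueError(f"task must be one of {TASKS}, got {task!r}")
    if language is None:
        return [SOT]
    prefix = [SOT, f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        # Tell the decoder to emit text only, with no timestamp tokens.
        prefix.append(NO_TIMESTAMPS)
    return prefix


def timestamp_token(seconds: float) -> str:
    """Hypothetical helper: quantize a time to 20 ms steps and format it."""
    steps = round(seconds * 50)  # 50 steps per second = 20 ms resolution
    return f"<|{steps / 50:.2f}|>"


if __name__ == "__main__":
    # English transcription without timestamps.
    print(build_decoder_prefix("en", "transcribe"))
    # French speech translated to English, with timestamp tokens enabled.
    prefix = build_decoder_prefix("fr", "translate", timestamps=True)
    target = prefix + [timestamp_token(0.0), "Hello", timestamp_token(1.5), EOT]
    print(target)
```

The design point is that the task is selected by the prefix rather than by a separate model or output head, so the same decoder covers all four jobs.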
Why it grounds the spoke
This closes the provenance gap under one of the wiki’s load-bearing claims. The synthesis’s “open wedge in every branch” and “license, not score, decides” threads both lean on Whisper as the permissive, multilingual STT baseline, but the spoke sourced that only from comparison roundups (open-source-stt-models, stt-apis-comparison, T3/T4). The paper supplies the primary, peer-reviewed explanation for Whisper’s staying power: it is the default not because it tops WER (by 2026 canary-qwen and others beat it on English) but because weak supervision at scale bought robustness and 99-language coverage in one open model. It also frames the SALM successors (canary-qwen = encoder + Qwen3 decoder) as the next step past Whisper’s encoder-decoder design: “speech as language modeling” pushed further.
Tier
T1 — peer-reviewed primary (ICML 2023) from the model’s authors; the canonical origin of the weak-supervision ASR approach. Distinct artifact from the whisper model/ecosystem page. Figures (680k hours, zero-shot, human-comparison) are from the abstract; full architecture from established knowledge, noted inline.
Related
whisper · speech-to-text · speech-audio-ai · canary-qwen · neural-audio-codec