Speech-to-text (STT / ASR)
Automatic speech recognition: transcribing speech audio into text. The recognition branch of speech-audio-ai (the mirror of text-to-speech’s synthesis).
How it’s measured
- WER (word error rate): the headline accuracy metric, i.e. word-level substitutions, deletions, and insertions divided by the number of reference words; lower is better. The neutral anchor is Hugging Face’s Open ASR Leaderboard (since 2023; 700K+ visits; recently added a “benchmaxxer repellant” to resist overfitting). 2026 English leaders: canary-qwen 5.63%, IBM Granite Speech 5.85%, whisper V3 7.4% (see open-source-stt-models).
- RTFx (inverse real-time factor): throughput, i.e. seconds of audio transcribed per second of compute; higher is better. NVIDIA Parakeet TDT >2,000× (streaming, RNN-Transducer); Whisper V3 Turbo 216×.
- Latency: for streaming/agents, Deepgram <300 ms, ElevenLabs Scribe ~150 ms (see stt-apis-comparison).
- Language coverage: Whisper 99+ languages, Google Chirp 125+, Qwen3-ASR 52.
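As a rough illustration of the first two metrics above, the sketch below computes WER as word-level edit distance divided by reference length, and RTFx as audio duration divided by processing time. The function names `wer` and `rtfx` are illustrative, and plain whitespace tokenization is a simplifying assumption: leaderboards generally normalize text (casing, punctuation, numerals) before scoring, so numbers from this sketch are not directly comparable to published ones.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = cur[j - 1] + 1  # extra word in hypothesis
            cur[j] = min(substitution, deletion, insertion)
        prev = cur
    return prev[-1] / len(ref)


def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor: seconds of audio handled per second of compute."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


if __name__ == "__main__":
    # One dropped word out of six reference words.
    print(wer("the cat sat on the mat", "the cat sat on mat"))
    # One hour of audio transcribed in two seconds.
    print(rtfx(3600, 2))
```

Because the denominator is the reference length, WER can exceed 100% when the hypothesis contains many insertions.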
The market shape
Mirrors the rest of speech-audio-ai: commercial APIs hold a thin accuracy edge (ElevenLabs Scribe ~3.3% EN, Deepgram 5.26% batch; see stt-apis-comparison) over the open-source field (whisper, canary-qwen, Parakeet, Granite, Qwen3-ASR; see open-source-stt-models), with the usual build-vs-buy crossover at high volume, where the fixed cost of self-hosting is amortized against usage-based API pricing.
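The build-vs-buy crossover can be sketched as a simple break-even calculation. Everything here is an assumption for illustration: the function name `break_even_hours`, the linear cost model (a fixed monthly self-hosting cost plus a marginal cost per audio hour), and all of the prices, which are hypothetical placeholders rather than any vendor's actual rates.

```python
import math


def break_even_hours(fixed_monthly_cost: float,
                     api_price_per_hour: float,
                     self_host_cost_per_hour: float) -> float:
    """Monthly audio hours above which self-hosting is cheaper than an API.

    Solves: api_price * h = fixed_cost + self_host_cost * h  for h.
    Returns math.inf if self-hosting never becomes cheaper.
    """
    saving_per_hour = api_price_per_hour - self_host_cost_per_hour
    if saving_per_hour <= 0:
        return math.inf
    return fixed_monthly_cost / saving_per_hour


if __name__ == "__main__":
    # Hypothetical numbers only: $2,000/month for GPUs and ops,
    # $0.50 per audio hour via an API, $0.25 per audio hour self-hosted.
    hours = break_even_hours(2000.0, 0.50, 0.25)
    print(f"break-even at {hours:.0f} audio hours per month")
```

A real comparison would also need to account for engineering time, the accuracy gap between the models, and utilization of the hardware, none of which this linear model captures.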
The defining 2026 trend — STT meets the LLM
Top accuracy now comes from SALM-style models that bolt an LLM decoder onto a speech encoder (canary-qwen = FastConformer + Qwen3; Granite-Speech; Qwen3-ASR). Recognition is reframed as language modeling, which is the bridge to llm-providers-wiki (gemini, Qwen/Llama). whisper no longer leads on WER but wins on ecosystem (MIT license, language coverage, tooling).
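A toy, shape-level sketch of the encoder-plus-LLM-decoder idea follows. It assumes a common pattern in which speech-encoder frames are linearly projected into the LLM's embedding space and concatenated with the text-prompt embeddings before decoding. The helper names, the dimensions, and the random weights are all hypothetical, and the specific adapter each named model uses is not described here, so this should not be read as any model's actual implementation.

```python
import random


def linear_project(frames, weight):
    """Map each encoder frame (length d_enc) to the LLM embedding size (d_llm).

    `weight` is a list of d_llm rows, each of length d_enc.
    """
    return [[sum(x * w for x, w in zip(frame, row)) for row in weight]
            for frame in frames]


def build_decoder_input(prompt_embeddings, encoder_frames, weight):
    """Concatenate text-prompt embeddings with projected audio frames.

    The LLM decoder would then generate the transcript token by token,
    conditioned on this combined sequence.
    """
    audio_embeddings = linear_project(encoder_frames, weight)
    return prompt_embeddings + audio_embeddings


# Toy dimensions, chosen only for illustration.
D_ENC, D_LLM = 4, 6
rng = random.Random(0)
weight = [[rng.uniform(-1, 1) for _ in range(D_ENC)] for _ in range(D_LLM)]
# Stand-in for 10 frames of speech-encoder output.
encoder_frames = [[rng.uniform(-1, 1) for _ in range(D_ENC)] for _ in range(10)]
# Stand-in for 3 embedded prompt tokens.
prompt_embeddings = [[rng.uniform(-1, 1) for _ in range(D_LLM)] for _ in range(3)]

decoder_input = build_decoder_input(prompt_embeddings, encoder_frames, weight)
print(len(decoder_input), len(decoder_input[0]))  # 13 positions, each of width 6
```

The point of the sketch is only that audio ends up as extra positions in the decoder's input sequence, which is what lets transcription be treated as ordinary next-token prediction.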
Related
speech-audio-ai · text-to-speech · whisper · canary-qwen · open-source-stt-models · stt-apis-comparison · tts-benchmarks