Speech-to-Text APIs in 2026: Benchmarks, Pricing (Future AGI)
A comparison of the leading commercial speech-to-text APIs: the closed, managed counterpart to the open-source models covered in open-source-stt-models.
The APIs (WER / latency / price)
| API | WER | Latency | Languages | Price |
|---|---|---|---|---|
| Deepgram Nova-3 | 5.26% batch / 6.84% stream | <300ms | 36+ | $0.0043–0.0077/min |
| ElevenLabs Scribe v2 Realtime | ~3.3% (EN) | ~150ms | 90+ | $0.22–0.48/hr |
| OpenAI GPT-4o Transcribe | ~8.9% | not real-time | 57+ | $6/1K min |
| Google Cloud Chirp 3 | ~11.6% | variable | 125+ | $4–16/1K min |
| AssemblyAI Universal-2 | ~14.5% | ~760ms | 99+ | ~$0.0062/min |
(WER numbers come from different test sets, so they are not directly comparable across rows; treat them as indicative snapshots, cf. speech-to-text.)
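The Price column mixes three units (per minute, per hour, per 1,000 minutes), which makes the rows hard to compare at a glance. A minimal sketch of normalizing them to USD per audio hour follows. The figures are copied from the table as-is; the unit tags (`min`, `hr`, `1k_min`) and the function names are made up for this illustration.

```python
# Prices copied from the table above as (low, high, unit).
# Unit tags are illustrative: "min" = USD per minute, "hr" = USD per hour,
# "1k_min" = USD per 1,000 minutes.
PRICES = {
    "Deepgram Nova-3": (0.0043, 0.0077, "min"),
    "ElevenLabs Scribe v2 Realtime": (0.22, 0.48, "hr"),
    "OpenAI GPT-4o Transcribe": (6.0, 6.0, "1k_min"),
    "Google Cloud Chirp 3": (4.0, 16.0, "1k_min"),
    "AssemblyAI Universal-2": (0.0062, 0.0062, "min"),
}

# Multiplier that turns a price in each unit into USD per audio hour.
TO_PER_HOUR = {"min": 60.0, "hr": 1.0, "1k_min": 60.0 / 1000.0}


def usd_per_hour(low, high, unit):
    """Convert a (low, high) price range in the given unit to USD per audio hour."""
    factor = TO_PER_HOUR[unit]
    return (round(low * factor, 4), round(high * factor, 4))


def normalized_prices(prices=PRICES):
    """Return {api: (low, high)} with every price expressed in USD per audio hour."""
    return {api: usd_per_hour(*entry) for api, entry in prices.items()}


if __name__ == "__main__":
    # Print the APIs from cheapest to most expensive by the low end of the range.
    ranked = sorted(normalized_prices().items(), key=lambda kv: kv[1][0])
    for api, (low, high) in ranked:
        print(f"{api:32s} ${low:.3f}-{high:.3f}/hr")
```

On a per-hour basis the low ends of all five ranges fall between roughly $0.22 and $0.37, so list price separates these APIs less than the mixed units suggest.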
Reading it
- Lowest WER is now commercial: ElevenLabs Scribe (~3.3% EN) and Deepgram (5.26% batch) both come in below the open leaders’ ~5.6%, but the gap is small and dataset-dependent. This is the same closed-edge-over-open pattern seen in text-to-speech, though the margin is thinner than on TTS Elo.
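- Pick by axis: real-time agents → Deepgram / ElevenLabs (~150–300ms); multilingual → Chirp (125+ languages) / AssemblyAI (99+); batch accuracy → GPT-4o Transcribe (not real-time; note that its ~8.9% WER in the table sits above Deepgram's 5.26% batch figure, so confirm on your own audio); built-in “speech intelligence” (sentiment/topics) → AssemblyAI.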
- The tradeoff: commercial APIs trade control for convenience; open-source (whisper, canary-qwen) wins on cost and autonomy at heavy volume, where GPU economics beat per-minute pricing. This is the same build-vs-buy calculus that open-source-stt-models frames.
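The build-vs-buy point in the last bullet can be made concrete with a break-even calculation. The sketch below is a rough model under stated assumptions: self-hosting is treated as a fixed monthly cost plus an optional per-hour marginal cost, and both GPU figures are hypothetical placeholders, not numbers from this note. Only the API price ($6 per 1,000 minutes, i.e. $0.36 per audio hour) comes from the table.

```python
def breakeven_hours(api_usd_per_hour, gpu_fixed_usd_per_month,
                    gpu_usd_per_audio_hour=0.0):
    """Monthly audio hours above which self-hosting costs less than the API.

    Returns None when the API is no more expensive per hour than the
    self-hosted marginal cost, since self-hosting then never catches up.
    """
    margin = api_usd_per_hour - gpu_usd_per_audio_hour
    if margin <= 0:
        return None
    return gpu_fixed_usd_per_month / margin


def monthly_costs(audio_hours, api_usd_per_hour, gpu_fixed_usd_per_month,
                  gpu_usd_per_audio_hour=0.0):
    """Return (api_cost, self_hosted_cost) in USD for one month of audio."""
    api_cost = audio_hours * api_usd_per_hour
    self_hosted = gpu_fixed_usd_per_month + audio_hours * gpu_usd_per_audio_hour
    return api_cost, self_hosted


if __name__ == "__main__":
    API_RATE = 0.36        # from the table: $6 per 1,000 minutes
    GPU_FIXED = 720.0      # hypothetical: one rented GPU, USD per month
    hours = breakeven_hours(API_RATE, GPU_FIXED)
    print(f"break-even at about {hours:.0f} audio hours per month")
    print(monthly_costs(5000, API_RATE, GPU_FIXED))
```

With these placeholder inputs the crossover lands at roughly 2,000 audio hours per month. Below that volume the API is cheaper; above it the fixed GPU cost amortizes. Engineering and operations time is left out of the model and would push the real break-even higher.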
Related
speech-to-text · speech-audio-ai · open-source-stt-models · whisper