Blog posting, updated Fri Jun 05 2026

Speech-to-Text APIs in 2026: Benchmarks, Pricing (Future AGI)

A comparison of the leading commercial speech-to-text APIs: the closed, managed counterpart to the open-source models covered in open-source-stt-models.

The APIs (WER / latency / price)

API	WER	Latency	Languages	Price
Deepgram Nova-3	5.26% batch / 6.84% stream	<300ms	36+	$0.0043–0.0077/min
ElevenLabs Scribe v2 Realtime	~3.3% (EN)	~150ms	90+	$0.22–0.48/hr
OpenAI GPT-4o Transcribe	~8.9%	not real-time	57+	$6/1K min
Google Cloud Chirp 3	~11.6%	variable	125+	$4–16/1K min
AssemblyAI Universal-2	~14.5%	~760ms	99+	~$0.0062/min

(WER numbers come from differing test sets — treat as indicative snapshots, cf. speech-to-text.)
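The Price column mixes three billing units (per minute, per hour, per 1K minutes), which makes the rows hard to compare at a glance. A minimal sketch of the conversion to USD per audio minute follows. The figures are copied from the table above; the dictionary layout and function names are illustrative only, and single-price rows are stored as a range with equal bounds.

```python
# Prices as listed in the table above: (low, high, billing unit).
LISTED_PRICES = {
    "Deepgram Nova-3": (0.0043, 0.0077, "min"),
    "ElevenLabs Scribe v2 Realtime": (0.22, 0.48, "hr"),
    "OpenAI GPT-4o Transcribe": (6.0, 6.0, "1k_min"),
    "Google Cloud Chirp 3": (4.0, 16.0, "1k_min"),
    "AssemblyAI Universal-2": (0.0062, 0.0062, "min"),
}

# Audio minutes covered by one billing unit.
MINUTES_PER_UNIT = {"min": 1, "hr": 60, "1k_min": 1000}


def per_minute(low, high, unit):
    """Convert a (low, high) price range to USD per audio minute."""
    minutes = MINUTES_PER_UNIT[unit]
    return low / minutes, high / minutes


def normalized_prices():
    """Return {api: (low, high)} with every price in USD per audio minute."""
    return {api: per_minute(*listed) for api, listed in LISTED_PRICES.items()}


if __name__ == "__main__":
    # Print the APIs from cheapest to most expensive by the low end of the range.
    ranked = sorted(normalized_prices().items(), key=lambda kv: kv[1][0])
    for api, (low, high) in ranked:
        print(f"{api}: ${low:.4f} to ${high:.4f} per minute")
```

On these listed figures, ElevenLabs works out to roughly $0.0037 to $0.0080 per minute, GPT-4o Transcribe to $0.0060, and Chirp 3 to $0.0040 to $0.0160, so all five APIs fall within the same order of magnitude.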

The lowest WER is now commercial (ElevenLabs Scribe ~3.3% EN, Deepgram 5.26% batch), below the open leaders’ ~5.6%, but the gap is small and dataset-dependent. This is the same closed-edge-over-open pattern as in text-to-speech, though the margin is thinner than on TTS Elo.
Pick by axis: real-time agents → Deepgram / ElevenLabs (~150–300ms); multilingual → Chirp / AssemblyAI; batch accuracy → GPT-4o Transcribe; built-in “speech intelligence” (sentiment/topics) → AssemblyAI.
The tradeoff: commercial APIs trade control for convenience, while open-source models (whisper, canary-qwen) win on cost and autonomy at heavy volume, where GPU economics favor self-hosting. This is the same build-vs-buy calculus that open-source-stt-models frames.
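The build-vs-buy point can be made concrete with a rough break-even calculation. The sketch below assumes self-hosting is a fixed monthly bill (GPU rental plus overhead) and the API is billed purely per audio minute. The $1,000 monthly figure is a hypothetical placeholder, not a number from the source, and the model ignores engineering time, redundancy, and accuracy differences.

```python
def break_even_minutes(api_usd_per_min, monthly_self_host_usd):
    """Monthly audio minutes at which a fixed self-hosting bill equals the API bill."""
    if api_usd_per_min <= 0:
        raise ValueError("api_usd_per_min must be positive")
    return monthly_self_host_usd / api_usd_per_min


def cheaper_option(monthly_minutes, api_usd_per_min, monthly_self_host_usd):
    """Return 'api' or 'self-host' for a given monthly volume under this simple model."""
    api_bill = monthly_minutes * api_usd_per_min
    return "api" if api_bill <= monthly_self_host_usd else "self-host"


if __name__ == "__main__":
    api_price = 0.0043      # Deepgram Nova-3 low-end price from the table, USD/min
    self_host_bill = 1000.0  # HYPOTHETICAL monthly GPU + overhead cost, USD

    threshold = break_even_minutes(api_price, self_host_bill)
    print(f"Break-even at about {threshold:,.0f} audio minutes per month")
    print(cheaper_option(50_000, api_price, self_host_bill))
    print(cheaper_option(500_000, api_price, self_host_bill))
```

Under these assumed inputs the crossover sits at a few hundred thousand audio minutes per month. The real threshold depends on actual GPU pricing and on whether a single machine can handle that volume.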