Best Text-to-Speech Models in 2026: A Benchmark-Based Comparison (MarkTechPost)
A field survey (2026-05-30) ranking the leading text-to-speech models, both proprietary and open-weight, by tts-benchmarks and matching them to use cases. One of the founding sources of this wiki.
Proprietary top tier (by Artificial Analysis Elo — dated snapshot)
- gemini 3.1 Flash TTS (Google, Apr 2026) — speech-as-language-modeling; 200+ audio tags, 70+ languages, 30 voices; no streaming, 32k context. Elo ~1216.
- Inworld Realtime TTS-2 / 1.5 — P90 latency <130ms (Mini) / 250ms (Max); 100+ languages.
- Cartesia Sonic 3.5 — a State Space Model for linear-scaling inference; ~82ms TTFA; 42 languages, 500+ voices. Elo ~1204.
- ElevenLabs v3 — inline audio tags ([whispers]/[laughs]), multi-speaker “Text to Dialogue”; not real-time (Flash v2.5 for low latency).
- MiniMax Speech 2.x HD, Hume Octave 2 (reads for meaning → emotion without tags), OpenAI gpt-4o-mini-tts / GPT-Realtime-2 (natural-language voice steering; ~$0.015/min), Speechify SIMBA 3.0 (budget), Deepgram Aura-2 (<90ms).
Open-weight field
- fish-audio-s2-pro — highest-ranked open weight; 5B, dual-AR + RVQ (neural-audio-codec); 80+ languages; research license (commercial = paid). Elo ~1123.
- kokoro 82M — most efficient; StyleTTS2+ISTFTNet, Apache-2.0; 4.5 MOS / 17% CER; <$1/1M chars.
- CosyVoice 2 (500M, ultra-low-latency streaming, voice-cloning), IndexTTS-2 (duration control for dubbing; timbre/emotion split), VibeVoice (Microsoft 1.5B; ~90-min long-form), Qwen3-TTS, Maya1.
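The Elo figures quoted in the two lists are easier to read as head-to-head preference rates. A minimal sketch, assuming the textbook Elo logistic formula with a 400-point scale; Artificial Analysis may fit its ratings differently, so treat the results as rough intuition and not as the leaderboard's own numbers. The function name `expected_win_rate` and the `ELO` table are hypothetical, and the ratings are the approximate values from this note.

```python
def expected_win_rate(elo_a: float, elo_b: float, scale: float = 400.0) -> float:
    """Probability that A is preferred over B under the standard Elo model.

    Assumption: classic logistic Elo with a 400-point scale, which may not
    match how the leaderboard actually computes its ratings.
    """
    return 1.0 / (1.0 + 10.0 ** ((elo_b - elo_a) / scale))


# Approximate Elo values quoted in this note (a dated snapshot).
ELO = {
    "Gemini 3.1 Flash TTS": 1216,
    "Cartesia Sonic 3.5": 1204,
    "fish-audio-s2-pro": 1123,
}

if __name__ == "__main__":
    top = "Gemini 3.1 Flash TTS"
    for name, rating in ELO.items():
        if name != top:
            p = expected_win_rate(ELO[top], rating)
            print(f"{top} vs {name}: {p:.1%}")
```

Under this assumption, the 12-point gap between the top two proprietary models is close to a coin flip (about 52%), while the roughly 93-point gap to the highest-ranked open-weight model corresponds to about 63%.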
The verdict
“No single model wins; pick by your binding constraint — latency, quality, language coverage, or cost.” Real-time → Sonic 3.5 / Inworld / Aura-2; long-form → ElevenLabs v3 / Gemini / VibeVoice; on-device → kokoro / CosyVoice 2; dubbing → IndexTTS-2; emotion → Hume Octave 2. The author stresses rankings shift weekly — treat leaderboard positions as dated snapshots (tts-benchmarks).
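The verdict's mapping can be sketched as a small lookup table. This is only an illustration of the "pick by binding constraint" idea: the constraint keys and the `shortlist` helper are hypothetical names, and the model lists are copied from the verdict as written in this note, so they are a dated snapshot too.

```python
# Constraint -> candidate models, copied from the verdict in this note.
# The keys are hypothetical labels chosen for this sketch.
RECOMMENDATIONS = {
    "real-time": ["Sonic 3.5", "Inworld", "Aura-2"],
    "long-form": ["ElevenLabs v3", "Gemini", "VibeVoice"],
    "on-device": ["kokoro", "CosyVoice 2"],
    "dubbing": ["IndexTTS-2"],
    "emotion": ["Hume Octave 2"],
}


def shortlist(constraint: str) -> list[str]:
    """Return the candidate models for one binding constraint."""
    key = constraint.strip().lower()
    if key not in RECOMMENDATIONS:
        raise KeyError(
            f"unknown constraint {constraint!r}; "
            f"expected one of {sorted(RECOMMENDATIONS)}"
        )
    # Return a copy so callers cannot mutate the shared table.
    return list(RECOMMENDATIONS[key])


if __name__ == "__main__":
    for constraint in RECOMMENDATIONS:
        print(f"{constraint}: {', '.join(shortlist(constraint))}")
```

A table like this only narrows the field; the shortlisted models still need to be checked against current leaderboard positions and pricing before committing.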
Related
text-to-speech · tts-benchmarks · open-weight-tts · kokoro · fish-audio-s2-pro · tts-arena-leaderboard