Artificial Analysis — Text-to-Speech Leaderboard
The neutral, Elo-based TTS leaderboard (the speech analog of an LLM leaderboard). It ranks text-to-speech models by blind human preference in a “Speech Arena”: listeners compare two samples and vote for the one that sounds more natural. It is the wiki’s anchor for tts-benchmarks. All numbers are a 2026-06 snapshot and churn weekly.
Top of the board (closed-weight)
- Fun-Realtime-TTS (Alibaba) — Elo 1227
- Gemini 3.1 Flash TTS (Google) — 1217
- Realtime TTS-2 (Inworld) — 1206
- Sonic 3.5 (Cartesia) — 1206
- Realtime TTS 1.5 Max (Inworld) — 1199
Every model in the top tier is closed/API-only; no open-weight model reaches the leaders. This is the same proprietary-frontier pattern llm-providers-wiki sees for text (open-weight-tts tracks the gap).
Highest open-weight models
- fish-audio-s2-pro — Elo 1128 (the open leader)
- Step Audio EditX — 1112
- Voxtral TTS — 1070
- Kokoro 82M v1.0 — 1064 (cheapest, $0.65 / 1M chars)
- Magpie-Multilingual 357M — 1059
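To make the closed vs open gap concrete, the sketch below converts an Elo difference into an expected head-to-head preference rate. It assumes the textbook Elo logistic with a 400-point scale; whether Artificial Analysis uses exactly this formula is not stated on this page. The function names `expected_win_rate` and `open_weight_gap` are illustrative, and the ratings are copied from the two lists above.

```python
# Elo values copied from the 2026-06 snapshot lists on this page.
CLOSED = {
    "Fun-Realtime-TTS": 1227,
    "Gemini 3.1 Flash TTS": 1217,
    "Realtime TTS-2": 1206,
    "Sonic 3.5": 1206,
    "Realtime TTS 1.5 Max": 1199,
}
OPEN = {
    "fish-audio-s2-pro": 1128,
    "Step Audio EditX": 1112,
    "Voxtral TTS": 1070,
    "Kokoro 82M v1.0": 1064,
    "Magpie-Multilingual 357M": 1059,
}


def expected_win_rate(elo_a: float, elo_b: float, scale: float = 400.0) -> float:
    """Textbook Elo expectation that A is preferred over B.

    Assumption: the standard 400-point logistic scale applies here.
    """
    return 1.0 / (1.0 + 10 ** ((elo_b - elo_a) / scale))


def open_weight_gap() -> int:
    """Elo points between the best closed and the best open-weight model."""
    return max(CLOSED.values()) - max(OPEN.values())


if __name__ == "__main__":
    gap = open_weight_gap()
    p = expected_win_rate(max(CLOSED.values()), max(OPEN.values()))
    print(f"gap: {gap} Elo, expected preference for the closed leader: {p:.1%}")
```

Under that assumption, the 99-point gap between Fun-Realtime-TTS and fish-audio-s2-pro corresponds to listeners preferring the closed leader in roughly 64% of direct comparisons: a clear lead, but not a rout.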
Method & caveats
Ratings are Elo scores derived from blind A/B votes in the Speech Arena. They reflect perceived naturalness, not WER or latency, so pair them with the tts-benchmarks page’s other axes (CER, MOS, TTFA). Vote populations and sample sets bias results; treat the rankings as indicative. Cross-source check: the MarkTechPost survey tts-models-2026-benchmark cites slightly different Elo values (e.g. Gemini 1216 vs 1217 here), which is expected snapshot drift and is flagged in synthesis.
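For intuition on how blind A/B votes become ratings, here is a minimal sketch of a sequential Elo update. The leaderboard's actual rating procedure is not described on this page, so everything here is an assumption: the K-factor of 32, the starting rating of 1000, the 400-point scale, and the one-vote-at-a-time update order (some arenas fit ratings in batch instead). The names `update` and `run_arena` are hypothetical.

```python
def expected_score(r_a: float, r_b: float, scale: float = 400.0) -> float:
    """Probability that A wins a vote under the textbook Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / scale))


def update(r_a: float, r_b: float, a_won: bool, k: float = 32.0) -> tuple[float, float]:
    """Apply one vote; the points A gains are exactly the points B loses."""
    delta = k * ((1.0 if a_won else 0.0) - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta


def run_arena(votes, start: float = 1000.0, k: float = 32.0) -> dict[str, float]:
    """Replay (winner, loser) votes in order and return the final ratings.

    Assumption: unseen models enter at `start`; K and start are illustrative.
    """
    ratings: dict[str, float] = {}
    for winner, loser in votes:
        r_w = ratings.get(winner, start)
        r_l = ratings.get(loser, start)
        ratings[winner], ratings[loser] = update(r_w, r_l, a_won=True, k=k)
    return ratings


if __name__ == "__main__":
    # Hypothetical votes between placeholder models, not real arena data.
    votes = [("model_a", "model_b"), ("model_a", "model_c"), ("model_c", "model_b")]
    for name, rating in sorted(run_arena(votes).items(), key=lambda kv: -kv[1]):
        print(f"{name}: {rating:.1f}")
```

Because each update depends on who happened to be paired and who voted, the same models can land on different ratings under a different vote population or sample set, which is why small differences on the board should be read as indicative.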
Related
tts-benchmarks · text-to-speech · open-weight-tts · fish-audio-s2-pro · kokoro · tts-models-2026-benchmark