
speech-audio-wiki

Dataset source, updated 2026-06-05 (UTC)

Artificial Analysis — Text-to-Speech Leaderboard

A neutral, Elo-based TTS leaderboard (the speech analog of an LLM leaderboard) that ranks text-to-speech models by blind human preference in a “Speech Arena”: listeners hear two samples and vote for the one that sounds more natural. It is this wiki’s anchor for tts-benchmarks. All numbers are a 2026-06 snapshot and churn weekly.

Top of the board (closed-weight)

  1. Fun-Realtime-TTS (Alibaba) — Elo 1227
  2. Gemini 3.1 Flash TTS (Google) — 1217
  3. Realtime TTS-2 (Inworld) — 1206
  4. Sonic 3.5 (Cartesia) — 1206
  5. Realtime TTS 1.5 Max (Inworld) — 1199

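Every model in the top tier is closed-weight and API-only. No open-weight model reaches the leaders: the best open-weight entry, fish-audio-s2-pro at Elo 1128, sits 71 points below fifth place (1199) and 99 points below first (1227). This is the same proprietary-frontier pattern that llm-providers-wiki sees for text; open-weight-tts tracks the gap.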

Highest open-weight models

  1. fish-audio-s2-pro — Elo 1128 (the open leader)
  2. Step Audio EditX — 1112
  3. Voxtral TTS — 1070
  4. Kokoro 82M v1.0 — 1064 (cheapest, at $0.65 per 1M characters)
  5. Magpie-Multilingual 357M — 1059
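
To put the gap between the two lists in more intuitive terms, an Elo difference can be converted into an expected head-to-head preference rate. The sketch below assumes the conventional logistic Elo curve with a 400-point scale. The source does not say which rating formula the leaderboard uses, so the result is rough intuition, not a published figure. The function name `expected_win_rate` is illustrative.

```python
def expected_win_rate(elo_a: float, elo_b: float, scale: float = 400.0) -> float:
    """Chance that A is preferred over B under the standard logistic Elo model.

    Assumption: the 400-point logistic curve; the leaderboard's actual
    formula is not given in the source.
    """
    return 1.0 / (1.0 + 10.0 ** ((elo_b - elo_a) / scale))


# Ratings taken from the 2026-06 snapshot quoted on this page.
closed_leader = 1227  # Fun-Realtime-TTS
open_leader = 1128    # fish-audio-s2-pro

gap = closed_leader - open_leader
p = expected_win_rate(closed_leader, open_leader)
print(f"gap = {gap} Elo, closed leader preferred in about {p:.0%} of pairings")
```

Under that assumption, a 99-point gap corresponds to the closed leader being preferred in roughly 64% of blind pairings against the open leader: a clear edge, but far from a sweep.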

Method & caveats

Ratings are Elo scores computed from blind A/B votes in the Speech Arena. They reflect perceived naturalness, not word error rate (WER) or latency, so pair them with the other axes on the tts-benchmarks page: character error rate (CER), mean opinion score (MOS), and time to first audio (TTFA). Vote populations and sample sets bias the results; treat the rankings as indicative. Cross-source check: the MarkTechPost survey tts-models-2026-benchmark cites slightly different Elo values (e.g. Gemini at 1216 versus 1217 here). This is expected snapshot drift and is flagged in synthesis.
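
As an illustration of how blind A/B votes can turn into ratings, here is a minimal online Elo update. It is a sketch under stated assumptions, not the leaderboard's actual pipeline: the K-factor of 32, the starting rating of 1000, the 400-point scale, and the model names are all hypothetical. Arena-style leaderboards may instead fit a Bradley-Terry model over all votes at once; the source does not specify.

```python
from typing import Dict, Iterable, Tuple


def update_elo(
    ratings: Dict[str, float],
    votes: Iterable[Tuple[str, str]],
    k: float = 32.0,         # hypothetical K-factor
    scale: float = 400.0,    # conventional Elo scale (assumed)
    initial: float = 1000.0  # hypothetical starting rating
) -> Dict[str, float]:
    """Apply (winner, loser) votes one at a time, updating ratings in place."""
    for winner, loser in votes:
        r_w = ratings.setdefault(winner, initial)
        r_l = ratings.setdefault(loser, initial)
        # Probability the eventual winner was expected to win before the vote.
        expected_w = 1.0 / (1.0 + 10.0 ** ((r_l - r_w) / scale))
        # An upset (low expected_w) moves ratings more than an expected result.
        delta = k * (1.0 - expected_w)
        ratings[winner] = r_w + delta
        ratings[loser] = r_l - delta
    return ratings


# Hypothetical votes: each tuple is (preferred sample's model, other model).
votes = [("model_a", "model_b"), ("model_a", "model_c"), ("model_b", "model_c")]
ratings = update_elo({}, votes)
for name, score in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.1f}")
```

Each update is zero-sum, so the total rating pool stays constant. Online updates also depend on vote order, which is one reason two snapshots taken days apart can differ by a point or two.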

tts-benchmarks · text-to-speech · open-weight-tts · fish-audio-s2-pro · kokoro · tts-models-2026-benchmark