Defined Term · concept · updated 2026-06-05 (UTC)

TTS benchmarks

How text-to-speech quality is measured — and why “best” is multi-axis and volatile. The leading public ranking is the Speech Arena Elo board (tts-arena-leaderboard).

The metrics

Elo (Speech Arena): the headline number. Blind A/B human preference votes are aggregated into an Elo rating; it captures perceived naturalness but says nothing about latency or accuracy.
MOS (Mean Opinion Score): naturalness rated by listeners on a 1 to 5 scale, typically on ~10-second clips (automated predictors such as UTMOS are trained on short clips, which bounds the clip length); kokoro scores ~4.5 MOS.
WER / CER: word/character error rate, measured by transcribing the synthesized audio with an ASR model and comparing the transcript to the input text; the result depends on the ASR model, so it is noisy. fish-audio-s2-pro: ~3.5% WER / 1.2% CER (English, vendor figure).
TTFA (time-to-first-audio): the latency metric that matters for real-time UX; Sonic 3.5 ~82 ms, Deepgram Aura-2 <90 ms, misotts 110 ms (claimed).
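
As a rough illustration of how two of these numbers are computed, the sketch below implements the textbook Elo update for a single A/B vote and WER/CER as edit distance divided by reference length. The K-factor of 32, the 400-point scale, and all function names are assumptions for illustration; the arena's actual rating procedure and any vendor's text normalization are not specified here and may differ.

```python
from typing import Sequence, Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float, a_won: bool,
               k: float = 32.0) -> Tuple[float, float]:
    """Apply one blind A/B vote. k=32 is an assumed K-factor, not the arena's."""
    delta = k * ((1.0 if a_won else 0.0) - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance (substitutions, insertions, deletions) between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over the reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance over reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Two equally rated systems; A wins one vote.
    print(elo_update(1200.0, 1200.0, a_won=True))
    # Input text vs. the ASR transcript of the synthesized audio.
    print(wer("the cat sat on the mat", "the cat sat on a mat"))
```

Because `hypothesis` is itself the output of an ASR model, transcription mistakes are counted as if they were synthesis mistakes, which is why the same TTS system can score differently under different ASR back ends.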

Caveats (why these are snapshots)

Rankings shift weekly: tts-models-2026-benchmark stresses that leaderboard positions are dated snapshots, not fixed truth (this wiki dates every figure).
Sources disagree: Elo values differ across boards (Gemini 1216 vs 1217; Fish S2 Pro 1123 vs 1128) because of different vote pools and sample sets; a tension flagged in synthesis.
Vendor-reported numbers (latency, WER) come from parties with an incentive to look good; treat them as indicative.
Metrics are partial: Elo ≠ accuracy ≠ latency. A model wins or loses per axis.
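
One way to make "wins or loses per axis" concrete is a Pareto comparison: a model is only clearly worse if another model is at least as good on every axis and strictly better on one. The sketch below assumes three axes (Elo, WER, TTFA); the model names and all numbers are hypothetical placeholders chosen for illustration, not measurements of any real system.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Scores:
    name: str
    elo: float      # higher is better
    wer: float      # lower is better (fraction, e.g. 0.05 = 5%)
    ttfa_ms: float  # lower is better


def dominates(a: Scores, b: Scores) -> bool:
    """True if a is no worse than b on every axis and strictly better on at least one."""
    no_worse = a.elo >= b.elo and a.wer <= b.wer and a.ttfa_ms <= b.ttfa_ms
    strictly_better = a.elo > b.elo or a.wer < b.wer or a.ttfa_ms < b.ttfa_ms
    return no_worse and strictly_better


def pareto_front(models: List[Scores]) -> List[Scores]:
    """Keep every model that no other model dominates."""
    return [m for m in models if not any(dominates(o, m) for o in models)]


# Hypothetical placeholder data, NOT real benchmark figures.
models = [
    Scores("model_a", elo=1200, wer=0.050, ttfa_ms=300),  # best preference score
    Scores("model_b", elo=1100, wer=0.030, ttfa_ms=250),  # best accuracy
    Scores("model_c", elo=1050, wer=0.060, ttfa_ms=90),   # best latency
    Scores("model_d", elo=1040, wer=0.070, ttfa_ms=320),  # worse than model_a everywhere
]

front = pareto_front(models)
print([m.name for m in front])  # ['model_a', 'model_b', 'model_c']
```

Three of the four placeholder models survive, each "best" on a different axis, so a single ranking would hide the trade-off; only the model that loses on every axis drops out.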

Related

tts-arena-leaderboard · text-to-speech · open-weight-tts · tts-models-2026-benchmark
