TTS benchmarks
How text-to-speech quality is measured — and why “best” is multi-axis and volatile. The leading public ranking is the Speech Arena Elo board (tts-arena-leaderboard).
The metrics
- Elo (Speech Arena): blind A/B human preference votes are aggregated into an Elo rating. It captures perceived naturalness but says nothing about latency or accuracy. This is the headline number.
- MOS (Mean Opinion Score): listener-rated naturalness on a 1-to-5 scale, typically on ~10-second clips (the length is bounded by what UTMOS was trained on); kokoro scores ~4.5 MOS.
- WER / CER: word and character error rate, measured by transcribing the synthesized audio with an ASR model and comparing the transcript against the input text. The result depends on the ASR model, so it is noisy. fish-audio-s2-pro reports ~3.5% WER / 1.2% CER (English, vendor figure).
- TTFA (time-to-first-audio): the latency metric that matters for real-time UX. Sonic 3.5 ~82 ms, Deepgram Aura-2 <90 ms, misotts 110 ms (claimed).
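To make the WER / CER definition concrete, here is a minimal sketch of how both rates can be computed once an ASR transcript is available. It assumes the usual definition (edit distance divided by reference length) and a simple lowercase-and-split normalization; the function names are illustrative, and real evaluations typically apply heavier text normalization, which is one more reason published figures are hard to compare.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference token missing from transcript
                curr[j - 1] + 1,     # insertion: extra token in transcript
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    if not ref_words:
        raise ValueError("reference text must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance over reference length."""
    ref_chars = list(reference.lower())
    hyp_chars = list(hypothesis.lower())
    if not ref_chars:
        raise ValueError("reference text must not be empty")
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


if __name__ == "__main__":
    # Hypothetical example: text sent to the TTS model vs. the ASR transcript.
    text_in = "the cat sat on the mat"
    transcript = "the cat sat on a mat"
    print(f"WER = {wer(text_in, transcript):.3f}")
    print(f"CER = {cer(text_in, transcript):.3f}")
```

Because the transcript comes from an ASR model, any recognition error is charged to the TTS system, so the same audio can yield different WER values under different ASR models.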
Caveats (why these are snapshots)
- Rankings shift weekly — tts-models-2026-benchmark stresses leaderboard positions are dated snapshots, not fixed truth (this wiki dates every figure).
- Sources disagree — Elo values differ across boards (Gemini 1216 vs 1217; Fish S2 Pro 1123 vs 1128) due to different vote pools/sample sets; a tension flagged in synthesis.
- Vendor-reported numbers (latency, WER) come from parties with an incentive to report favorable results; treat them as indicative, not verified.
- Metrics are partial: Elo ≠ accuracy ≠ latency. A model wins or loses per axis.
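The point about boards disagreeing can be illustrated with a small sketch of a sequential Elo update. The constants here (starting rating 1000, K-factor 32, the 400-point logistic scale) are the conventional chess defaults and are assumptions, not the parameters of any particular speech leaderboard; the model names and vote pools are hypothetical.

```python
def expected_score(r_a, r_b):
    """Probability that A is preferred over B under the Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update(r_a, r_b, a_won, k=32.0):
    """Apply one blind A/B vote and return the new (r_a, r_b)."""
    delta = k * ((1.0 if a_won else 0.0) - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta  # zero-sum: what A gains, B loses


def rate(votes, start=1000.0, k=32.0):
    """Replay (model_a, model_b, winner) votes in order and return ratings."""
    ratings = {}
    for a, b, winner in votes:
        r_a = ratings.setdefault(a, start)
        r_b = ratings.setdefault(b, start)
        ratings[a], ratings[b] = update(r_a, r_b, winner == a, k)
    return ratings


if __name__ == "__main__":
    win = ("model_x", "model_y", "model_x")   # hypothetical vote for model_x
    loss = ("model_x", "model_y", "model_y")  # hypothetical vote for model_y

    # Same 3:1 preference split, but the votes arrive in a different order.
    pool_a = [win, win, win, loss]
    pool_b = [loss, win, win, win]

    print(rate(pool_a))
    print(rate(pool_b))
```

Even with an identical win ratio, the two pools end at different ratings, so gaps of a few points between boards are within what differing vote pools alone can produce.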
Related
tts-arena-leaderboard · text-to-speech · open-weight-tts · tts-models-2026-benchmark