Defined Term domain updated Fri Jun 05 2026 00:00:00 GMT+0000 (Coordinated Universal Time)

Text-to-speech (TTS)

Text-to-speech synthesizes natural-sounding speech audio from text, optionally conditioned on a reference voice or an emotional cue. It is the synthesis branch of speech-audio-ai (sibling to recognition, speech-to-text, and audio-music-generation) and the founding, deepest corner of this wiki. Modern TTS is dominated by neural models in which a transformer generates discrete audio tokens (the codes of a neural-audio-codec) and the codec's decoder then turns those tokens into a waveform. These models increasingly reuse text-LLM backbones (Llama, gemini).
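The two-stage shape described above (autoregressive token generation, then codec decoding) can be sketched in a few lines. This is a toy illustration only: the function names, the codebook size, the frame length, and the sample rate are all assumptions made for the example, and both stages are placeholders (a seeded random generator instead of a trained transformer, a sine-frame lookup instead of a learned codec decoder). It shows the data flow, not any real model's behavior.

```python
import math
import random
from typing import List

CODEBOOK_SIZE = 1024      # assumed codec vocabulary size (illustrative)
SAMPLES_PER_TOKEN = 320   # assumed samples decoded per token (illustrative)
SAMPLE_RATE = 16_000      # assumed output sample rate in Hz (illustrative)


def generate_audio_tokens(text: str, tokens_per_char: int = 2) -> List[int]:
    """Placeholder for the transformer stage: emit tokens one at a time."""
    rng = random.Random(text)  # deterministic stand-in for a learned model
    tokens: List[int] = []
    for _ in range(len(text) * tokens_per_char):
        # A real model would sample from p(token | text, tokens so far);
        # here the previous token is merely mixed into a random draw.
        context = tokens[-1] if tokens else 0
        tokens.append((context + rng.randrange(CODEBOOK_SIZE)) % CODEBOOK_SIZE)
    return tokens


def decode_to_waveform(tokens: List[int]) -> List[float]:
    """Placeholder for the codec decoder: one short sine frame per token."""
    samples: List[float] = []
    for tok in tokens:
        freq = 100.0 + tok  # arbitrary token-to-pitch mapping, not a real codec
        for n in range(SAMPLES_PER_TOKEN):
            samples.append(math.sin(2 * math.pi * freq * n / SAMPLE_RATE))
    return samples


def synthesize(text: str) -> List[float]:
    """Text -> discrete audio tokens -> waveform samples in [-1, 1]."""
    return decode_to_waveform(generate_audio_tokens(text))
```

The point of the split is that the expensive sequence model only has to predict a short stream of integers, while the codec decoder handles the dense waveform. In this sketch the two-character input "hi" yields 4 tokens and therefore 4 × 320 = 1280 samples.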

The competitive axes

No single model wins; the field sorts along six constraints (tts-models-2026-benchmark):

Voice quality / naturalness: measured by Elo rating and mean opinion score (MOS) (tts-benchmarks).
Latency: time-to-first-audio (TTFA), the gating constraint for real-time agents (e.g. Sonic 3.5 at ~82 ms).
Accuracy: word or character error rate (WER/CER).
Language coverage: from about 15 to more than 100 languages, depending on the model.
Capabilities: voice-cloning, emotion control, multi-speaker dialogue, streaming.
Cost & licensing: API price vs self-hosting; permissive vs research-only licenses (open-weight-tts).
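Two of the axes above, accuracy (WER/CER) and Elo-based quality ranking, reduce to simple arithmetic, sketched below. For TTS, the usual practice is to transcribe the synthesized audio with a speech recognizer and score that transcript against the input text; the transcription step is outside this sketch. The function names, the lowercase-and-split normalization, and the K-factor of 32 are assumptions chosen for illustration, not the procedure of any particular benchmark or leaderboard.

```python
from typing import Sequence, Tuple


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance: minimum substitutions, insertions, deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference unit
                            curr[j - 1] + 1,      # insert a hypothesis unit
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref_words = reference.lower().split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.lower().split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance / reference length."""
    ref_chars = list(reference.lower())
    if not ref_chars:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_chars, list(hypothesis.lower())) / len(ref_chars)


def elo_update(rating_a: float, rating_b: float, a_won: bool,
               k: float = 32.0) -> Tuple[float, float]:
    """Update two Elo ratings after one pairwise preference vote."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    delta = k * ((1.0 if a_won else 0.0) - expected_a)
    return rating_a + delta, rating_b - delta
```

As a usage example, `wer("the cat sat on the mat", "the cat sat on mat")` counts one deleted word against six reference words, giving 1/6. Calling `elo_update(1500.0, 1500.0, True)` moves two equally rated models to 1516 and 1484, since the expected score for each is 0.5.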

The market shape

A closed frontier (Google gemini Flash TTS, Cartesia Sonic, Inworld, ElevenLabs, OpenAI) leads on Elo, while an open-weight field (open-weight-tts: kokoro, fish-audio-s2-pro, orpheus, sesame-csm, misotts, …) competes on cost, control, and on-device deployment. This is structurally the same split that llm-providers-wiki documents for text models.

Proprietary providers: Google, OpenAI, xAI, Cartesia, Inworld, ElevenLabs, Hume, Deepgram, MiniMax.
Open or research providers: Fish Audio, Hexgrad, Canopy AI, Sesame, Resemble AI, Miso Labs, Alibaba, Microsoft.

tts-benchmarks · open-weight-tts · voice-cloning · neural-audio-codec · tts-models-2026-benchmark · tts-arena-leaderboard
