Text-to-speech (TTS)
Synthesizing natural-sounding speech audio from text, optionally conditioned on a reference voice or an emotional cue. This is the synthesis branch of speech-audio-ai (sibling to recognition, speech-to-text, and audio-music-generation) and the founding, deepest corner of this wiki. Modern TTS is dominated by neural models that use a transformer to generate discrete audio tokens (the vocabulary of a neural-audio-codec) and then decode those tokens to a waveform, increasingly reusing text-LLM backbones (Llama, gemini).
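The two-stage shape described above (text to discrete audio tokens, then tokens to waveform) can be sketched as a toy pipeline. Everything here is a hypothetical stand-in, not any real model's API: `generate_audio_tokens` plays the role of the transformer, `decode_tokens` plays the role of the codec decoder, and `CODEBOOK_SIZE` and `SAMPLES_PER_TOKEN` are made-up toy values.

```python
import math
from typing import Iterator, List

# Toy values chosen for illustration only; real codecs use far larger
# codebooks and much longer frames per token.
CODEBOOK_SIZE = 8
SAMPLES_PER_TOKEN = 4


def generate_audio_tokens(text: str) -> Iterator[int]:
    """Hypothetical stand-in for the transformer stage.

    A real model predicts codec tokens autoregressively; this toy simply
    maps each character to one token id in [0, CODEBOOK_SIZE).
    """
    for ch in text:
        yield ord(ch) % CODEBOOK_SIZE


def decode_tokens(tokens: List[int]) -> List[float]:
    """Hypothetical stand-in for the codec decoder stage.

    Each token id is expanded into a short frame of samples in [-1, 1].
    """
    waveform: List[float] = []
    for tok in tokens:
        amplitude = (tok + 1) / CODEBOOK_SIZE
        for n in range(SAMPLES_PER_TOKEN):
            phase = 2 * math.pi * n / SAMPLES_PER_TOKEN
            waveform.append(amplitude * math.sin(phase))
    return waveform


def synthesize(text: str) -> List[float]:
    """Run both stages: text -> discrete tokens -> waveform samples."""
    tokens = list(generate_audio_tokens(text))
    return decode_tokens(tokens)


if __name__ == "__main__":
    samples = synthesize("hi")
    print(len(samples), "samples")
```

The point of the sketch is only the interface between the stages: the generator emits a sequence of integers, and the decoder owns the mapping from integers to audio. Because tokens arrive one at a time, a decoder can in principle start producing audio before generation finishes, which is what makes streaming possible.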
The competitive axes
No single model wins; the field sorts along several competing constraints (tts-models-2026-benchmark):
- Voice quality / naturalness: measured by Elo ratings and mean opinion score (MOS) (tts-benchmarks).
- Latency: time-to-first-audio (TTFA), the gating metric for real-time agents (Sonic 3.5: ~82 ms).
- Accuracy: word/character error rate (WER/CER).
- Language coverage: ranges from about 15 to 100+ languages, depending on the model.
- Capabilities: voice-cloning, emotion control, multi-speaker dialogue, streaming.
- Cost & licensing: API price vs self-hosting; permissive vs research-only licenses (open-weight-tts).
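Two of the axes above, accuracy and latency, are simple to measure yourself. The sketch below is a minimal illustration under stated assumptions, not a benchmark harness: it assumes WER/CER are computed by comparing the input text against a transcript of the synthesized audio (produced by whatever speech-to-text system you trust), and it assumes the TTS client exposes audio as a lazy iterable of byte chunks. The function names are hypothetical, and the text normalization (lowercasing and whitespace splitting) is deliberately crude.

```python
import time
from typing import Iterable, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance: substitutions, insertions, and deletions cost 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = character-level edit distance / number of reference characters."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


def time_to_first_audio(chunks: Iterable[bytes]) -> float:
    """Seconds from starting to consume a stream until the first non-empty chunk.

    Assumes `chunks` is lazy (for example a generator), so that the request
    is only issued once iteration begins and its cost is included.
    """
    start = time.perf_counter()
    for chunk in chunks:
        if chunk:
            return time.perf_counter() - start
    raise RuntimeError("stream ended without producing audio")


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Note that WER can exceed 1.0 when the transcript contains many inserted words, and that a TTFA measured this way includes network and client overhead, so it is not directly comparable to a vendor's published figure.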
The market shape
A closed frontier (Google gemini Flash TTS, Cartesia Sonic, Inworld, ElevenLabs, OpenAI) leads on Elo, while an open-weight field (open-weight-tts: kokoro, fish-audio-s2-pro, orpheus, sesame-csm, misotts, …) competes on cost, control, and on-device deployment. Structurally, this is the same split that llm-providers-wiki documents for text models. Proprietary providers: Google, OpenAI, xAI, Cartesia, Inworld, ElevenLabs, Hume, Deepgram, MiniMax. Open or research providers: Fish Audio, Hexgrad, Canopy AI, Sesame, Resemble AI, Miso Labs, Alibaba, Microsoft.
Related
tts-benchmarks · open-weight-tts · voice-cloning · neural-audio-codec · tts-models-2026-benchmark · tts-arena-leaderboard
Linked from
- log
- synthesis
- index
- audio-flamingo-3
- audio-deepfake
- audio-music-generation
- canary-qwen
- elevenlabs-expressive-mode
- elevenlabs
- fish-audio-s2-pro
- gemini-live-3-5-translate
- gpt-realtime-2
- misotts
- kokoro
- neural-audio-codec
- open-source-tts-models
- open-source-stt-models
- open-weight-tts
- orpheus
- sesame-csm
- speech-audio-ai
- speech-to-speech-translation
- speech-to-text
- stt-apis-comparison
- tts-arena-leaderboard
- tts-benchmarks
- tts-models-2026-benchmark
- voice-cloning
- wavenet