speech-audio-wiki

Defined Term domain · updated 2026-06-05

Text-to-speech (TTS)

Synthesizing natural speech audio from text, optionally conditioned on a reference voice or emotional cue. TTS is the synthesis branch of speech-audio-ai (sibling to recognition, speech-to-text, and audio-music-generation) and the founding, deepest corner of this wiki. Modern TTS is dominated by neural models that use a transformer to generate discrete audio tokens (the token vocabulary of a neural-audio-codec), then decode those tokens to a waveform. These models increasingly reuse text-LLM backbones (Llama, Gemini).
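The token-then-decode pipeline described above can be sketched as a toy in plain Python. This is only an illustration of the data flow, not any real model: the "transformer" is replaced by a deterministic recurrence, the "codec decoder" by a sine-tone lookup, and every name and constant here (`text_to_ids`, `generate_audio_tokens`, `decode_to_waveform`, `SAMPLE_RATE`, `FRAME_SAMPLES`, `CODEBOOK_SIZE`) is a hypothetical stand-in chosen for the example.

```python
import math

SAMPLE_RATE = 16_000   # Hz; assumed for illustration
FRAME_SAMPLES = 320    # samples per codec frame (20 ms at 16 kHz); assumed
CODEBOOK_SIZE = 64     # number of distinct audio tokens; assumed


def text_to_ids(text: str) -> list[int]:
    """Stand-in for a text tokenizer: one ID per UTF-8 byte."""
    return list(text.encode("utf-8"))


def generate_audio_tokens(text_ids: list[int], frames_per_symbol: int = 2) -> list[int]:
    """Stand-in for the transformer: emit audio tokens one at a time,
    each depending on the text and on the previously emitted token."""
    tokens: list[int] = []
    prev = 0
    for tid in text_ids:
        for _ in range(frames_per_symbol):
            # A real model would sample from a learned distribution here.
            prev = (31 * prev + tid) % CODEBOOK_SIZE
            tokens.append(prev)
    return tokens


def decode_to_waveform(tokens: list[int]) -> list[float]:
    """Stand-in for the codec decoder: each token becomes one frame of a sine tone."""
    samples: list[float] = []
    for tok in tokens:
        freq = 100.0 + 20.0 * tok  # hypothetical token-to-pitch mapping
        for n in range(FRAME_SAMPLES):
            samples.append(0.3 * math.sin(2 * math.pi * freq * n / SAMPLE_RATE))
    return samples


def synthesize(text: str) -> list[float]:
    """Text -> discrete audio tokens -> waveform samples."""
    return decode_to_waveform(generate_audio_tokens(text_to_ids(text)))
```

With these assumed constants, `synthesize("hi")` returns 1280 samples: 2 text bytes, 2 frames per byte, 320 samples per frame. In a real system the middle stage is where the reused text-LLM backbone would sit, and the last stage is the learned decoder of the neural audio codec.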

The competitive axes

No single model wins; models trade off against one another along several constraints, per tts-models-2026-benchmark.

The market shape

A closed frontier (Google Gemini Flash TTS, Cartesia Sonic, Inworld, ElevenLabs, OpenAI) leads on Elo, while an open-weight field (open-weight-tts: kokoro, fish-audio-s2-pro, orpheus, sesame-csm, misotts, …) competes on cost, control, and on-device deployment. This is structurally the same split that llm-providers-wiki documents for text models.

Providers (proprietary): Google, OpenAI, xAI, Cartesia, Inworld, ElevenLabs, Hume, Deepgram, MiniMax.

Providers (open/research): Fish Audio, Hexgrad, Canopy AI, Sesame, Resemble AI, Miso Labs, Alibaba, Microsoft.
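For readers unfamiliar with Elo as a ranking signal, the sketch below shows how a single pairwise listener preference moves two ratings under the classic Elo update. It is an assumption that the leaderboards referenced here use this exact rule; arena-style rankings sometimes fit a related pairwise model instead. The scale of 400 and `k = 32` are conventional defaults, not values taken from this wiki.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update(r_a: float, r_b: float, a_won: bool, k: float = 32.0) -> tuple[float, float]:
    """Return the new (r_a, r_b) after one head-to-head vote."""
    s_a = 1.0 if a_won else 0.0
    # The winner gains exactly what the loser gives up.
    delta = k * (s_a - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta
```

Two equally rated voices at 1500 end at 1516 and 1484 after one vote, since the expected score for each is 0.5. An upset win over a higher-rated model moves the ratings more than an expected win does.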

tts-benchmarks · open-weight-tts · voice-cloning · neural-audio-codec · tts-models-2026-benchmark · tts-arena-leaderboard