Defined Term · domain · updated Fri Jun 05 2026 (UTC)

Speech & audio AI (umbrella)

The top-level subject of this wiki: AI models that produce or interpret speech and audio. Three branches, plus the cross-cutting dynamics that unify them.

The three branches

text-to-speech (TTS / synthesis): text → speech audio. The wiki's founding and most deeply covered branch. Models: kokoro, fish-audio-s2-pro, orpheus, sesame-csm, misotts; hosted APIs: Gemini, Cartesia, ElevenLabs.
speech-to-text (STT / ASR): speech audio → text. Models: whisper, canary-qwen, NVIDIA Parakeet, IBM Granite; hosted APIs: Deepgram, AssemblyAI, Google Chirp.
audio-music-generation: text → music/audio. Examples: suno, udio, stable-audio, ElevenLabs Music, AIVA.

Composite / applied task (spans the branches):

speech-to-speech-translation (S2ST): speech in → speech out across languages. Classically a cascade of STT, translation, and TTS; increasingly end-to-end and streaming. Exemplar: gemini-live-3-5-translate.
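
The cascaded form of S2ST can be sketched as plain function composition. This is a minimal illustration, not how any particular product is built: `make_cascade`, the `Audio` alias, and the three toy stages are hypothetical names introduced here, and a real system would put actual STT, translation, and TTS models behind the same three call signatures.

```python
from typing import Callable

Audio = bytes  # hypothetical placeholder; real systems pass waveforms or streams


def make_cascade(
    stt: Callable[[Audio], str],
    translate: Callable[[str], str],
    tts: Callable[[str], Audio],
) -> Callable[[Audio], Audio]:
    """Compose three stages into one speech-in, speech-out function."""

    def s2st(audio: Audio) -> Audio:
        transcript = stt(audio)             # speech -> source-language text
        translated = translate(transcript)  # source text -> target text
        return tts(translated)              # target text -> speech

    return s2st


# Toy stand-ins so the sketch runs without any model (assumptions, not real APIs).
def toy_stt(audio: Audio) -> str:
    return audio.decode("utf-8")


def toy_translate(text: str) -> str:
    return {"hello": "hola"}.get(text, text)


def toy_tts(text: str) -> Audio:
    return text.encode("utf-8")


pipeline = make_cascade(toy_stt, toy_translate, toy_tts)
print(pipeline(b"hello"))  # b'hola'
```

An end-to-end model would replace the whole composition with a single audio-to-audio call, which removes the intermediate text hand-offs where a cascade accumulates latency and errors.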

Cross-cutting dynamics (the shared thesis)

The same structure recurs across all three:

Closed frontier vs. open-weight field: proprietary leaders edge out open-weight models on the headline metric (TTS Elo, STT WER, music Elo), but open models narrow the overall gap through lower cost and greater control (open-weight-tts, stable-audio, whisper).
Per-axis evaluation: “best” is decided by whichever constraint binds: quality (Elo/MOS), accuracy (WER/CER), latency (time to first audio, TTFA; streaming RTFx), languages, capabilities, or cost (tts-benchmarks).
Convergence on the LLM stack: TTS rides Llama/codec backbones (neural-audio-codec); STT fuses speech encoders with LLM decoders (SALM: canary-qwen, Qwen3-ASR). “Speech as language modeling” is the structural bridge to llm-providers-wiki; gemini straddles both wikis, appearing here as Gemini 3.x Flash TTS.
License & consent as first-class concerns: research-only license ceilings (fish-audio-s2-pro) and the music copyright wars (ai-music-copyright); voice cloning adds impersonation risk.
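
The accuracy and throughput axes named under per-axis evaluation reduce to short formulas. The sketch below assumes the usual definitions: WER and CER as Levenshtein edit distance divided by reference length (over words and characters respectively), and RTFx as audio duration divided by processing time. The function names are illustrative, and published benchmarks typically normalize text (case, punctuation) before scoring, which this sketch skips.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute/insert/delete = 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edits divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor: seconds of audio handled per second of compute."""
    return audio_seconds / processing_seconds


print(round(wer("the cat sat on the mat", "the cat sat on mat"), 3))  # 0.167
print(rtfx(60.0, 2.0))  # 30.0
```

Lower is better for WER and CER, while higher is better for RTFx, so a comparison across models has to keep the direction of each axis in mind.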

text-to-speech · speech-to-text · audio-music-generation · speech-to-speech-translation · tts-benchmarks · ai-music-copyright · open-weight-tts
