Speech & audio AI (umbrella)
The top-level subject of this wiki: AI models that produce or interpret speech and audio. Three branches, plus the cross-cutting dynamics that unify them.
The three branches
- text-to-speech (TTS / synthesis) — text → speech audio. The wiki's founding and deepest branch (kokoro, fish-audio-s2-pro, orpheus, sesame-csm, misotts; Gemini/Cartesia/ElevenLabs).
- speech-to-text (STT / ASR) — speech audio → text. whisper, canary-qwen, NVIDIA Parakeet, IBM Granite; Deepgram/AssemblyAI/Google Chirp APIs.
- audio-music-generation — text → music/audio. suno, udio, stable-audio, ElevenLabs Music, AIVA.
Composite / applied task (spans the branches):
- speech-to-speech-translation (S2ST) — speech in → speech out across languages (STT + translation + TTS, increasingly end-to-end & streaming). Exemplar: gemini-live-3-5-translate.
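The cascaded form of S2ST described above (STT, then translation, then TTS) can be sketched as plain function composition. This is a minimal illustration, not any product's architecture: `cascade_s2st` and the three `fake_*` stages are hypothetical stand-ins, and a real system would wrap actual models or APIs behind the same signatures.

```python
from typing import Callable

# Hypothetical stage signatures (assumptions for illustration only).
Transcribe = Callable[[bytes], str]   # speech audio -> source-language text
Translate = Callable[[str], str]      # source text -> target-language text
Synthesize = Callable[[str], bytes]   # target text -> speech audio


def cascade_s2st(audio: bytes, stt: Transcribe, mt: Translate, tts: Synthesize) -> bytes:
    """Run the three stages in sequence; each stage waits for the previous one."""
    source_text = stt(audio)
    target_text = mt(source_text)
    return tts(target_text)


# Toy stand-ins so the sketch runs without any model: "audio" is just UTF-8 text.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_mt(text: str) -> str:
    return {"hello": "hola"}.get(text, text)


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


result = cascade_s2st(b"hello", fake_stt, fake_mt, fake_tts)
```

Because each stage blocks on the one before it, latency in a cascade adds up across stages, which is one motivation for the end-to-end and streaming designs mentioned above.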
Cross-cutting dynamics (the shared thesis)
The same structure recurs across all three:
- Closed frontier vs. open-weight field — proprietary leaders edge out open-weight models on the headline metric (TTS Elo, STT WER, music Elo), while open models compensate with lower cost and greater control (open-weight-tts, stable-audio, whisper).
- Per-axis evaluation — “best” depends on whichever constraint binds for the use case: quality (Elo/MOS), accuracy (WER/CER), latency (TTFA / streaming RTFx), languages, capabilities, or cost (tts-benchmarks).
- Convergence on the LLM stack — TTS rides Llama/codec backbones (neural-audio-codec); STT fuses speech encoders with LLM decoders (speech-augmented language models, SALM: canary-qwen, Qwen3-ASR). “Speech as language modeling” is the structural bridge to llm-providers-wiki (gemini straddles both wikis, as Gemini 3.x Flash TTS).
- License & consent as first-class concerns — research-only license ceilings (fish-audio-s2-pro) and the music copyright wars (ai-music-copyright); voice cloning adds impersonation risk.
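The accuracy and latency axes named under per-axis evaluation can be made concrete with a small sketch. It assumes the common definitions: WER and CER as edit distance divided by reference length (over words and characters respectively), and RTFx as audio duration divided by processing time. The function names are illustrative, and real benchmarks usually normalize text (casing, punctuation) before scoring, which this sketch omits.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance over reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor: seconds of audio processed per second of compute."""
    return audio_seconds / processing_seconds


example_wer = wer("the cat sat on the mat", "the cat sat on mat")  # one deleted word
example_rtfx = rtfx(60.0, 2.0)  # 60 s of audio transcribed in 2 s
```

Lower is better for WER and CER, while higher is better for RTFx, so the axes cannot be collapsed into a single ranking without first deciding which constraint binds.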
Related
text-to-speech · speech-to-text · audio-music-generation · speech-to-speech-translation · tts-benchmarks · ai-music-copyright · open-weight-tts
Linked from
- log
- synthesis
- index
- ai-music-copyright
- ai-music-generators-2026
- audio-flamingo-3
- audio-deepfake
- audio-music-generation
- canary-qwen
- elevenlabs
- gemini-live-3-5-translate
- musicgen
- open-source-stt-models
- speech-to-speech-translation
- speech-to-text
- stable-audio-3
- stable-audio
- suno
- stt-apis-comparison
- text-to-speech
- udio
- wavenet
- whisper-paper
- whisper