
speech-audio-wiki

Defined Term · domain · updated 2026-06-05 (UTC)

Speech & audio AI (umbrella)

The top-level subject of this wiki: AI models that produce or interpret speech and audio. Three branches, plus the cross-cutting dynamics that unify them.

The three branches

text-to-speech · speech-to-text · audio-music-generation

Composite / applied task (spans the branches): speech-to-speech-translation

Cross-cutting dynamics (the shared thesis)

The same structure recurs across all three:

  1. Closed frontier vs. open-weight field: proprietary leaders edge out open weights on each branch's headline metric (TTS Elo, STT WER, music Elo), but open models make up the difference on cost and control (open-weight-tts, stable-audio, whisper).
  2. Per-axis evaluation: “best” is decided by whichever constraint binds, whether quality (Elo/MOS), accuracy (WER/CER), latency (time to first audio, TTFA; streaming inverse real-time factor, RTFx), language coverage, capabilities, or cost (tts-benchmarks).
  3. Convergence on the LLM stack: TTS rides on Llama/codec backbones (neural-audio-codec); STT fuses speech encoders with LLM decoders (SALM: canary-qwen, Qwen3-ASR). This “speech as language modeling” pattern is the structural bridge to llm-providers-wiki; gemini straddles both wikis, appearing here as Gemini 3.x Flash TTS.
  4. License and consent as first-class constraints: research-only licenses cap what can be deployed (fish-audio-s2-pro), music generation is entangled in the copyright wars (ai-music-copyright), and voice cloning adds impersonation risk.
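
Two of the axes named in item 2 reduce to short formulas, so a minimal sketch may help make them concrete. It assumes the usual definitions: WER is the word-level edit distance (substitutions + deletions + insertions) divided by the reference word count, and RTFx is seconds of audio processed per second of compute. The function names `word_error_rate` and `rtfx` are illustrative, not taken from any benchmark harness, and the sketch only splits on whitespace, whereas published leaderboards typically normalize case and punctuation before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count.

    Assumption: plain whitespace tokenization, no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match (cost 0) or substitution (cost 1)
            )
        prev = curr
    return prev[-1] / len(ref)


def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor: seconds of audio handled per second of compute."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


if __name__ == "__main__":
    # One dropped word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
    # 60 s of audio transcribed in 2 s of compute.
    print(rtfx(60.0, 2.0))
```

Lower WER is better and higher RTFx is better, which is one reason a single leaderboard number rarely settles which model is “best”: a model can lead on one axis and trail on the other. Note also that WER can exceed 1.0 when the hypothesis contains many insertions.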

text-to-speech · speech-to-text · audio-music-generation · speech-to-speech-translation · tts-benchmarks · ai-music-copyright · open-weight-tts