speech-audio-wiki

Synthesis — Speech & Audio AI

The evolving thesis. Spun out as tts-wiki from the hub _inbox speech-audio-models cluster on 2026-06-05 — seeded by the parked misotts and grown with three TTS field sources the router curated at the user’s request — then broadened to all speech & audio AI and renamed speech-audio-wiki the same day, and immediately grown with four more router-curated sources covering STT/ASR and audio/music generation. The wiki now spans the three production branches of speech-audio-ai — synthesis, recognition, generation — plus an emerging fourth, audio comprehension (audio-flamingo-3); TTS is still the deepest corner.

Current thesis

The 2026 speech & audio AI market — across text-to-speech (synthesis), speech-to-text (recognition), and audio-music-generation — rhymes with the text-LLM market (llm-providers-wiki): a closed frontier leading on the headline metric, an open-weight field competing on cost, control, and licensing. The same four dynamics recur in all three branches (see speech-audio-ai):

A closed frontier on the headline metric — but a thin, contested lead. Proprietary leaders top each branch’s ranking: TTS Elo (tts-arena-leaderboard: Fun-Realtime-TTS, gemini 3.1 Flash TTS, Cartesia Sonic), STT WER (ElevenLabs Scribe ~3.3% EN, Deepgram 5.26% — stt-apis-comparison), music Elo (suno v5 ~1293). Yet the open field is close behind and closing, and on STT the open leader canary-qwen (5.63%) nearly matches the commercial APIs.
License/rights — not just the score — decides. The constraint sharpens as you move across branches: TTS has a research-license ceiling (fish-audio-s2-pro is the top open model but paid-commercial); STT is mostly permissive (whisper MIT, Canary CC-BY); music escalates to outright copyright war (ai-music-copyright: suno in Sony litigation vs the clean udio / stable-audio). The best-sounding option is often the least legally safe.
“Best” is multi-axis and weekly-volatile. Each branch sorts by its own binding constraint — quality (Elo/MOS), accuracy (WER/CER), latency (TTFA / streaming RTFx), languages, capabilities, cost (tts-benchmarks). “No single model wins.” Every figure here is a dated snapshot.
Convergence on the LLM stack. TTS rides Llama backbones + RVQ neural-audio-codec; STT’s accuracy leaders are SALM models bolting an LLM decoder onto a speech encoder (canary-qwen = FastConformer + Qwen3; Granite-Speech; Qwen3-ASR). This is “speech as language modeling” in both directions, and the structural bridge to llm-providers-wiki (gemini straddles). A fourth branch, audio comprehension, now completes the pattern: audio-flamingo-3 (NVIDIA, a fully-open Large Audio-Language Model) doesn’t transcribe or synthesize; it reasons over speech, sound, and music (unified AF-Whisper encoder, on-demand chain-of-thought, ~10-min audio), with SOTA on 20+ understanding benchmarks. Where codec-token TTS and SALM STT were the production directions of “the LLM eats audio from both ends,” AF3 is the pure comprehension vertex. It also re-confirms the second dynamic above (license, not score, decides): open weights and data, but a non-commercial research license, the same shippability ceiling as fish-audio-s2-pro and musicgen.
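The STT headline metric cited above, WER, is conventionally the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The sketch below is a minimal illustration under that standard definition. The function name is hypothetical, and it only lowercases and splits on whitespace, whereas published leaderboards typically apply fuller text normalization first, so it would not reproduce the figures quoted here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # consumed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            )
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of six reference words.
print(round(word_error_rate("the cat sat on the mat", "the cat sat on a mat"), 4))  # 0.1667
```

Because the denominator is the reference length, the score depends on the test set and the normalization rules, which is one reason WER figures from different sources are hard to compare directly.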

Unifying tension: as in text, capability concentrates at a few closed providers while cost and access are democratized from below by open weights — but audio adds two twists text rarely faces: latency (TTFA / RTFx) as a make-or-break product axis, and rights (research-license ceilings, voice-cloning consent, music copyright) as a constraint that can outweigh quality outright.
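The two latency measures named here can be made concrete with a small sketch. It assumes the usual definitions: RTFx is seconds of audio processed per second of compute (the inverse real-time factor), and TTFA is the wall-clock delay until the first non-empty audio chunk arrives from a streaming synthesizer. All names below (rtfx, time_to_first_audio, fake_tts_stream) are hypothetical, and the fake stream stands in for a real provider client.

```python
import time
from typing import Callable, Iterable, Iterator


def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor: seconds of audio handled per second of compute."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def time_to_first_audio(
    chunks: Iterable[bytes],
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Seconds from the start of iteration until the first non-empty chunk."""
    start = clock()
    for chunk in chunks:
        if chunk:  # skip empty keep-alive chunks
            return clock() - start
    raise RuntimeError("stream ended before any audio arrived")


def fake_tts_stream() -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS response."""
    time.sleep(0.05)  # simulated model and network delay before first audio
    yield b"\x00\x01"
    yield b"\x02\x03"


# One hour of audio transcribed in two seconds of compute.
print(rtfx(3600.0, 2.0))  # 1800.0
print(f"TTFA: {time_to_first_audio(fake_tts_stream()) * 1000:.0f} ms")
```

A measurement like this only means something alongside its conditions (hardware, network path, input length, warm or cold start), which is the gap the vendor-reported TTFA figures leave open.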

Recurring reads

Efficiency vs. controllability split: the smallest models (kokoro 82M) drop voice-cloning to keep their footprint small; larger models add cloning/emotion/streaming. Capability tends to cost parameters. (STT mirror: distilled/streaming models like Parakeet/Distil-Whisper trade some accuracy for huge RTFx.)
The LLM is eating audio from both ends: codec-token TTS (neural-audio-codec: misotts’s Mimi, fish-audio-s2-pro’s dual-AR+RVQ) and SALM STT (canary-qwen’s Qwen3 decoder). Both directions are now “next-token over audio/text,” importing llm-providers-wiki’s architectures. The STT baseline they build past, whisper, is now primary-grounded (whisper-paper, Radford et al. 2022): its dominance traces to weak-supervision-at-scale (680k hours, zero-shot), not to topping WER, which is exactly why the SALM successors can beat it on accuracy yet not displace it.
Latency is a product axis — TTFA (~82ms Sonic, ~90ms Aura-2, 110ms MisoTTS) and STT RTFx (Parakeet >2,000×) gate real-time agents — a constraint with no clean text-LLM analog.
The open wedge appears in every branch: kokoro/fish-audio-s2-pro (TTS), whisper/canary-qwen (STT), stable-audio/musicgen (music). In each, the self-hostable option trades headline polish for cost/control/clean-licensing. musicgen (Meta AudioCraft) is the open music reference: a single-stage transformer over EnCodec RVQ tokens (the music-branch instance of the LLM-stack convergence). Its code is MIT but its weights are CC-BY-NC, so it is open-weight yet not commercially shippable, the music echo of TTS’s research-license ceiling, with a clean-rights training set that keeps it clear of the litigation hitting suno.
The closed-frontier archetype is elevenlabs: the commercial leader the spoke keeps citing (Scribe WER, TTS Elo) and the rare provider spanning all branches (TTS + voice cloning + dubbing + Scribe STT + Eleven Music), at an $11B valuation (Feb 2026). It sits on both sides of the rights axis: its cloning powers audio-deepfake fraud, yet it ships an AI Speech Classifier to detect synthetic audio.
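The RVQ tokens mentioned in these reads come from residual vector quantization: each stage picks the nearest codebook entry to whatever the previous stages left unexplained, so one audio frame becomes a short tuple of integer codes that a language model can predict. The sketch below is a toy illustration of that idea only. The codebooks are hand-written rather than learned, the function names are hypothetical, and real codecs such as Mimi or EnCodec wrap this step in learned encoder and decoder networks that are not shown.

```python
from typing import List, Sequence

Vector = Sequence[float]
Codebook = Sequence[Vector]


def nearest(codebook: Codebook, vector: Vector) -> int:
    """Index of the codebook entry with the smallest squared distance."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vector)),
    )


def rvq_encode(vector: Vector, codebooks: Sequence[Codebook]) -> List[int]:
    """Quantize stage by stage, passing the leftover residual to the next stage."""
    residual = list(vector)
    codes = []
    for codebook in codebooks:
        index = nearest(codebook, residual)
        codes.append(index)
        residual = [r - c for r, c in zip(residual, codebook[index])]
    return codes


def rvq_decode(codes: Sequence[int], codebooks: Sequence[Codebook]) -> List[float]:
    """Reconstruct by summing the chosen entry from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for index, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[index])]
    return out


# Toy, hand-picked codebooks (assumption: real ones are learned from audio).
COARSE = [[0.0, 0.0], [1.0, 1.0], [-1.0, -1.0]]
FINE = [[0.25, 0.0], [0.0, 0.25], [-0.25, 0.0], [0.0, -0.25]]

codes = rvq_encode([1.25, 1.0], [COARSE, FINE])
print(codes)                              # [1, 0]
print(rvq_decode(codes, [COARSE, FINE]))  # [1.25, 1.0]
```

The first code carries the coarse shape and later codes refine it, which is why codec-token models can treat the code stream as a vocabulary for next-token prediction.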

Open questions

Does the open ceiling go permissive? If a future Apache-2.0/MIT model matches fish-audio-s2-pro’s Elo, the research-license ceiling collapses. This is the key thing to watch.
How real are the latency claims? TTFA figures are largely vendor-reported under unstated conditions; a neutral latency benchmark would be high-value.
Remaining coverage gaps. Safety/consent for voice-cloning is now sourced (2026-06-09): audio-deepfake adds the misuse/consent/detection axis (CEO-voice fraud, fake-Biden robocalls, non-consensual cloning; ASVspoof detection + SynthID watermarking + the FCC robocall ban). It is the dark mirror of the cloning capability, and the lived form of “rights can outweigh quality.” Also added: wavenet (DeepMind 2016), the origin point of neural audio the whole stack descends from; its sample-by-sample latency problem is the first instance of the spoke’s TTFA tension. Still thin: non-English depth, audio understanding beyond transcription (audio LLMs; so far only audio-flamingo-3), and a neutral latency benchmark.
Does the open frontier flip? Watch whether a permissive open model takes a branch’s headline metric — closest on STT (canary-qwen 5.63% vs commercial ~3–5%), furthest on music (closed suno well ahead). The TTS research-license ceiling (fish-audio-s2-pro) is the other thing to watch.
Snapshot drift: rankings disagree across sources (TTS Elo: Gemini 1216 vs 1217; STT WER varies by test set) — treat all as dated.

Contradictions / tensions

Leaderboard disagreement (minor). tts-models-2026-benchmark and tts-arena-leaderboard report slightly different Elo values and orderings (e.g. whether Fun-Realtime-TTS or Gemini leads). Not a fact conflict — different snapshot dates and vote pools. Recorded, not resolved; both cited with dates.
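To see why a one- or two-point Elo difference is better read as noise than as a ranking, it helps to convert a rating gap into an expected head-to-head win rate. The sketch below assumes the classic logistic Elo formula with a 400-point scale; the arenas cited here may use a different rating model (for example a Bradley-Terry fit), so this is an illustration of scale, not a reconstruction of their method.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected win probability of A over B under the classic Elo formula."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


# A one-point gap, the size of the disagreement between the two sources.
print(f"{elo_expected_score(1217, 1216):.4f}")  # 0.5014

# For contrast, a 100-point gap.
print(f"{elo_expected_score(1300, 1200):.4f}")  # 0.6401
```

Under that assumption a one-point lead corresponds to winning roughly 50.1% of pairwise votes, well inside what a different vote pool or snapshot date could move.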

Composite tasks fuse the branches (added 2026-06-09)

The three branches aren’t only sold separately — they compose. speech-to-speech-translation (S2ST) is the first applied/composite task in the wiki: STT + machine translation + TTS in one pipeline. Its 2026 exemplar, gemini-live-3-5-translate (Google; 70+ languages, 2000+ pairs, near-real-time, voice-preserving, SynthID-watermarked), is the purest case of “the LLM eats audio from both ends” — both ends in one streamed, end-to-end model. Two threads it sharpens: (1) latency becomes a quality trade-off, not just a number — simultaneous translation must choose between waiting for context and translating immediately; (2) provenance/watermarking (SynthID on all output) enters the wiki as an audio-safety axis (alongside voice-cloning consent and music copyright). It also extends the closed-frontier read: gemini now leads (by its own framing) at TTS, plays at STT, and stakes the S2ST frontier — a single proprietary provider spanning all of it.
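The composition described above (STT + machine translation + TTS in one pipeline) can be sketched as three pluggable stages. This is the decomposed, cascade view of the task, not how an end-to-end model such as gemini-live-3-5-translate works internally. Every name here is hypothetical, and the toy stages only pass UTF-8 text through as if it were audio so the example runs without any model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CascadeS2ST:
    """Speech-to-speech translation as a cascade of three stages."""

    transcribe: Callable[[bytes], str]   # STT: source audio -> source text
    translate: Callable[[str], str]      # MT: source text -> target text
    synthesize: Callable[[str], bytes]   # TTS: target text -> target audio

    def __call__(self, audio: bytes) -> bytes:
        source_text = self.transcribe(audio)
        target_text = self.translate(source_text)
        return self.synthesize(target_text)


# Toy stand-ins (assumption): "audio" is just UTF-8 bytes of the spoken words.
LEXICON = {"hello": "hola", "world": "mundo"}


def toy_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def toy_translate(text: str) -> str:
    return " ".join(LEXICON.get(word, word) for word in text.split())


def toy_synthesize(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = CascadeS2ST(toy_transcribe, toy_translate, toy_synthesize)
print(pipeline(b"hello world"))  # b'hola mundo'
```

In a cascade each stage waits for the previous one to finish, so latencies add up and source-voice characteristics are lost at the text bottleneck; a streamed end-to-end model is one way to avoid both, at the cost of the per-stage swap-ability shown here.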

Second composite (2026-06-13): the real-time conversational agent. elevenlabs-expressive-mode (ElevenLabs’ Conversational AI) is the dialogue sibling of S2ST’s translation loop: Scribe v2 Realtime STT → reasoning → Eleven V3 Conversational TTS in one low-latency turn-by-turn loop. It adds a genuinely new axis to the spoke — emotion/affect: the agent infers emotion from prosody (pitch, pacing, exclamations) and applies “tone cue cards,” making paralinguistic understanding the headline differentiator rather than fidelity or WER. Two thesis hooks: (1) it confirms the branches compose into products — leaders increasingly sell the integrated loop, not components; (2) another “ultra-low latency” claim with unstated conditions (T3 vendor page) — more weight on the still-open call for a neutral latency benchmark. Caveat: a marketing landing page; CSAT/latency figures unverified.

Third entrant (2026-06-14): a single-model frontier voice agent. gpt-realtime-2 (OpenAI’s Realtime-API voice model, “GPT-5-class reasoning,” speech-to-speech over WebRTC) is the conversational loop collapsed into one end-to-end model: the gemini-live-3-5-translate shape aimed at general dialogue rather than translation, and the contrast to ElevenLabs’ assembled STT→LLM→TTS pipeline. It pushes the “LLM eats audio from both ends” read to its limit: a speech-to-speech network carrying the reasoning of the text frontier, not just its tokens. Its document-context feature also adds a grounding axis to spoken dialogue (answer by voice about pasted text), bringing retrieval into the voice loop. So the real-time-voice frontier now has three closed entrants (Google, ElevenLabs, OpenAI). Caveat: secondhand practitioner source (T4), no latency/quality numbers.

Cross-spoke adjacency

llm-providers-wiki — the text/multimodal LLM market. This spoke is its speech sibling: shared architecture (Llama backbones, open-weight-models dynamics, quantization/codec ideas) and the same closed-vs-open structure. gemini straddles both — Gemini 3.1 Flash TTS is a TTS leader and a Gemini model; linked cross-wiki, not duplicated.
llm-inference-wiki — TTS latency/streaming and codec decoding are an inference-mechanics story; adjacent but not yet bridged.
Now covered (after the 2026-06-05 broaden + ingest): STT/ASR (speech-to-text: whisper, canary-qwen) and audio/music generation (audio-music-generation: suno, udio, stable-audio) are sourced and integrated under the speech-audio-ai umbrella. The cross-wiki bridge to llm-providers-wiki widened: SALM STT decoders are literally Qwen3 LLMs.

Index — Speech & Audio AI Wiki

Catalog of every page, grouped by schema.org @type. Spine: synthesis (thesis), log.md (history), this file (catalog). Domain = speech & audio AI models across three branches — TTS (synthesis · deepest), STT/ASR (recognition), audio/music generation. Some wiki-links resolve cross-wiki to llm-providers-wiki (gemini, open-weight-models, quantization) — intentional bridge links. Elo / WER / latency facts are dated snapshots.

DefinedTerm (concepts)

speech-audio-ai — top umbrella: the three branches + the cross-cutting dynamics · domain
TTS: text-to-speech (synthesis branch) · tts-benchmarks (Elo/MOS/WER/TTFA) · open-weight-tts (self-hostable segment) · voice-cloning (speaker cloning) · neural-audio-codec (RVQ audio tokens)
STT: speech-to-text (recognition branch; WER, Open ASR Leaderboard, SALM trend)
Music/audio: audio-music-generation (generation branch) · ai-music-copyright (the licensing/litigation fault line)
Composite/applied: speech-to-speech-translation (S2ST — STT+MT+TTS, streaming; spans the branches)
Safety: audio-deepfake (voice-cloning misuse: fraud/disinfo/consent; detection + watermarking) · source

SoftwareApplication (models)

TTS: misotts (Miso 8B emotive; RVQ/Mimi · source) · kokoro (82M efficiency leader) · fish-audio-s2-pro (5B; highest-Elo open) · orpheus (Llama family; cloning+streaming) · sesame-csm (1B conversational) · wavenet (DeepMind 2016; the foundational neural-audio ancestor · source)
STT: whisper (OpenAI; MIT; ecosystem default) · canary-qwen (NVIDIA; tops Open ASR Leaderboard 5.63%; SALM)
Music: suno (leader, Elo ~1293; Sony litigation) · udio (licensing-clean; UMG deal) · stable-audio (Stability AI; open-weight, instrumental) · musicgen (Meta AudioCraft; open research reference; MIT code / CC-BY-NC weights · source)
S2ST: gemini-live-3-5-translate (Google; end-to-end speech-to-speech translation, 70+ langs, voice-preserving, SynthID · source)
Conversational S2S: gpt-realtime-2 (OpenAI; Realtime-API speech-to-speech, GPT-5-class reasoning, WebRTC, document-context grounding · source · T4)
Audio understanding (LALM): audio-flamingo-3 (NVIDIA; reasons over speech+sound+music; AF-Whisper encoder, CoT, 10-min audio; open weights/data, non-commercial; the 4th branch · source)

Organization (providers)

elevenlabs — leading commercial voice-AI company; TTS + cloning + dubbing + Scribe STT + Eleven Music; closed; $11B (2026) · source · wikipedia

ScholarlyArticle (sources)

whisper-paper — Radford et al., Robust Speech Recognition via Large-Scale Weak Supervision (OpenAI, 2022); the primary behind whisper — 680k hours, zero-shot, no fine-tuning · source · T1 · arxiv.org

WebPage (sources)

elevenlabs-expressive-mode — ElevenLabs Conversational AI / Expressive Mode: V3 Conversational TTS + Scribe v2 Realtime STT; emotion-from-prosody; the real-time conversational composite · source · T3 · join.elevenlabs.io

Article / BlogPosting / Dataset (sources)

tts-models-2026-benchmark — MarkTechPost: TTS benchmark comparison (proprietary + open) · source · marktechpost.com
tts-arena-leaderboard — Artificial Analysis: neutral TTS Elo Speech Arena board · source · artificialanalysis.ai
open-source-tts-models — Modal: open-weight TTS deep-dive (Higgs, Kokoro, Dia, Chatterbox, Orpheus, CSM) · source · modal.com
open-source-stt-models — Northflank: open-source STT/ASR + Open ASR Leaderboard WER · source · northflank.com
stt-apis-comparison — Future AGI: commercial STT APIs (Deepgram, AssemblyAI, Chirp, ElevenLabs) · source · futureagi.com
ai-music-generators-2026 — Chartlex: music generators + the copyright fault line · source · chartlex.com
stable-audio-3 — MindStudio: Stable Audio 3.0 open-weight music generation · source · mindstudio.ai

Synthesis

synthesis — the thesis: across all three branches, a closed frontier vs. an open-weight wedge; rights/latency as audio-specific axes; convergence on the LLM stack

Bridge nodes (live in sibling wikis, linked cross-wiki)

gemini · open-weight-models · quantization (llm-providers-wiki)