Log — Speech & Audio AI Wiki

Append-only history. Each entry starts with ## [YYYY-MM-DD] <op> | <title> where <op> is ingest, query, lint, split, or broaden, so grep "^## \[" log.md | tail -5 works.
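A minimal sketch of the header convention above, assuming Python 3.9+ and that the op list in this header is exhaustive. The names `Entry`, `parse_header`, and `last_entries` are hypothetical helpers for illustration, not part of any wiki tooling.

```python
import re
from typing import List, NamedTuple, Optional

# Ops named in this log's header; anything else is treated as malformed (assumption).
VALID_OPS = {"ingest", "query", "lint", "split", "broaden"}

# Mirrors the documented shape: ## [YYYY-MM-DD] <op> | <title>
HEADER_RE = re.compile(r"^## \[(\d{4}-\d{2}-\d{2})\] (\w+) \| (.+)$")


class Entry(NamedTuple):
    date: str
    op: str
    title: str


def parse_header(line: str) -> Optional[Entry]:
    # Return an Entry for a well-formed header line, else None.
    match = HEADER_RE.match(line.rstrip("\n"))
    if match is None or match.group(2) not in VALID_OPS:
        return None
    return Entry(*match.groups())


def last_entries(text: str, n: int = 5) -> List[Entry]:
    # Roughly what the grep | tail pipeline yields, but stricter:
    # grep only checks the "## [" prefix, this also checks date shape and op.
    parsed = [parse_header(line) for line in text.splitlines()]
    return [entry for entry in parsed if entry is not None][-n:]


sample = (
    "## [2026-06-05] split | tts-wiki created\n"
    "body text\n"
    "## [2026-06-05] broaden | tts-wiki -> speech-audio-wiki\n"
)
for entry in last_entries(sample):
    print(entry.date, entry.op, entry.title)
```

A heading that lacks the `## ` prefix or uses an unlisted op returns `None`, which makes the sketch usable as a quick lint for malformed entries.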

## [2026-06-05] ingest | STT/ASR + audio/music generation sources (router-curated, user request)

User: “seek STT and audio generation sources.” Following the curate-at-request pattern, web-searched both landscapes and ingested 4 sources + 13 new pages, generalizing the wiki from TTS-only to all three branches of speech-audio-ai. Sources (URL-only, source: true):

open-source-stt-models (Northflank) — open STT/ASR + Open ASR Leaderboard WER.
stt-apis-comparison (Future AGI) — commercial STT APIs.
ai-music-generators-2026 (Chartlex) — music generators + copyright.
stable-audio-3 (MindStudio) — open-weight music generation.

New pages (13): concepts speech-audio-ai (new top umbrella), speech-to-text, audio-music-generation, ai-music-copyright; models whisper, canary-qwen (STT), suno, udio, stable-audio (music). Updated text-to-speech (now one branch of the umbrella) + index + synthesis (thesis generalized: the closed-vs-open / per-axis / LLM-convergence / rights dynamics now shown across all three branches; key new insight — SALM STT decoders are literally Qwen3 LLMs, widening the llm-providers-wiki bridge; and music’s binding axis is copyright, the sharpest form of the rights theme). Wiki 13 → 26 pages. Site rebuilt + verified. Remaining gaps noted: voice-cloning safety, non-English depth, audio-understanding, neutral latency.

## [2026-06-05] broaden | tts-wiki → speech-audio-wiki (user directive “broaden the domain”)

Hours after spin-out, the user directed broadening the domain. Widened scope from TTS only to all speech & audio AI models — TTS / speech synthesis (founding), STT / ASR (speech recognition), and audio / music generation — and renamed the spoke tts-wiki → speech-audio-wiki to match (the registry-vs-name drift rule; same play as the ai-search-wiki → search-marketing-wiki broaden+rename). Updated: directory mv; this spoke’s CLAUDE.md (title + domain header + boundary — STT/ASR & audio-gen now in scope, not parked), index.md (title + scope note), synthesis.md (title + thesis caveat + the former “out of scope” lines flipped to “in scope, not yet sourced”). Existing 13 TTS pages unchanged and still valid — TTS is now the deepest corner of a broader map. Slug-based wikilinks unaffected by the dir rename; no other wiki links into this one, so no external references to fix. Hub: wikis.md block renamed + broadened, hub log rename/broaden entry; spoke count unchanged (11). Clean rebuild + verify. Priority next sources: STT/ASR (Whisper/Deepgram) and audio/music gen (Suno/Udio/MusicGen) to generalize the thesis beyond TTS.

## [2026-06-05] split | tts-wiki created from _inbox `speech-audio-models` (4 sources, router-curated)

User asked the router to “seek for more tts models knowledge.” With the parked misotts as seed (1), the router web-searched the TTS landscape and curated 3 quality sources → 4 total → spun out tts-wiki (the speech-synthesis model market — the speech sibling of llm-providers-wiki). Scaffolded from CLAUDE.template.md; registered in wikis.md; hub spoke count 10 → 11.

Sources ingested (all URL-only, source: true):

misotts — Miso Labs 8B emotive (RVQ/Mimi; 110ms; modified-MIT) — migrated from _inbox; model+source.
tts-models-2026-benchmark — MarkTechPost benchmark comparison (proprietary + open-weight).
tts-arena-leaderboard — Artificial Analysis neutral Elo Speech Arena board.
open-source-tts-models — Modal open-weight deep-dive (Higgs, Kokoro, Dia, Chatterbox, Orpheus, CSM).

Pages created (13 total): 5 concepts (text-to-speech, tts-benchmarks, open-weight-tts, voice-cloning, neural-audio-codec) + 5 models (misotts, kokoro, fish-audio-s2-pro, orpheus, sesame-csm) + 3 source summaries (misotts doubles as model+source).

Synthesis thesis: the TTS market rhymes with the text-LLM market (closed Elo frontier vs. open-weight field) with three twists — license decides at the open ceiling (fish-audio-s2-pro is research-only), latency (TTFA) is a make-or-break axis, and TTS is converging on the LLM stack (Llama backbones + RVQ neural-audio-codec). Cross-wiki bridges to llm-providers-wiki (gemini straddles both as Gemini 3.1 Flash TTS; open-weight-models/quantization). Deleted the parked _inbox/misotts.md. STT/ASR + music-gen explicitly out of scope. Site rebuilt + verified. Flagged tension: the two leaderboard sources disagree slightly on Elo (snapshot drift), recorded in synthesis.

## [2026-06-09] ingest | Gemini 3.5 Live Translate (end-to-end speech-to-speech translation)

Hub-routed (Telegram, blog.google). New model/source page gemini-live-3-5-translate (SoftwareApplication, url) — Google’s end-to-end speech-to-speech translation audio model (70+ languages, 2000+ combinations, near-real-time/simultaneous, voice-preserving intonation/pacing/pitch, SynthID-watermarked; Gemini Live API + Google Translate + Meet, 2026-06). New concept speech-to-speech-translation (DefinedTerm) — the first composite/applied task in the wiki (STT+MT+TTS, streaming), added to the speech-audio-ai umbrella as a cross-branch task. Folded into synthesis (“composite tasks fuse the branches”; purest case of “LLM eats audio from both ends” — both ends in one model; latency-as-quality-tradeoff; SynthID introduces a provenance/watermark axis; gemini now spans TTS+STT+S2ST at the closed frontier). Cross-linked gemini (llm-providers, cross-wiki). Index updated. Caveat: vendor announcement; language/latency/voice claims not independently benchmarked. speech-audio-wiki 26 → 28 pages.

## [2026-06-09] ingest | +2 safety axis + origin (audio deepfake, WaveNet) — all-spokes cron test

Filled the flagged “safety/consent” coverage gap and added the field’s origin point: audio-deepfake (DefinedTerm, src — voice-cloning misuse: CEO-voice fraud, fake-Biden robocalls, non-consensual cloning; ASVspoof detection + SynthID watermarking + FCC robocall ban — the lived form of “rights outweigh quality”) and wavenet (SoftwareApplication, src — DeepMind 2016, dilated causal convolutions, the neural-audio ancestor; its sample-by-sample latency → Parallel WaveNet is the first instance of the spoke’s TTFA tension). Synthesis coverage-gap question updated. Both Wikipedia url-only. 28 → 30 pages.

## [2026-06-10] ingest | ElevenLabs + MusicGen — all-spokes pass (commercial archetype + open music reference)

Two pages at opposite ends of the closed-vs-open structure (spoke recently grown, kept to 2). elevenlabs (Organization, url, Wikipedia) — the leading commercial closed voice-AI company, referenced throughout but unpaged; spans all three branches (TTS/cloning/dubbing/Scribe STT/Eleven Music), $11B valuation Feb 2026; sits on both sides of the rights axis (cloning → audio-deepfake, yet also ships an AI Speech Classifier detector). musicgen (SoftwareApplication, url, Meta announcement — MusicGen/AudioCraft Wikipedia pages 404’d, substituted the official Meta source) — the open research reference for the music branch: single-stage transformer over EnCodec RVQ tokens, 300M/1.5B/3.3B, melody-conditioned; MIT code but CC-BY-NC (non-commercial) weights = open-but-not-shippable, the music echo of TTS’s research-license ceiling; clean-rights training data (clear of ai-music-copyright suits hitting suno). Folded into synthesis (new 2026-06-10 section) + index (new Organization group + music row). No contradictions. 30 → 32 pages.

## [2026-06-12] ingest | Audio Flamingo 3 (NVIDIA) — audio-understanding LALM

All-spokes daily expansion. Added audio-flamingo-3 (@type SoftwareApplication) — the wiki’s first audio-understanding model, opening a fourth branch (comprehension) beside TTS/STT/music. A fully-open Large Audio-Language Model that reasons over speech+sound+music (AF-Whisper unified encoder, on-demand CoT, ~10-min audio, voice-to-voice; SOTA on 20+ benchmarks; trained on open data). Completes “the LLM eats audio from both ends” (the comprehension vertex) and re-confirms “license, not score, decides” (open weights+data but non-commercial — same ceiling as fish-audio-s2-pro/musicgen). Seeds the flagged “audio understanding beyond transcription” gap. synthesis “fourth branch” note; index gains an Audio-understanding row. 1 new page. Caveat: vendor-reported SOTA, young benchmarks, non-commercial license.

## [2026-06-13] ingest | ElevenLabs Expressive Mode / Conversational AI (join.elevenlabs.io)

Routed from hub (Telegram drop). ElevenLabs’ real-time conversational-voice-agent product: Eleven V3 Conversational TTS + Scribe v2 Realtime STT, emotion inferred from prosody (“tone cue cards”), 70+ languages, “ultra-low latency.” Quality gate: tier T3 (vendor landing/signup ad — CSAT/NPS/latency claims unverified; model names are first-party specs); elevenlabs already paged → new source summary elevenlabs-expressive-mode (not a dedup of the Org page); gap-relevance — advances the open “how real are latency claims?” question and adds an emotion/affect quality axis; integrated as the second composite task beside S2ST in synthesis + a new product bullet on elevenlabs. freshness: volatile. url provenance. Site rebuild + commit follow.

## [2026-06-14] ingest | GPT-Realtime-2 / OpenAI WebRTC Audio Session (simonwillison.net)

Routed from hub (Telegram drop). OpenAI’s GPT-Realtime-2 voice model for the Realtime API — speech-to-speech over WebRTC, “first voice model with GPT-5-class reasoning,” plus a document-context feature (paste text, explore by voice). Quality gate: tier T4 (Simon Willison personal-blog demo write-up; primary is OpenAI’s own release); not previously paged → new model page gpt-realtime-2. Gap-relevance: third entrant in the real-time conversational-voice thread (after elevenlabs-expressive-mode + gemini-live-3-5-translate) — the loop collapsed into one end-to-end model carrying frontier reasoning; adds a grounding axis (document context). Integrated into synthesis (conversational composite thread). Cross-spoke: OpenAI text frontier → llm-providers; RAG → research-wiki (noted, not paged). freshness: volatile. Site rebuild + commit follow.

## [2026-06-15] ingest | Whisper primary paper (Radford et al. 2022) — T1 anchor for the STT baseline

Quality cycle, T1-floor raise. whisper was a concept page sourced only from comparison roundups (open-source-stt-models, stt-apis-comparison, T3/T4); added the primary paper as a source: whisper-paper (ScholarlyArticle, T1) — Radford, Kim, Xu, Brockman, McLeavey, Sutskever, Robust Speech Recognition via Large-Scale Weak Supervision, arXiv:2212.04356 (ICML 2023). Grounds why Whisper is the multilingual default: weak-supervision-at-scale (680k hours), zero-shot transfer with no fine-tuning, human-approaching robustness, MIT release — not WER leadership. Threaded into synthesis (“LLM eats audio from both ends” recurring read: SALM successors beat it on accuracy yet can’t displace the weak-supervision baseline). Concept/paper split mirrors optimization (NFL) + llm-inference (FlashAttention). Found via WebSearch; figures from the abstract, architecture from established knowledge (noted on page). Linked from concept page, synthesis, index (new ScholarlyArticle section). 1 new page.