Software Application · updated 2026-06-12

Audio Flamingo 3

The wiki’s first audio-understanding model, closing the flagged gap “audio understanding beyond transcription (audio LLMs).” AF3 (NVIDIA, July 2025) is a fully open Large Audio-Language Model (LALM): rather than transcribing (STT) or synthesizing (TTS), it reasons about audio (speech, environmental sound, and music) in natural language. This adds a fourth branch, comprehension, to speech-audio-ai beside synthesis, recognition, and generation.

What it does

Why it matters here

Caveat

The SOTA claims are self-reported by the vendor (NVIDIA) on its own and standard benchmarks. “Fully open” here means open weights and data, but the non-commercial license limits production use. Audio-understanding benchmarks are young and contested (cf. SonicBench-style critiques of LALM physical-perception limits).

speech-audio-ai · speech-to-text · canary-qwen · whisper · fish-audio-s2-pro · musicgen · neural-audio-codec