Audio Flamingo 3
The wiki’s first audio-understanding model — closing the flagged gap “audio understanding beyond transcription (audio LLMs).” AF3 (NVIDIA, July 2025) is a fully open Large Audio-Language Model (LALM): it doesn’t transcribe (STT) or synthesize (TTS) — it reasons about audio (speech, environmental sound, and music) in natural language. This adds a fourth branch to speech-audio-ai beside synthesis / recognition / generation: comprehension.
What it does
- Unified audio reasoning across speech, sound, and music via AF-Whisper, a single encoder trained for joint representation over all three (most models specialize in one).
- On-demand chain-of-thought “thinking” — it can reason step-by-step about audio before answering.
- Long-audio understanding up to ~10 minutes, multi-turn / multi-audio chat, and voice-to-voice interaction.
- Reported SOTA on 20+ audio-understanding and reasoning benchmarks, trained only on open-source audio data with a five-stage curriculum (datasets: AudioSkills-XL, LongAudio-XL, AF-Think, AF-Chat).
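The capabilities above can be read as a pipeline: audio is cut into windows, each window is encoded, and the resulting embeddings are placed in front of the text prompt for the LLM. The following is a minimal sketch of that shape only, not AF3's implementation. The 30-second window and 16 kHz sample rate (Whisper's native input format), the toy `encode_window` function, and the embedding size are all assumptions for illustration.

```python
import math

SAMPLE_RATE = 16_000   # assumption: Whisper-style 16 kHz mono input
WINDOW_SECONDS = 30    # assumption: Whisper-style 30 s windows
EMBED_DIM = 4          # toy size, far smaller than any real model


def split_into_windows(samples, sample_rate=SAMPLE_RATE, window_seconds=WINDOW_SECONDS):
    """Cut a flat list of samples into consecutive non-overlapping windows."""
    size = sample_rate * window_seconds
    return [samples[i:i + size] for i in range(0, len(samples), size)]


def encode_window(window):
    """Hypothetical stand-in for an audio encoder: one toy embedding per window."""
    mean = sum(window) / len(window)
    rms = math.sqrt(sum(x * x for x in window) / len(window))
    return [mean, rms, max(window), min(window)]


def build_llm_input(samples, prompt_embeddings,
                    sample_rate=SAMPLE_RATE, window_seconds=WINDOW_SECONDS):
    """Prefix the text-prompt embeddings with one audio embedding per window."""
    windows = split_into_windows(samples, sample_rate, window_seconds)
    audio_embeddings = [encode_window(w) for w in windows]
    return audio_embeddings + prompt_embeddings


def windows_needed(duration_seconds, window_seconds=WINDOW_SECONDS):
    """Number of windows required to cover a clip of the given duration."""
    return math.ceil(duration_seconds / window_seconds)
```

Under these assumptions, a 10-minute clip needs `windows_needed(10 * 60)` = 20 windows. A real encoder emits many embeddings per window, not one, so the actual sequence the LLM sees is much longer.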
Why it matters here
- Completes the “LLM eats audio from both ends” thesis. The wiki’s synthesis tracks codec-token TTS and SALM STT (canary-qwen); AF3 is the pure comprehension vertex: an LLM whose input modality is arbitrary audio and whose output is reasoning. “Speech as language modeling” generalizes to “audio as language modeling.”
- Same closed-vs-open / rights structure, again. AF3 has open weights but a non-commercial research license: the same research-license ceiling the wiki flags in TTS and the CC-BY-NC ceiling in music. The best open option is once more the least commercially shippable (“license, not score, decides”), now in the understanding branch too.
- Bridges directly to llm-providers-wiki: an LALM is an LLM with an audio front-end, and AF3’s openness and curriculum training mirror the open-weight-models dynamics of the text market.
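The “license, not score, decides” rule can be stated as a two-step selection: first filter by whether the license permits the intended use, then rank what remains by score. The sketch below is illustrative only. The `Candidate` type, the `pick_model` helper, and any names or scores passed to them are hypothetical placeholders, not measured results or entries from the wiki.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str             # hypothetical model name
    score: float          # placeholder benchmark score, not a real result
    commercial_use: bool  # True if the license permits commercial deployment


def pick_model(candidates, need_commercial):
    """Filter by license first, then take the best score among what is left."""
    eligible = [c for c in candidates if c.commercial_use or not need_commercial]
    # default=None covers the case where no candidate passes the license filter
    return max(eligible, key=lambda c: c.score, default=None)
```

With a higher-scoring research-only candidate and a lower-scoring permissive one, `pick_model(..., need_commercial=True)` returns the permissive one: the license filter runs before the score is ever compared.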
Caveat
SOTA claims are self-reported by the vendor (NVIDIA) on its own and standard benchmarks. “Fully open” means open weights and data, but the non-commercial license limits production use. Audio-understanding benchmarks are young and contested (cf. SonicBench-style critiques of LALM physical-perception limits).
Related
speech-audio-ai · speech-to-text · canary-qwen · whisper · fish-audio-s2-pro · musicgen · neural-audio-codec