Audio & music generation
Generating music and audio from text (or reference audio). The generation branch of speech-audio-ai, distinct from speech synthesis (text-to-speech) in producing songs, instrumentals, and sound design rather than spoken words.
The field (2026)
- suno — the quality leader (v5, Elo ~1293); best vocal songs from one prompt; $2.45B valuation; the commercial flagship.
- udio — closest competitor; strong stems; the licensing-clean option post-UMG settlement.
- stable-audio — Stability AI; open-weight, instrumental, sound-design focus; the open wedge (stable-audio-3).
- Others: ElevenLabs Music (multilingual, modular regeneration) and AIVA (cinematic/score, full IP ownership).
What makes this branch different
- Copyright is the dominant axis, not just quality — see ai-music-copyright. Whether you can use the output legally varies more than how good it sounds (Suno litigation vs Udio settlement vs Stable Audio’s licensed data).
- Vocals vs instrumental split: the song generators (suno, udio, ElevenLabs Music) do vocals; the instrumental tools (open-weight stable-audio, and AIVA) don’t.
- A growing but contested market: ~$0.57B (2024) → ~$1.98B (2026), yet AI tracks show 25–40% lower save rates and 15–25% higher skip rates than human recordings (ai-music-generators-2026). For now the output works as a demo tool, not a finished product.
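The market figures in the last bullet imply a growth rate that is easy to misread at a glance. A minimal sketch of the arithmetic, assuming the two figures are comparable point estimates exactly two years apart; the helper name `implied_annual_growth` is made up for this note:

```python
def implied_annual_growth(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate between two values `years` apart."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return (end_value / start_value) ** (1 / years) - 1


# Market-size estimates quoted above, in USD billions (approximate).
market_2024 = 0.57
market_2026 = 1.98

total_multiple = market_2026 / market_2024
cagr = implied_annual_growth(market_2024, market_2026, years=2)

print(f"{total_multiple:.2f}x overall, {cagr:.0%} per year")
# prints "3.47x overall, 86% per year"
```

Roughly 3.5x in two years is fast growth from a small base, which is why the save and skip rates matter: revenue is rising while listener retention still trails human recordings.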
Shared with the rest of the wiki
Same open-vs-closed structure (stable-audio is the open wedge, as fish-audio-s2-pro is for TTS) and Elo-style ranking (tts-benchmarks) — but copyright, not license-terms alone, is the sharper constraint here.
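An Elo-style score such as suno's ~1293 is easier to interpret as a head-to-head preference probability. A minimal sketch, assuming the standard Elo logistic formula with a 400-point scale (the leaderboard's exact method is not stated in this note) and a hypothetical rival rated 1200, chosen only for illustration:

```python
def elo_expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected share of pairwise comparisons that A wins against B under standard Elo."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


suno_rating = 1293.0           # approximate figure quoted in this note
hypothetical_rival = 1200.0    # assumption: not a rating reported anywhere in this note

p_suno_preferred = elo_expected_score(suno_rating, hypothetical_rival)
print(f"suno preferred in about {p_suno_preferred:.0%} of pairwise comparisons")
```

Under these assumptions a gap of about 93 points means the higher-rated model is preferred in a bit under two thirds of comparisons, which is a clear lead but far from a sweep.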
Related
speech-audio-ai · ai-music-copyright · suno · udio · stable-audio · ai-music-generators-2026