Fish Audio S2 Pro
Fish Audio's S2 Pro is the highest-ranked open-weight text-to-speech model on the tts-arena-leaderboard (Elo ~1123–1128). It is a 5B-parameter model.
Profile
- Architecture: Dual-Autoregressive with an RVQ neural-audio-codec (the same codec-token family as misotts).
- Scale: trained on 10M+ hours across 80+ languages; ~3.5% WER / 1.2% CER on English, vendor-reported (see tts-models-2026-benchmark).
- Supports voice-cloning.
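For readers who want to sanity-check the vendor-reported WER / CER figures on their own transcripts, the sketch below shows how the two metrics are commonly computed: edit distance between reference and recognized text, divided by reference length. It assumes plain whitespace tokenization and no text normalization; the vendor's actual evaluation pipeline is not described in this note, so numbers produced this way may not match theirs. The function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over the reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance over the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

In a TTS evaluation the hypothesis is usually the output of a speech recognizer run over the synthesized audio, so the score also absorbs the recognizer's own errors.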
The licensing catch
Its weights are open but released under a research license; commercial use requires a separate paid license. The best-scoring open model is therefore not freely usable in commercial products, which is exactly why open-weight-tts treats license as a first-class axis: teams that need permissive terms drop to Apache-2.0/MIT models (kokoro, orpheus, sesame-csm) and accept a lower Elo.
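Treating license as a first-class axis can be sketched as a filter applied before ranking. In the example below, every Elo value except the S2 Pro figure (taken as the midpoint of the range quoted above) is a made-up placeholder, and the license strings for the permissive models are illustrative and should be checked against each model card. `Candidate`, `PERMISSIVE`, and `best_model` are hypothetical names.

```python
from dataclasses import dataclass

# Licenses this sketch treats as safe for commercial use (an assumption).
PERMISSIVE = {"Apache-2.0", "MIT"}


@dataclass(frozen=True)
class Candidate:
    name: str
    license: str
    elo: float


def best_model(candidates, commercial):
    """Return the highest-Elo candidate, filtering on license first when commercial use is required."""
    pool = [c for c in candidates if not commercial or c.license in PERMISSIVE]
    return max(pool, key=lambda c: c.elo, default=None)


CANDIDATES = [
    Candidate("fish-audio-s2-pro", "research", 1125.0),  # midpoint of the ~1123-1128 range
    Candidate("kokoro", "Apache-2.0", 1050.0),           # placeholder Elo, not a leaderboard value
    Candidate("orpheus", "Apache-2.0", 1040.0),          # placeholder Elo, not a leaderboard value
    Candidate("sesame-csm", "Apache-2.0", 1030.0),       # placeholder Elo, not a leaderboard value
]

research_pick = best_model(CANDIDATES, commercial=False)
commercial_pick = best_model(CANDIDATES, commercial=True)
```

With these inputs the research pick is S2 Pro, while the commercial pick falls to whichever permissive model has the highest rating, which is the trade-off described above.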
Place in the field
Fish Audio S2 Pro marks the open-weight ceiling in 2026: close enough to pressure the closed frontier on quality, but still a notch below the API leaders (gemini Flash TTS, Cartesia Sonic) and encumbered for commercial use.
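To make "a notch below" concrete, an Elo gap can be converted into an expected head-to-head preference rate. The sketch below assumes the leaderboard uses the standard logistic Elo model with a 400-point scale, which this note does not confirm. The leader rating is a hypothetical value chosen only to illustrate the arithmetic; it is not a leaderboard figure.

```python
def expected_score(rating_a, rating_b, scale=400.0):
    """Probability that A is preferred over B under the standard logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


S2_PRO_ELO = 1125.0               # midpoint of the ~1123-1128 range quoted in this note
HYPOTHETICAL_LEADER_ELO = 1175.0  # placeholder, not an actual leaderboard rating

win_rate = expected_score(S2_PRO_ELO, HYPOTHETICAL_LEADER_ELO)
print(round(win_rate, 3))  # 0.429
```

Under these assumptions a 50-point deficit means listeners would prefer the open model in roughly 43% of pairwise comparisons, so a small Elo gap still leaves the two systems close in blind tests.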
Related
open-weight-tts · text-to-speech · tts-arena-leaderboard · neural-audio-codec · voice-cloning · kokoro