Defined Term concept, updated Fri Jun 05 2026

Open-weight TTS

The segment of text-to-speech models whose weights are downloadable and self-hostable: the speech analog of llm-providers-wiki’s open-weight-models. As of 2026, two facts define it:

1. It trails the closed frontier on Elo — but not by much

No open-weight model reaches the top tier of the tts-arena-leaderboard, which is entirely closed/API-only. The open leader is fish-audio-s2-pro (5B, Elo ~1123–1128): capable, but ranked below the Google, Cartesia, and Inworld models. The same proprietary-premium-vs-open-wedge dynamic that llm-providers-wiki tracks for text holds for voice.
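To make "not by much" concrete, an Elo gap can be converted into an expected head-to-head preference rate. The sketch below assumes the conventional Elo formula with a 400-point scale; whether tts-arena-leaderboard uses exactly this scale is an assumption, and the 50-point gap is a hypothetical placeholder, not a leaderboard figure.

```python
def expected_preference(elo_a: float, elo_b: float, scale: float = 400.0) -> float:
    """Probability that model A is preferred over model B under the standard Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((elo_b - elo_a) / scale))


# fish-audio-s2-pro sits at roughly 1123-1128 per the text; 1125 is a midpoint.
open_leader = 1125.0
# Hypothetical closed-model rating, chosen only to illustrate a small gap.
hypothetical_closed = open_leader + 50.0

p = expected_preference(hypothetical_closed, open_leader)
print(f"closed model preferred in {p:.1%} of head-to-head votes")
```

Under these assumptions, a gap of a few dozen Elo points means listeners prefer the higher-rated voice only somewhat more often than a coin flip, which is why the open leader counts as trailing but competitive.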

2. License is a first-class axis, not a footnote

The open field splits on terms:

Permissive (Apache-2.0 / MIT): kokoro, orpheus, sesame-csm, Chatterbox, Dia, Higgs Audio V2. These open-source-tts-models are free to use commercially.
Research-only / paid-commercial: fish-audio-s2-pro (the highest-Elo open model is not freely commercial) and misotts, with its “modified MIT” license.

So the practical “best open model” depends on whether you can use it commercially, not just on its score.
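Treating license as a first-class axis amounts to filtering on license before ranking on score. The following is a minimal sketch of that selection order, using only the groupings stated above. The `TTSModel` type, the `license_class` labels, and the `candidates` helper are hypothetical names introduced for illustration; Elo is left as `None` wherever the text gives no figure.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TTSModel:
    name: str
    license_class: str            # "permissive" or "restricted" (illustrative labels)
    elo: Optional[float] = None   # None where no score is stated in the text


CATALOGUE = [
    TTSModel("kokoro", "permissive"),
    TTSModel("orpheus", "permissive"),
    TTSModel("sesame-csm", "permissive"),
    TTSModel("Chatterbox", "permissive"),
    TTSModel("Dia", "permissive"),
    TTSModel("Higgs Audio V2", "permissive"),
    # Lower bound of the ~1123-1128 range quoted for the open leader.
    TTSModel("fish-audio-s2-pro", "restricted", elo=1123.0),
    TTSModel("misotts", "restricted"),
]


def candidates(need_commercial: bool) -> list[TTSModel]:
    """Filter by license first, then put models with a known Elo first (highest first)."""
    pool = [
        m for m in CATALOGUE
        if not need_commercial or m.license_class == "permissive"
    ]
    # sorted() is stable, so models without a score keep their catalogue order.
    return sorted(pool, key=lambda m: (m.elo is None, -(m.elo or 0.0)))


print([m.name for m in candidates(need_commercial=False)][:1])
print([m.name for m in candidates(need_commercial=True)])
```

With no commercial requirement the top candidate is fish-audio-s2-pro; once commercial use is required it drops out of the pool entirely, and the choice is made among the permissive models on other criteria.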

The shape of the field

Efficiency-first (kokoro 82M, no voice-cloning) → all-rounder (Chatterbox 0.5B) → expressive/streaming (orpheus) → conversational (sesame-csm) → emotive (misotts) → top-quality but restricted (fish-audio-s2-pro). Many are Llama-based — a bridge to the text-LLM market (gemini cross-wiki).

text-to-speech · tts-benchmarks · tts-arena-leaderboard · kokoro · fish-audio-s2-pro · open-source-tts-models
