The Top Open-Source Text-to-Speech Models (Modal)
Modal’s engineering roundup of the leading open-weight-tts models — the founding deep-dive on open-weight architectures and licensing for this wiki.
Models covered
- Higgs Audio V2 — 5.77B, BosonAI (Jul 2025); built on Llama 3.2 3B with a Dual-FFN design; Apache-2.0. Strong naturalness/emotion, robust voice-cloning, multi-speaker dialogue.
- kokoro v1.0 — 82M, Hexgrad (Jan 2025); Apache-2.0; smallest footprint, very cheap to run; no voice cloning.
- Dia — 1.6B, Nari Labs (Apr 2025); Apache-2.0, English-only; multi-speaker + nonverbal tags “(laughs)”; suits audiobook dialogue.
- Chatterbox — 0.5B, Resemble AI (May 2025); MIT; configurable, strong voice-cloning; the article’s recommended entry point for newcomers.
- orpheus — 3B/1B/400M/150M, Canopy Labs (Mar 2025); Apache-2.0; zero-shot voice cloning, guided emotion, real-time streaming; multilingual variants.
- sesame-csm — 1B, Sesame Labs (Feb 2025); Apache-2.0; built on Llama; multi-speaker conversational focus.
Patterns worth noting
- Llama-as-backbone is common (Higgs, Orpheus, Sesame CSM) — TTS increasingly reuses text-LLM architectures, a direct bridge to llm-providers-wiki (gemini, open-weight Llama lineage).
- Permissive licensing dominates the open field (Apache-2.0 / MIT) — unlike the research-only license on the higher-Elo fish-audio-s2-pro; license, not just quality, is a selection axis.
- A capability spread: pure-efficiency (kokoro, no cloning) → all-rounder (Chatterbox) → expressive/streaming (orpheus) → conversational (sesame-csm).
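The selection axes above (license, voice cloning, footprint) can be sketched as a small filter over the models covered. This is an illustrative sketch only: the `TTSModel` record, the `CATALOG` list, and the `shortlist` helper are hypothetical names, not part of any model's API. The values are transcribed from the list above, with `voice_cloning=None` wherever the roundup does not state it, and the largest listed variant is used for orpheus.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class TTSModel:
    name: str
    params_m: float                # parameter count in millions (largest listed variant)
    license: str
    voice_cloning: Optional[bool]  # None = not stated in the summary above


# Values transcribed from the "Models covered" list; nothing here is measured.
CATALOG = [
    TTSModel("Higgs Audio V2", 5770, "Apache-2.0", True),
    TTSModel("kokoro v1.0", 82, "Apache-2.0", False),
    TTSModel("Dia", 1600, "Apache-2.0", None),
    TTSModel("Chatterbox", 500, "MIT", True),
    TTSModel("orpheus", 3000, "Apache-2.0", True),
    TTSModel("sesame-csm", 1000, "Apache-2.0", None),
]

# Assumption: these two licenses are what "permissive" means for this note.
PERMISSIVE = {"Apache-2.0", "MIT"}


def shortlist(
    catalog: List[TTSModel],
    need_cloning: bool = False,
    max_params_m: Optional[float] = None,
) -> List[TTSModel]:
    """Return permissively licensed models that meet the constraints, smallest first."""
    picks = [
        m
        for m in catalog
        if m.license in PERMISSIVE
        # Models whose cloning support is unstated (None) are excluded when cloning is required.
        and (not need_cloning or m.voice_cloning is True)
        and (max_params_m is None or m.params_m <= max_params_m)
    ]
    return sorted(picks, key=lambda m: m.params_m)
```

For example, `shortlist(CATALOG, need_cloning=True)` keeps only the models the summary explicitly credits with voice cloning, while `shortlist(CATALOG, max_params_m=1000)` narrows the list to the models at or below 1B parameters. Treating an unstated capability as "not available" is a conservative choice; a real selection should check each model's own documentation.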
Related
open-weight-tts · text-to-speech · voice-cloning · kokoro · orpheus · sesame-csm · fish-audio-s2-pro