open-source-tts-models · speech-audio-wiki

Blog posting · updated Fri Jun 05 2026

Modal’s engineering roundup of the leading open-weight TTS models. It is the founding deep-dive on open-weight architectures and licensing for this wiki.

Models covered

Higgs Audio V2 — 5.77B, BosonAI (Jul 2025); built on Llama 3.2 3B with a Dual-FFN design; Apache-2.0. Strong naturalness/emotion, robust voice-cloning, multi-speaker dialogue.
Kokoro v1.0 — 82M, Hexgrad (Jan 2025); Apache-2.0; smallest footprint, very cheap to run; no voice cloning.
Dia — 1.6B, Nari Labs (Apr 2025); Apache-2.0, English-only; multi-speaker + nonverbal tags “(laughs)”; suits audiobook dialogue.
Chatterbox — 0.5B, Resemble AI (May 2025); MIT; configurable, strong voice-cloning; the article’s recommended entry point for newcomers.
Orpheus — 3B/1B/400M/150M, Canopy Labs (Mar 2025); Apache-2.0; zero-shot voice cloning, guided emotion, real-time streaming; multilingual variants.
Sesame CSM — 1B, Sesame Labs (Feb 2025); Apache-2.0; built on Llama; multi-speaker conversational focus.
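
To make the footprint differences concrete, the parameter counts above can be turned into a rough weight-memory estimate. This is a minimal sketch, not a measurement: it assumes 16-bit weights (2 bytes per parameter), counts only the weights, and uses the hypothetical helper name `weight_gb`.

```python
# Parameter counts as listed above (Orpheus shown at its largest, 3B, variant).
PARAMS = {
    "Higgs Audio V2": 5.77e9,
    "Kokoro v1.0": 82e6,
    "Dia": 1.6e9,
    "Chatterbox": 0.5e9,
    "Orpheus (3B)": 3e9,
    "Sesame CSM": 1e9,
}


def weight_gb(params: float, bytes_per_param: int = 2) -> float:
    """Approximate weight memory in GB (10^9 bytes).

    Assumption: 2 bytes per parameter (fp16/bf16). Activations, caches,
    and any separate audio codec are not included.
    """
    return params * bytes_per_param / 1e9


if __name__ == "__main__":
    # Print models from smallest to largest estimated footprint.
    for name, n in sorted(PARAMS.items(), key=lambda kv: kv[1]):
        print(f"{name:16s} ~{weight_gb(n):6.2f} GB")
```

Under these assumptions Kokoro needs roughly 0.16 GB for weights and Higgs Audio V2 roughly 11.5 GB. Treat the figures as lower bounds, since runtime memory is higher than weight memory alone.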

Patterns worth noting

Llama-as-backbone is common (Higgs, Orpheus, Sesame CSM) — TTS increasingly reuses text-LLM architectures, a direct bridge to llm-providers-wiki (gemini, open-weight Llama lineage).
Permissive licensing dominates the open field (Apache-2.0 / MIT) — unlike the research-only license on the higher-Elo fish-audio-s2-pro; license, not just quality, is a selection axis.
A capability spread: pure-efficiency (Kokoro, no cloning) → all-rounder (Chatterbox) → expressive/streaming (Orpheus) → conversational (Sesame CSM).
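
The point that license is a selection axis alongside capability can be sketched as a small filter. This is an illustrative sketch only: the `TTSModel` record, the `shortlist` helper, and the `PERMISSIVE` set are hypothetical names, and the capability flags encode only what this page states (`None` means the page does not say either way).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class TTSModel:
    name: str
    license: str
    voice_cloning: Optional[bool]  # None = not stated on this page


# Values transcribed from the list and notes above; nothing is measured here.
MODELS = [
    TTSModel("Higgs Audio V2", "Apache-2.0", True),
    TTSModel("Kokoro v1.0", "Apache-2.0", False),
    TTSModel("Dia", "Apache-2.0", None),
    TTSModel("Chatterbox", "MIT", True),
    TTSModel("Orpheus", "Apache-2.0", True),
    TTSModel("Sesame CSM", "Apache-2.0", None),
    TTSModel("fish-audio-s2-pro", "research-only", None),
]

# Assumption: "permissive" here means the two licenses named on this page.
PERMISSIVE = {"Apache-2.0", "MIT"}


def shortlist(models: List[TTSModel], need_cloning: bool = False) -> List[str]:
    """Return names of permissively licensed models, optionally requiring
    voice cloning that this page explicitly confirms."""
    keep = []
    for m in models:
        if m.license not in PERMISSIVE:
            continue  # license filter comes first, regardless of quality
        if need_cloning and m.voice_cloning is not True:
            continue  # unknown (None) is treated as "not confirmed"
        keep.append(m.name)
    return keep


if __name__ == "__main__":
    print(shortlist(MODELS, need_cloning=True))
```

Treating an unstated capability as unconfirmed is a conservative design choice; a real selection would check each model card before ruling a model out.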

open-weight-tts · text-to-speech · voice-cloning · kokoro · orpheus · sesame-csm · fish-audio-s2-pro

The Top Open-Source Text-to-Speech Models (Modal)
