Software Application · updated 2026-06-05

MisoTTS

Miso Labs’ 8B open-weight emotive text-to-speech model (released 2026-06-03; modified MIT license, weights from day one). The seed source that opened this wiki’s cluster (parked in the hub _inbox, then migrated here on spin-out).

Architecture

Two transformers: a 7.7B backbone (autoregressive over time) + a 300M decoder (autoregressive over depth).
Generates Mimi audio codes (a neural-audio-codec) via residual vector quantization (RVQ): 32 codebooks × 2,048 entries each → 2048^32 = 2^352 ≈ 9×10^105 (~10^105) addressable code combinations per frame “without adding parameters.” (Scaling addressable sound via stacked codebooks instead of a larger flat vocabulary.)
Text vocab 128,256 tokens; max sequence length 2,048; default inference dtype torch.bfloat16.
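
As a back-of-envelope check on the figures above, the Python sketch below works out the size of the RVQ code space and a rough weights-only memory footprint. The parameter counts are the rounded figures quoted in these notes, and the 2 bytes per parameter follows from bfloat16. The estimate ignores activations, any attention cache, and the Mimi codec itself. The constant names are illustrative and not taken from the MisoTTS code.

```python
import math

# Figures quoted in the notes above (parameter counts are rounded).
NUM_CODEBOOKS = 32
CODEBOOK_SIZE = 2048
BACKBONE_PARAMS = 7_700_000_000
DECODER_PARAMS = 300_000_000
BYTES_PER_PARAM_BF16 = 2  # bfloat16 is a 16-bit format

# One audio frame is one code per codebook, so the number of distinct
# frames is CODEBOOK_SIZE raised to the number of codebooks.
combos_per_frame = CODEBOOK_SIZE ** NUM_CODEBOOKS
bits_per_frame = NUM_CODEBOOKS * int(math.log2(CODEBOOK_SIZE))

# Distinct (codebook, entry) pairs: the model only ever chooses among
# 2048 options at a time, 32 times per frame.
entries_total = NUM_CODEBOOKS * CODEBOOK_SIZE

# Weights-only footprint at 2 bytes per parameter, in decimal gigabytes.
weight_gb = (BACKBONE_PARAMS + DECODER_PARAMS) * BYTES_PER_PARAM_BF16 / 1e9

print(f"bits per frame:        {bits_per_frame}")
print(f"combinations per frame: {combos_per_frame:.2e}")
print(f"codebook entries total: {entries_total}")
print(f"bf16 weights (GB):      {weight_gb:.1f}")
```

The gap between the 65,536 codebook entries and the roughly 10^105 combinations they address is what the "without adding parameters" claim refers to.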

Performance & limits

Claimed latency of 110 ms, vs. 700 ms for ElevenLabs and 300 ms for Sesame (tts-benchmarks; vendor figure).
Single-turn, half-duplex (“no turn-taking yet”); API announced but not yet available.

Place in the field

Emotive/expressive synthesis with a permissive-ish (open-weight-tts) license — competing on latency + expressivity. Its RVQ/Mimi codec approach is shared with fish-audio-s2-pro (dual-AR + RVQ); the “AR-over-time + AR-over-depth” split echoes the codec-token TTS design now common across the open field.
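
The "AR-over-time + AR-over-depth" split can be illustrated with a toy generation loop. This is a minimal sketch of the general control flow only, not Miso Labs' implementation: `toy_backbone` and `toy_decoder` are hypothetical stand-ins (simple integer mixing functions) for the two transformers, and the sketch assumes the small decoder emits every codebook level of a frame, which the source does not specify.

```python
NUM_CODEBOOKS = 32    # RVQ depth quoted in the notes
CODEBOOK_SIZE = 2048  # entries per codebook quoted in the notes
MOD = 1_000_003       # arbitrary prime used by the toy mixing functions


def toy_backbone(frames):
    """Hypothetical stand-in for the large model: fold all past frames into one state."""
    state = 17
    for frame in frames:
        for code in frame:
            state = (state * 31 + code) % MOD
    return state


def toy_decoder(state, partial_frame):
    """Hypothetical stand-in for the small model: pick the next code within one frame."""
    mix = state
    for code in partial_frame:
        mix = (mix * 131 + code + 1) % MOD
    return (mix * 7919 + len(partial_frame)) % CODEBOOK_SIZE


def generate(num_frames):
    frames = []
    for _ in range(num_frames):         # outer loop: autoregressive over time
        state = toy_backbone(frames)    # conditions on every earlier frame
        frame = []
        for _ in range(NUM_CODEBOOKS):  # inner loop: autoregressive over depth
            # Each code depends on the frame state and the shallower codes so far.
            frame.append(toy_decoder(state, frame))
        frames.append(frame)
    return frames


frames = generate(4)
print(len(frames), "frames x", len(frames[0]), "codes per frame")
```

The point of the structure is that the expensive model runs once per frame, while the cheap one runs once per codebook level inside that frame.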

text-to-speech · neural-audio-codec · open-weight-tts · fish-audio-s2-pro · tts-benchmarks
