MisoTTS
Miso Labs’ 8B open-weight emotive text-to-speech model (released 2026-06-03;
modified MIT license, weights available from day one). This was the seed source that opened this wiki’s cluster
(parked in the hub _inbox, then migrated here on spin-out).
Architecture
- Two transformers: a 7.7B backbone (autoregressive over time) + a 300M decoder (autoregressive over depth).
- Generates Mimi audio codes (Mimi is a neural-audio-codec) via residual vector quantization (RVQ): 32 codebooks × 2048 entries each → 2048^32 = 2^352 ≈ 9 × 10^105 addressable code combinations per frame (the source rounds this to ~10^105 “audio tokens”), “without adding parameters.” The addressable sound space scales by stacking codebooks instead of growing one larger flat vocabulary.
- Text vocab 128,256 tokens; max sequence length 2,048; default inference dtype torch.bfloat16.
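
The figures in the bullets above can be checked with a few lines of arithmetic. This is a rough sketch only: the constants are copied from this note, the "flat vocabulary" comparison and the weights-only memory estimate are illustrative assumptions (2 bytes per bfloat16 parameter, ignoring activations, KV cache, and the codec itself), and none of it is taken from Miso Labs' code.

```python
import math

# Constants copied from the note above.
NUM_CODEBOOKS = 32                 # RVQ depth
CODEBOOK_SIZE = 2048               # entries per codebook
BACKBONE_PARAMS = 7_700_000_000    # 7.7B backbone
DECODER_PARAMS = 300_000_000       # 300M depth decoder

# One audio frame is one index per codebook, so the number of
# distinct frames is CODEBOOK_SIZE ** NUM_CODEBOOKS.
combinations = CODEBOOK_SIZE ** NUM_CODEBOOKS
bits_per_frame = NUM_CODEBOOKS * math.log2(CODEBOOK_SIZE)
log10_combinations = NUM_CODEBOOKS * math.log10(CODEBOOK_SIZE)

# Output size the model has to score: 32 softmaxes of 2048 entries,
# versus a single flat softmax over every possible frame.
rvq_logits = NUM_CODEBOOKS * CODEBOOK_SIZE

# Assumption: bfloat16 stores each parameter in 2 bytes (weights only).
total_params = BACKBONE_PARAMS + DECODER_PARAMS
weight_bytes = total_params * 2

print(f"bits per frame:        {bits_per_frame:.0f}")
print(f"log10(combinations):   {log10_combinations:.2f}")
print(f"logits per frame, RVQ: {rvq_logits}")
print(f"bf16 weights:          {weight_bytes / 1e9:.1f} GB")
```

The point of the comparison is that the model scores 32 × 2048 = 65,536 logits per frame while still addressing 2^352 distinct frames, which is what "without adding parameters" appears to refer to.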
Performance & limits
- Claimed 110 ms latency vs. 700 ms for ElevenLabs and 300 ms for Sesame (vendor figure; see tts-benchmarks).
- Single-turn, half-duplex (“no turn-taking yet”); API announced but not yet available.
Place in the field
Emotive/expressive synthesis under a permissive-ish license (open-weight-tts), competing on latency and expressivity. Its RVQ/Mimi codec approach is shared with fish-audio-s2-pro (dual-AR + RVQ); the “AR-over-time + AR-over-depth” split echoes the codec-token TTS design now common across the open field.
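
To make the "AR-over-time + AR-over-depth" split concrete, here is a toy sketch of the control flow only. Both model functions are hypothetical stand-ins that return random values; the names `backbone_step`, `depth_decoder_step`, and `generate` are invented for illustration and do not correspond to MisoTTS or any library API. Only the loop nesting reflects the design described in this note.

```python
import random

NUM_CODEBOOKS = 32     # RVQ depth, from the note
CODEBOOK_SIZE = 2048   # entries per codebook, from the note


def backbone_step(frames, rng):
    """Hypothetical stand-in for the large backbone.

    A real backbone would attend over the text and all previous audio
    frames and return a hidden state for the next frame. Here it just
    returns a random number.
    """
    return rng.random()


def depth_decoder_step(hidden, codes_so_far, rng):
    """Hypothetical stand-in for the small depth decoder.

    A real decoder would predict the next codebook index conditioned on
    the frame's hidden state and the indices already chosen for this
    frame. Here it just samples uniformly.
    """
    return rng.randrange(CODEBOOK_SIZE)


def generate(num_frames, seed=0):
    rng = random.Random(seed)
    frames = []
    for _ in range(num_frames):            # autoregressive over time
        hidden = backbone_step(frames, rng)
        codes = []
        for _ in range(NUM_CODEBOOKS):     # autoregressive over depth
            codes.append(depth_decoder_step(hidden, codes, rng))
        frames.append(codes)
    return frames


frames = generate(num_frames=3)
print(len(frames), len(frames[0]))
```

In this arrangement the expensive model runs once per frame and the cheap one runs 32 times per frame, which is one plausible reason the parameter budget is split so unevenly between the two.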
Related
text-to-speech · neural-audio-codec · open-weight-tts · fish-audio-s2-pro · tts-benchmarks