Neural audio codec (discrete audio tokens)
The representation most modern text-to-speech models generate: instead of predicting a raw waveform, the model predicts discrete audio tokens from a learned neural codec, which a decoder turns into sound. This is what lets a transformer “do speech as language modeling.”
Residual vector quantization (RVQ)
The dominant scheme. Each audio frame is encoded as several codebook indices rather than one index from a giant flat vocabulary: the first codebook quantizes the frame, each later codebook quantizes the residual error left by the ones before it, and the selected vectors sum to reconstruct the sound. The payoff, per misotts: 32 codebooks × 2048 entries → 2048^32 (~10^105) addressable code combinations per frame “without adding parameters”, far more sonic range than scaling a single vocabulary would allow.
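A minimal sketch of the residual idea, assuming random (untrained) codebooks and a toy 8-dimensional frame. The names `rvq_encode` and `rvq_decode` and all sizes are illustrative assumptions, not any codec's actual API; a real codec learns its codebooks jointly with an encoder and decoder.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Return one index per codebook; each stage quantizes what the previous stages missed."""
    residual = x.astype(np.float64)
    indices = []
    for cb in codebooks:  # cb has shape (entries, dim)
        dists = np.sum((cb - residual) ** 2, axis=1)  # squared distance to every entry
        idx = int(np.argmin(dists))                   # nearest entry to the current residual
        indices.append(idx)
        residual = residual - cb[idx]                 # pass the leftover error to the next stage
    return indices

def rvq_decode(indices, codebooks):
    """Reconstruct the frame by summing the selected entry from each codebook."""
    return sum(cb[i] for cb, i in zip(codebooks, indices))

rng = np.random.default_rng(0)
dim, n_codebooks, n_entries = 8, 4, 16  # toy sizes, chosen for illustration only
# Assumption: random codebooks with shrinking scale stand in for learned ones.
codebooks = [rng.normal(scale=0.5 ** k, size=(n_entries, dim)) for k in range(n_codebooks)]

frame = rng.normal(size=dim)
codes = rvq_encode(frame, codebooks)   # a list of n_codebooks integers
recon = rvq_decode(codes, codebooks)   # approximate reconstruction of `frame`

# The combinatorial count quoted above: 32 codebooks of 2048 entries each.
combos = 2048 ** 32
print(codes, len(str(combos)))
```

The frame is represented by `n_codebooks` small integers, yet the number of distinct reconstructions grows as entries raised to the number of codebooks, which is where the 2048^32 figure comes from.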
Seen in the wild
- Mimi codes via RVQ — misotts (8B; an AR-over-time backbone + AR-over-depth decoder over the codebooks).
- Dual-autoregressive + RVQ — fish-audio-s2-pro’s architecture.
- Codec tokens underlie the broader “TTS on an LLM backbone” pattern (Llama-based orpheus, sesame-csm, Higgs Audio V2) — text and audio both become next-token prediction.
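The "AR over time, then AR over depth" pattern mentioned in the list above can be pictured as two nested loops. This is a toy sketch only: `backbone_step` and `depth_step` are hypothetical stand-ins that return arbitrary numbers, not the networks used by misotts, fish-audio-s2-pro, or any other model, and the sizes are far smaller than the 32 × 2048 configuration cited earlier.

```python
import numpy as np

N_CODEBOOKS = 4   # toy depth
N_ENTRIES = 16    # toy codebook size

rng = np.random.default_rng(0)

def backbone_step(history):
    """Hypothetical time-axis model: summarizes all past frames into a context vector."""
    if not history:
        return np.zeros(N_CODEBOOKS)
    return np.mean(np.array(history, dtype=np.float64), axis=0)

def depth_step(context, partial_frame, level):
    """Hypothetical depth-axis decoder: logits for codebook `level`,
    conditioned on the context and the codes already chosen in this frame."""
    bias = context.sum() + sum(partial_frame) + level
    return rng.normal(size=N_ENTRIES) + 0.01 * bias

def generate(n_frames):
    frames = []
    for _ in range(n_frames):                 # autoregressive over time
        context = backbone_step(frames)
        frame = []
        for level in range(N_CODEBOOKS):      # autoregressive over depth (codebooks)
            logits = depth_step(context, frame, level)
            frame.append(int(np.argmax(logits)))  # greedy pick; real systems typically sample
        frames.append(frame)
    return frames

tokens = generate(5)  # 5 frames, each a list of N_CODEBOOKS codebook indices
print(tokens)
```

Each frame's codes would then go to the codec decoder, which sums the selected codebook vectors and renders audio.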
Cross-wiki note
RVQ here quantizes audio into tokens; the related idea of quantizing model weights to low precision is covered under quantization (llm-providers-wiki). Both are forms of quantization, but the targets differ (audio signal vs. model weights), so the two should not be conflated.
Related
text-to-speech · misotts · fish-audio-s2-pro · orpheus · sesame-csm