
speech-audio-wiki

Defined Term · mechanism · updated 2026-06-05

Neural audio codec (discrete audio tokens)

The representation most modern text-to-speech models generate: instead of predicting a raw waveform, the model predicts discrete audio tokens from a learned neural codec, which a decoder turns into sound. This is what lets a transformer “do speech as language modeling.”
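To make the "speech as language modeling" idea concrete, here is a minimal sketch of one way a grid of codec indices could be laid out as a single token sequence for a transformer and then recovered for the decoder. It assumes the codec emits several indices per frame (as in the RVQ scheme described below) and gives each codebook its own slice of the vocabulary via an offset. The function names and the frame-by-frame interleaving are illustrative assumptions, not the layout any specific model on this wiki is confirmed to use.

```python
def flatten_codes(codes, codebook_size):
    """Turn a per-frame grid of codebook indices into one token sequence.

    codes: list of frames, each a list with one index per codebook.
    Hypothetical layout: codebook `level` owns the vocabulary range
    [level * codebook_size, (level + 1) * codebook_size).
    """
    tokens = []
    for frame in codes:
        for level, idx in enumerate(frame):
            tokens.append(level * codebook_size + idx)
    return tokens


def unflatten_tokens(tokens, num_codebooks, codebook_size):
    """Invert flatten_codes so a codec decoder could consume the grid."""
    frames = []
    for start in range(0, len(tokens), num_codebooks):
        chunk = tokens[start:start + num_codebooks]
        frames.append(
            [tok - level * codebook_size for level, tok in enumerate(chunk)]
        )
    return frames


# Two frames, three codebooks of 8 entries each (toy sizes).
codes = [[5, 1, 7], [0, 3, 2]]
tokens = flatten_codes(codes, codebook_size=8)
restored = unflatten_tokens(tokens, num_codebooks=3, codebook_size=8)
```

Here `tokens` is the flat integer sequence a language model would be trained to predict, and `restored` is the grid that would be handed back to the codec decoder to produce audio.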

Residual vector quantization (RVQ)

The dominant scheme. Each audio frame is encoded as several codebook indices rather than one index from a giant flat vocabulary: each codebook quantizes the residual left over by the codebooks before it, and the selected entries sum to reconstruct the sound. The payoff, per misotts: 32 codebooks × 2048 entries → ~10^105 addressable audio tokens (2048^32 index combinations per frame) “without adding parameters”, which is far more sonic range than scaling a single vocabulary would allow.
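The residual step can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions: the codebooks are random with shrinking scale (real codecs learn them jointly with the encoder and decoder), the sizes are tiny, and `rvq_encode` / `rvq_decode` are hypothetical helper names. The last lines check the size of the combination count for the 32 × 2048 figures quoted above.

```python
import numpy as np


def rvq_encode(x, codebooks):
    """Quantize one frame vector with a stack of codebooks.

    codebooks: array of shape (num_codebooks, codebook_size, dim).
    Returns one index per codebook.
    """
    residual = x.astype(np.float64).copy()
    indices = []
    for codebook in codebooks:
        # Pick the entry closest to what is still unexplained.
        distances = np.sum((codebook - residual) ** 2, axis=1)
        idx = int(np.argmin(distances))
        indices.append(idx)
        # The next codebook only sees the leftover error.
        residual = residual - codebook[idx]
    return indices


def rvq_decode(indices, codebooks):
    """Reconstruct the frame by summing the selected entries."""
    return sum(codebook[idx] for codebook, idx in zip(codebooks, indices))


rng = np.random.default_rng(0)
num_codebooks, codebook_size, dim = 4, 16, 8
# Random stand-ins for learned codebooks (an assumption of this sketch).
codebooks = np.stack([
    rng.normal(scale=0.5 ** level, size=(codebook_size, dim))
    for level in range(num_codebooks)
])

frame = rng.normal(size=dim)
codes = rvq_encode(frame, codebooks)
recon = rvq_decode(codes, codebooks)

# Distinct index combinations per frame for 32 codebooks of 2048 entries.
combinations = 2048 ** 32
digits = len(str(combinations))
```

With these figures, `2048 ** 32` equals `2 ** 352`, a 106-digit integer (roughly 9 × 10^105), which matches the order of magnitude quoted from misotts. Storing 32 codebooks of 2048 vectors takes 65,536 vectors in total, whereas a flat vocabulary covering the same number of combinations would need one entry per combination.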

Seen in the wild

Cross-wiki note

RVQ here quantizes audio into discrete tokens. Quantizing model weights to low precision is a separate topic, covered under quantization (llm-providers-wiki). Both belong to the broader quantization family but act on different targets, so the two should not be conflated.

text-to-speech · misotts · fish-audio-s2-pro · orpheus · sesame-csm