WaveNet
WaveNet (DeepMind, 2016) is “a deep neural network for generating raw audio”: the breakthrough that made neural text-to-speech sound human, and the ancestor of the modern stack this wiki tracks. The spoke documents today’s codec-token / SALM models; this page supplies their origin point. Source: Wikipedia.
Architecture
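WaveNet generates audio autoregressively, one sample at a time, via “a stack of dilated causal convolutional layers”. Causal means each output depends only on current and past samples; dilation spaces the filter taps apart so the receptive field grows quickly with depth. The network models the raw waveform directly instead of concatenating recorded fragments, which marks the leap from concatenative synthesis to learned, end-to-end synthesis.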
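To make the "dilated causal" idea concrete, here is a minimal NumPy sketch, not WaveNet's actual implementation. It assumes a kernel size of 2 and dilations that double per layer, and it leaves out the nonlinearities, gating, and residual/skip connections a real model would have. The names `causal_dilated_conv` and `receptive_field` are hypothetical helpers introduced for illustration.

```python
import numpy as np

def causal_dilated_conv(x, w, dilation):
    """Kernel-size-2 causal convolution: y[t] = w[0]*x[t-dilation] + w[1]*x[t].

    Samples before the start of the signal are treated as zero, so y[t]
    never depends on any x[s] with s > t. Requires dilation >= 1.
    """
    x = np.asarray(x, dtype=float)
    padded = np.concatenate([np.zeros(dilation), x])
    # padded[:-dilation] lines up x[t - dilation] with position t;
    # padded[dilation:] is just x itself.
    return w[0] * padded[:-dilation] + w[1] * padded[dilation:]

def receptive_field(dilations, kernel_size=2):
    """Number of input samples that can influence one output of the stack."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

def run_stack(x, dilations, w=(1.0, 1.0)):
    """Apply the layers in order (linear only; illustrative assumption)."""
    y = np.asarray(x, dtype=float)
    for d in dilations:
        y = causal_dilated_conv(y, w, d)
    return y

if __name__ == "__main__":
    dilations = [1, 2, 4]
    impulse = np.zeros(12)
    impulse[0] = 1.0
    # With all-ones weights the impulse spreads over exactly the receptive field.
    print(run_stack(impulse, dilations))
    print(receptive_field(dilations))
    # Doubling dilations from 1 to 512 covers 1024 samples with only 10 layers.
    print(receptive_field([2 ** i for i in range(10)]))
```

The design point is that the receptive field grows exponentially with depth when dilations double, while the parameter count grows only linearly, which is what makes modeling long stretches of raw waveform tractable.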
Impact & the latency lesson
WaveNet was productionized for Google Assistant voices. It also shows the spoke’s central latency axis in its original form: sample-by-sample generation initially “required too much computational processing power”, which forced the development of Parallel WaveNet (2017), “more than 20× faster than real-time.” The same tension between autoregressive quality and real-time latency is what today’s RVQ codec TTS and orpheus/kokoro still negotiate, and it is why TTFA (time to first audio) is a product axis. WaveNet is where the “LLM eats audio” arc begins: next-sample modeling becomes next-codec-token modeling.
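A rough back-of-the-envelope sketch shows why sample-by-sample generation is costly. The numbers are illustrative assumptions, not figures from the source: a 16 kHz sample rate for raw audio and a hypothetical 50 frames per second for a codec-token model. The function names are likewise made up for this example.

```python
def sequential_steps(duration_s, steps_per_second):
    """Number of dependent model calls needed to produce the audio.

    In autoregressive generation each step waits on the previous one,
    so these calls cannot be run in parallel along the time axis.
    """
    return round(duration_s * steps_per_second)

def per_step_budget_us(steps_per_second):
    """Microseconds available per step to keep pace with real time."""
    return 1e6 / steps_per_second

def real_time_factor(audio_s, compute_s):
    """Seconds of audio produced per second of compute (>1 is faster than real time)."""
    return audio_s / compute_s

if __name__ == "__main__":
    SAMPLE_RATE_HZ = 16_000   # assumed raw-audio sample rate
    CODEC_FRAME_HZ = 50       # hypothetical codec frame rate

    print(sequential_steps(1.0, SAMPLE_RATE_HZ))   # steps per second of raw audio
    print(per_step_budget_us(SAMPLE_RATE_HZ))      # budget per raw sample, in microseconds
    print(sequential_steps(1.0, CODEC_FRAME_HZ))   # steps per second of codec frames
    print(per_step_budget_us(CODEC_FRAME_HZ))      # budget per codec frame, in microseconds
    print(real_time_factor(20.0, 1.0))             # 20 s of audio in 1 s of compute
```

Under these assumptions, one second of raw audio needs 16,000 dependent steps with a 62.5 µs budget each, while a 50 Hz token stream needs 50 steps with a 20 ms budget each. A codec model that emits several RVQ codebooks per frame may need additional steps per frame, which this sketch does not count.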
Related
text-to-speech · neural-audio-codec · speech-audio-ai · voice-cloning · audio-deepfake