Defined Term mechanism, updated Fri Jun 05 2026 (UTC)

Voice cloning

Synthesizing speech in a target speaker’s voice from a short reference sample — ideally zero-shot (no per-voice fine-tuning). A major capability axis that splits the text-to-speech field.
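A toy sketch of what "zero-shot" means in practice: the target voice enters only as an inference-time input, summarized from the reference sample, and no model weights are updated per voice. Everything here is hypothetical and for illustration only: `embed_speaker`, `SynthesisRequest`, and `build_request` are made-up names, and the chunk-averaging "encoder" is a stand-in for the learned speaker encoder or reference-audio prompt a real model would use.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class SynthesisRequest:
    """Hypothetical request: text plus a fixed-length summary of the voice."""
    text: str
    speaker_embedding: Tuple[float, ...]


def embed_speaker(reference: Sequence[float], dims: int = 4) -> Tuple[float, ...]:
    """Toy stand-in for a speaker encoder (assumption, not a real model).

    Splits the reference samples into `dims` equal chunks and averages each,
    yielding a fixed-length vector regardless of reference length.
    Trailing samples that do not fill a whole chunk are ignored.
    """
    if dims <= 0:
        raise ValueError("dims must be positive")
    if len(reference) < dims:
        raise ValueError("reference sample too short")
    chunk = len(reference) // dims
    return tuple(
        sum(reference[i * chunk:(i + 1) * chunk]) / chunk for i in range(dims)
    )


def build_request(text: str, reference: Sequence[float]) -> SynthesisRequest:
    # Zero-shot: the reference is consumed at inference time only;
    # there is no per-voice training or fine-tuning step.
    return SynthesisRequest(text=text, speaker_embedding=embed_speaker(reference))


request = build_request("Hello there.", [0.0, 2.0, 4.0, 6.0, 1.0, 1.0, 3.0, 5.0])
```

A fixed-voice model, by contrast, would accept only `text` (or a voice ID from a preset list), which is the interface-level form of the capability split described here.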

Who has it (and who doesn’t)

So “can it clone a voice?” is a useful dividing line: models that support cloning tend to be larger and to condition on reference audio, while the smallest efficiency-focused models drop the capability.

Adjacent capabilities

Cloning sits alongside emotion/style control (guided emotion in orpheus; Hume Octave 2 inferring emotion from meaning) and multi-speaker dialogue (Dia, sesame-csm). Together these “controllability” features, not just raw naturalness, increasingly differentiate models.

Note

Voice cloning carries obvious consent and impersonation risk. The wiki tracks it as a technical capability, but it is the one most entangled with abuse, licensing, and consent questions.

text-to-speech · orpheus · sesame-csm · fish-audio-s2-pro · kokoro · open-source-tts-models