Defined Term · mechanism · updated 2026-06-05

Voice cloning

Synthesizing speech in a target speaker’s voice from a short reference sample — ideally zero-shot (no per-voice fine-tuning). A major capability axis that splits the text-to-speech field.

Who has it (and who doesn’t)

Has zero-shot cloning: orpheus, Chatterbox (strong), CosyVoice 2, IndexTTS-2 (which also separates timbre from emotion for dubbing), fish-audio-s2-pro (see open-source-tts-models).
No cloning by design: kokoro, which ships fixed preset voices and trades the capability for a tiny 82M-parameter footprint (see tts-models-2026-benchmark).

So “can it clone a voice?” is a clean dividing line: cloning tends to appear in larger models that condition on reference audio, while the smallest efficiency-focused models drop it.
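
The split above can be written down as a small capability registry. This is only an illustrative sketch: the `TTSModel` class, its field names, and `split_by_cloning` are hypothetical helpers invented for this example, not part of any model's API, and the entries simply restate the lists in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TTSModel:
    """Hypothetical record for one model; fields are illustrative only."""
    name: str
    zero_shot_cloning: bool
    note: str = ""


# Entries restate the "Who has it" lists above; nothing here is measured.
MODELS = [
    TTSModel("orpheus", True),
    TTSModel("Chatterbox", True, "strong cloning"),
    TTSModel("CosyVoice 2", True),
    TTSModel("IndexTTS-2", True, "separates timbre from emotion for dubbing"),
    TTSModel("fish-audio-s2-pro", True),
    TTSModel("kokoro", False, "fixed preset voices, 82M footprint"),
]


def split_by_cloning(models):
    """Partition model names by whether they offer zero-shot cloning."""
    cloners = [m.name for m in models if m.zero_shot_cloning]
    fixed_voice = [m.name for m in models if not m.zero_shot_cloning]
    return cloners, fixed_voice


cloners, fixed_voice = split_by_cloning(MODELS)
print("zero-shot cloning:", ", ".join(cloners))
print("no cloning:", ", ".join(fixed_voice))
```

Keeping cloning as an explicit boolean per model makes the dividing line a one-line filter, and the same record could later carry the other controllability flags if the wiki chose to track them that way.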

Adjacent capabilities

Cloning sits alongside emotion/style control (guided emotion in orpheus; Hume Octave 2 inferring emotion from meaning) and multi-speaker dialogue (Dia, sesame-csm). Together these “controllability” features, not just raw naturalness, increasingly differentiate models.

Note

Voice cloning carries obvious consent/impersonation risk; the wiki tracks it as a technical capability, but it is the capability most entangled with abuse and licensing/consent questions.

Related

text-to-speech · orpheus · sesame-csm · fish-audio-s2-pro · kokoro · open-source-tts-models
