Voice cloning
Synthesizing speech in a target speaker’s voice from a short reference sample, ideally zero-shot (no per-voice fine-tuning). It is a major capability axis that splits the text-to-speech field.
Who has it (and who doesn’t)
- Has zero-shot cloning: orpheus, Chatterbox (strong), CosyVoice 2, IndexTTS-2 (which also separates timbre from emotion for dubbing), and fish-audio-s2-pro (source: open-source-tts-models).
- No cloning by design: kokoro, which ships fixed preset voices and trades the capability for a tiny 82M-parameter footprint (source: tts-models-2026-benchmark).
So “can it clone a voice?” is a clean dividing line: presence of cloning often correlates with larger models and conditioning on reference audio, while the smallest efficiency models drop it.
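As a rough illustration, the split above can be written down as a small capability table and queried. This is only a sketch: the `TTSModel` class, its field names, and the helper functions are hypothetical names introduced here, and the entries simply restate what this page lists. Parameter counts are left empty wherever the page does not give one.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class TTSModel:
    """One row of the capability table (hypothetical structure)."""
    name: str
    zero_shot_cloning: bool
    params_millions: Optional[int] = None  # filled only where this page states a size
    note: str = ""


# Entries restate the lists on this page; nothing else is assumed.
MODELS: List[TTSModel] = [
    TTSModel("orpheus", True),
    TTSModel("Chatterbox", True, note="strong cloning"),
    TTSModel("CosyVoice 2", True),
    TTSModel("IndexTTS-2", True, note="separates timbre from emotion"),
    TTSModel("fish-audio-s2-pro", True),
    TTSModel("kokoro", False, params_millions=82, note="fixed preset voices"),
]


def can_clone(models: List[TTSModel]) -> List[str]:
    """Names of models listed as supporting zero-shot cloning."""
    return [m.name for m in models if m.zero_shot_cloning]


def preset_only(models: List[TTSModel]) -> List[str]:
    """Names of models listed as shipping without cloning."""
    return [m.name for m in models if not m.zero_shot_cloning]


if __name__ == "__main__":
    print("cloning:", can_clone(MODELS))
    print("preset only:", preset_only(MODELS))
```

Keeping the cloning flag as an explicit field makes the "dividing line" a one-line filter, and further capability flags could be added as extra fields in the same way.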
Adjacent capabilities
Cloning sits alongside emotion/style control (guided emotion in orpheus; Hume Octave 2 inferring emotion from meaning) and multi-speaker dialogue (Dia, sesame-csm). Together these “controllability” features, not just raw naturalness, increasingly differentiate models.
Note
Voice cloning carries obvious consent/impersonation risk; the wiki tracks it as a technical capability, but it is the capability most entangled with abuse and licensing/consent questions.
Related
text-to-speech · orpheus · sesame-csm · fish-audio-s2-pro · kokoro · open-source-tts-models