ElevenLabs Expressive Mode (Conversational AI)
ElevenLabs’ real-time conversational voice-agent offering: the company’s push from narration TTS into interactive, emotion-aware dialogue agents. It is the second composite/applied task in this spoke, beside S2ST, because it fuses recognition, reasoning, and synthesis into one low-latency loop. The source is a vendor landing/signup page (an ad), so its claims are first-party marketing (tier T3); the stack facts are product specs.
The stack
- TTS: Eleven V3 Conversational (the model required for Expressive Mode) — ElevenLabs’ emotive text-to-speech line, here tuned for turn-by-turn dialogue.
- STT: Scribe v2 Realtime — streaming speech-to-text (the realtime sibling of the Scribe model whose ~3.3% EN WER the stt-apis-comparison thread cites).
- Latency: “ultra-low latency” / “real-time responsiveness” (vendor-stated, conditions unspecified). Latency is the make-or-break axis for conversation.
- Languages: 70+, with called-out improvements for Hindi, Japanese, Spanish, Arabic.
What’s actually new — emotion as a quality axis
The pitch is expressivity from prosody, not just text: the agent infers emotion from how something is said (pitch, pacing, exclamations) and applies “tone cue cards” (reassuring / apologetic / enthusiastic) to match context. This adds an emotion/affect dimension to the spoke’s quality axes (Elo/MOS/WER/latency) — the first source here to make paralinguistic understanding the headline differentiator rather than raw audio fidelity or accuracy.
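As a rough illustration of the idea, and not ElevenLabs’ implementation (which the page does not describe), choosing a tone cue from prosodic features could be sketched as a small rule table. The `Prosody` fields, the thresholds, the `is_complaint` flag, the `pick_tone_cue` function, and the "neutral" fallback are all hypothetical; only the three cue labels (reassuring / apologetic / enthusiastic) come from the page.

```python
from dataclasses import dataclass


@dataclass
class Prosody:
    """Hypothetical per-utterance features; names and scales are assumptions."""
    pitch_variation: float  # assumed normalized to 0..1
    speaking_rate: float    # assumed words per second
    exclamations: int       # exclamations detected in the utterance


def pick_tone_cue(p: Prosody, is_complaint: bool) -> str:
    """Map how something was said, plus a coarse content flag, to a tone cue.

    Thresholds are illustrative placeholders, not vendor values.
    """
    agitated = p.pitch_variation > 0.6 or p.speaking_rate > 3.0
    if is_complaint and agitated:
        return "apologetic"
    if is_complaint:
        return "reassuring"
    if agitated and p.exclamations > 0:
        return "enthusiastic"
    return "neutral"  # hypothetical fallback, not a cue named on the page


if __name__ == "__main__":
    print(pick_tone_cue(Prosody(0.8, 3.5, 2), is_complaint=False))
    print(pick_tone_cue(Prosody(0.8, 3.5, 0), is_complaint=True))
```

A production system would presumably learn this mapping rather than hard-code it; the sketch only shows that the cue depends on delivery (pitch, pacing, exclamations) as well as on the words.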
Why it matters here
- Reinforces the closed-frontier read. A polished, proprietary, integrated conversational product — exactly the commercial archetype the open wedge competes against, now bundling STT+TTS into one agent rather than selling components.
- A second composite task. Like Gemini Live S2ST, it shows the branches compose into real-time products — here STT→LLM→TTS as a conversational loop, where gemini-live-3-5-translate was a translation loop.
- Sharpens the latency open question. Another “ultra-low latency” claim with unstated conditions — more weight on the synthesis’s call for a neutral latency benchmark.
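To make the STT→LLM→TTS loop and the latency question concrete, here is a minimal sketch of one conversational turn with per-stage timing. The three stage functions are local stand-ins, not calls to Scribe v2 Realtime, any LLM, or Eleven V3 Conversational; `transcribe`, `respond`, `synthesize`, `timed`, and `run_turn` are hypothetical names chosen for illustration.

```python
import time
from typing import Callable, Dict, Tuple


def transcribe(audio: bytes) -> str:
    """Stand-in for streaming STT (assumption: audio is UTF-8 text here)."""
    return audio.decode("utf-8")


def respond(text: str) -> str:
    """Stand-in for the LLM reasoning step."""
    return f"You said: {text}"


def synthesize(text: str) -> bytes:
    """Stand-in for TTS; returns fake 'audio' bytes."""
    return text.encode("utf-8")


def timed(stage: str, fn: Callable, arg, timings: Dict[str, float]):
    """Run one stage and record its wall-clock duration in milliseconds."""
    start = time.perf_counter()
    out = fn(arg)
    timings[stage] = (time.perf_counter() - start) * 1000.0
    return out


def run_turn(audio: bytes) -> Tuple[bytes, Dict[str, float]]:
    """One turn of the loop: recognition, then reasoning, then synthesis."""
    timings: Dict[str, float] = {}
    text = timed("stt_ms", transcribe, audio, timings)
    reply = timed("llm_ms", respond, text, timings)
    speech = timed("tts_ms", synthesize, reply, timings)
    timings["total_ms"] = sum(timings.values())
    return speech, timings


if __name__ == "__main__":
    speech, timings = run_turn(b"hello")
    print(speech, timings)
```

A neutral benchmark would need more than this: streaming stages overlap rather than run back to back, so time to first audio byte matters more than the sum of stage times, and network and hardware conditions would have to be stated. The sketch only shows where the measurement points sit in the loop.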
Caveat
Marketing page: the CSAT/NPS/conversion and latency claims it makes are unverified vendor figures. “V3 Conversational” and “Scribe v2 Realtime” are product names, not models benchmarked here. Treat this as a dated snapshot of ElevenLabs’ positioning.
Related
elevenlabs · text-to-speech · speech-to-text · speech-to-speech-translation · gemini-live-3-5-translate · tts-benchmarks · stt-apis-comparison · synthesis