Speech-to-speech translation (S2ST)
Speech-to-speech translation is the applied speech-AI task of turning spoken input in one language into spoken output in another. It composes the spoke’s two recognition and synthesis branches (speech-to-text and text-to-speech), with machine translation between them, into a single pipeline that is increasingly end-to-end and streaming. It is a composite, applied task that sits across the three primitive branches of speech-audio-ai rather than beside them.
What makes it hard (the axes it adds)
- Simultaneity vs. quality. Streaming (simultaneous) interpretation must start translating before the sentence finishes, trading latency against the context needed for an accurate rendering. This sharpens the spoke’s existing read of latency as a product axis into a live trade-off.
- Voice / prosody preservation. Carrying the original speaker’s intonation, pacing, and pitch (and ideally identity) into the target language is a voice-cloning-adjacent capability, with the same consent and impersonation considerations.
- End-to-end vs. cascaded. Classic systems cascade STT→MT→TTS, so each stage’s errors compound and its latency stacks on the previous one; newer models translate end-to-end over streamed audio, following the “speech as language modeling” pattern.
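The first and third axes can be made concrete with a toy sketch. Everything in it is an illustrative assumption, not a description of any real system: the `Stage` type, the per-stage latency and accuracy numbers, and the simplification that stage errors are independent are all hypothetical. The streaming part uses the wait-k policy, one well-known simultaneous-translation schedule chosen here only as an example: output token i (0-based) is emitted once min(k + i, source length) source tokens have been read.

```python
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class Stage:
    """One stage of a cascaded pipeline (hypothetical numbers only)."""
    name: str
    latency_ms: float  # assumed time this stage adds per utterance
    accuracy: float    # assumed probability the stage output is correct


def cascade_latency_ms(stages):
    # Latency stacks: each stage waits for the previous one to finish.
    return sum(s.latency_ms for s in stages)


def cascade_accuracy(stages):
    # Errors compound: assuming independent errors, the whole cascade is
    # correct only if every stage is correct.
    return prod(s.accuracy for s in stages)


def wait_k_schedule(num_source, num_target, k):
    """Source tokens read before emitting each target token under wait-k.

    Smaller k means lower latency but less source context per output token.
    """
    return [min(k + i, num_source) for i in range(num_target)]


if __name__ == "__main__":
    stages = [
        Stage("stt", 300.0, 0.95),
        Stage("mt", 150.0, 0.95),
        Stage("tts", 250.0, 0.98),
    ]
    print(cascade_latency_ms(stages))          # total added latency in ms
    print(round(cascade_accuracy(stages), 4))  # end-to-end accuracy
    print(wait_k_schedule(6, 6, 3))            # context available per token
    print(wait_k_schedule(6, 6, 1))
```

With these made-up numbers, three individually strong stages still yield a noticeably weaker cascade, and lowering k from 3 to 1 lets output start two source tokens earlier at the cost of two fewer tokens of context for each early output token. An end-to-end model avoids the stage boundaries but still faces the same latency-versus-context choice.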
Where it sits
The 2026 exemplar is gemini-live-3-5-translate (Google; 70+ languages, near-real-time, voice-preserving, SynthID-watermarked), a closed frontier model that fuses STT and TTS in a single model. It is the purest case of the speech-audio-ai thesis that the LLM is eating audio from both ends: here, both ends sit in one model.
Related
speech-audio-ai · speech-to-text · text-to-speech · gemini-live-3-5-translate · voice-cloning · gemini