GPT-Realtime-2 (OpenAI speech-to-speech)
OpenAI’s voice model for the Realtime API — billed as its “first voice model with GPT-5-class reasoning” (knowledge cutoff Sep 30 2024). It’s a speech-to-speech model: you speak, it reasons, it speaks back, over a WebRTC transport. A third instance of the spoke’s real-time conversational thread, after elevenlabs-expressive-mode (dialogue) and gemini-live-3-5-translate (translation). Sourced from Simon Willison’s demo write-up (a personal blog — tier T4; the primary is OpenAI’s own release). URL-only ingest.
What it is
- One model, audio in → audio out. Rather than the ElevenLabs-style STT→LLM→TTS pipeline of separate parts, the conversational loop is collapsed into a single end-to-end voice model — the same shape as gemini-live-3-5-translate, here aimed at general dialogue, not translation.
- GPT-5-class reasoning in the voice model. The pitch is that the reasoning of the text frontier now rides inside the speech model — not a thin TTS voice bolted onto a chatbot.
- Document context. You can paste a text document into the session, then explore it by voice — the model answers questions grounded in the pasted text. A retrieval/grounding capability added to the spoken loop (the voice analog of feeding context to a chat model).
- Transport: WebRTC, for the browser-to-model real-time audio path.
Why it matters here
- The conversational composite gets a frontier-lab entrant. ElevenLabs sells the assembled loop; OpenAI ships a single speech-to-speech model with frontier reasoning — strengthening the closed- frontier read across the spoke (OpenAI joins Google and ElevenLabs at the real-time-voice frontier).
- “The LLM eats audio from both ends,” taken to its limit. A speech-to-speech model carrying full text-model reasoning is the cleanest case yet of audio folded into one language model — both ends in one network, now with the reasoning, not just the tokens.
- Grounding enters the voice loop. “Document context” adds a what it can talk about accurately axis to conversational voice, beyond latency/affect — closer to retrieval-grounded dialogue than open-ended chat.
Caveat
Secondhand source (a practitioner demo, T4); no published latency, audio-quality, or WER numbers. Model naming and availability are in flux (the post notes GPT-Realtime-2 still hadn’t reached ChatGPT’s iPhone app). Treat capability claims as a dated snapshot.
Cross-spoke
GPT-Realtime-2’s “GPT-5-class reasoning” ties it to ../llm-providers-wiki (the OpenAI text-model
frontier it inherits from) and to research-wiki’s retrieval-augmented-generation for the document-
grounding angle — noted as adjacency, not paged here; this page keeps the speech-model substance.
Related
elevenlabs-expressive-mode · gemini-live-3-5-translate · speech-to-speech-translation · text-to-speech · speech-to-text · synthesis