Software Application source ↗ source url updated Sun Jun 14 2026 00:00:00 GMT+0000 (Coordinated Universal Time)

GPT-Realtime-2 (OpenAI speech-to-speech)

OpenAI’s voice model for the Realtime API — billed as its “first voice model with GPT-5-class reasoning” (knowledge cutoff Sep 30 2024). It’s a speech-to-speech model: you speak, it reasons, it speaks back, over a WebRTC transport. A third instance of the spoke’s real-time conversational thread, after elevenlabs-expressive-mode (dialogue) and gemini-live-3-5-translate (translation). Sourced from Simon Willison’s demo write-up (a personal blog — tier T4; the primary is OpenAI’s own release). URL-only ingest.

What it is

One model, audio in → audio out. Rather than the ElevenLabs-style STT→LLM→TTS pipeline of separate parts, the conversational loop is collapsed into a single end-to-end voice model — the same shape as gemini-live-3-5-translate, here aimed at general dialogue, not translation.
GPT-5-class reasoning in the voice model. The pitch is that the reasoning of the text frontier now rides inside the speech model — not a thin TTS voice bolted onto a chatbot.
Document context. You can paste a text document into the session, then explore it by voice — the model answers questions grounded in the pasted text. A retrieval/grounding capability added to the spoken loop (the voice analog of feeding context to a chat model).
Transport: WebRTC, for the browser-to-model real-time audio path.

Why it matters here

The conversational composite gets a frontier-lab entrant. ElevenLabs sells the assembled loop; OpenAI ships a single speech-to-speech model with frontier reasoning — strengthening the closed- frontier read across the spoke (OpenAI joins Google and ElevenLabs at the real-time-voice frontier).
“The LLM eats audio from both ends,” taken to its limit. A speech-to-speech model carrying full text-model reasoning is the cleanest case yet of audio folded into one language model — both ends in one network, now with the reasoning, not just the tokens.
Grounding enters the voice loop. “Document context” adds a what it can talk about accurately axis to conversational voice, beyond latency/affect — closer to retrieval-grounded dialogue than open-ended chat.

Caveat

Secondhand source (a practitioner demo, T4); no published latency, audio-quality, or WER numbers. Model naming and availability are in flux (the post notes GPT-Realtime-2 still hadn’t reached ChatGPT’s iPhone app). Treat capability claims as a dated snapshot.

Cross-spoke

GPT-Realtime-2’s “GPT-5-class reasoning” ties it to ../llm-providers-wiki (the OpenAI text-model frontier it inherits from) and to research-wiki’s retrieval-augmented-generation for the document- grounding angle — noted as adjacency, not paged here; this page keeps the speech-model substance.

elevenlabs-expressive-mode · gemini-live-3-5-translate · speech-to-speech-translation · text-to-speech · speech-to-text · synthesis

GPT-Realtime-2 (OpenAI speech-to-speech)

What it is

Why it matters here

Caveat

Cross-spoke

Related

Linked from