LiveLingoLiveLingoTry free

OpenAI Live Translation (2026): ChatGPT Voice, gpt-realtime-translate, and Whisper+GPT Compared

OpenAI ships live speech translation across three surfaces as of June 2026: ChatGPT Voice's live translate mode for paid subscribers, the dedicated gpt-realtime-translate model in the Realtime API for developers, and the Whisper + GPT-4o-mini DIY pipeline that remains the flexibility route. This guide describes each surface, the trade-offs between them, what OpenAI's own documentation discloses as limitations, and the independently measured numbers from a published reproducible benchmark.

Conflict of interest

This guide is published by LiveLingo (Lunana Global Inc.), a voice-translation product that competes with ChatGPT Voice's live translate mode and with developer applications built on the OpenAI Realtime API. We have a financial interest in LiveLingo's adoption. Factual claims about OpenAI's products are sourced inline to OpenAI's blog, developer documentation, and ChatGPT Voice consumer page. Measured performance numbers in Section 5 come from LiveLingo's own evaluation of a direct competitor — methodology, raw per-utterance JSON, and the source audio are published at livelingo.io/research/benchmark-2026 for independent verification. Our measurements are explicitly scoped to the raw Realtime API endpoint, not the ChatGPT Voice consumer app — Section 5 documents the distinction.

1. What OpenAI Ships for Live Translation in 2026

Three distinct surfaces are available as of June 2026:

ChatGPT Voice — live translate (consumer). Live translation is built into ChatGPT's Voice mode. A user taps the Voice icon in the ChatGPT app message composer, asks the assistant to translate between languages, and the model continues translating throughout the conversation until told to stop or switch. This requires a paid ChatGPT subscription — Plus, Teams, Enterprise, or Edu (OpenAI consumer pricing page; Plus is ~$20/mo). There is no free-tier live-translate consumer access in our checks as of June 10, 2026. The interface is conversational rather than a dedicated translator UI; there is no source/target language pair selector, no two-column source-and-translated transcript, and no call-dialing.

gpt-realtime-translate (dedicated API model). On May 7, 2026, OpenAI released a purpose-built streaming speech-to-speech translation model inside the Realtime API. According to OpenAI's announcement, the model was "trained on thousands of hours of professional interpreter audio" and is configured to "remain translation-only and wait for enough context before producing speech." It supports 70+ input languages translated into 13 output languages and is priced at $0.034 per minute of input audio (OpenAI API pricing). Documented launch partners named in OpenAI's announcement: Deutsche Telekom (multilingual customer support) and Vimeo (real-time translation of product-education videos).

Whisper + GPT-4o-mini (DIY pipeline). The original developer path remains available. Whisper-large handles speech-to-text (99 languages per OpenAI's speech-to-text guide; $0.006/min audio on OpenAI's API pricing page); GPT-4o-mini handles translation (per-token pricing, same source). Combined, they support arbitrary language pairs — not the 13-output ceiling of gpt-realtime-translate — and give the developer full control over chunking, prompting, glossary handling, and output format. The cost is engineering: Whisper's API does not segment continuous speech into utterance boundaries, so the developer supplies voice-activity detection (VAD), endpoint logic, hallucination filtering, streaming UI, and telephony.

2. ChatGPT Voice — Live Translate Mode (Consumer)

ChatGPT Voice with live translation runs inside the consumer ChatGPT app on iOS, Android, and the web. The user opens a Voice session and gives the assistant a translation instruction such as "translate between English and Japanese." The model then translates each speaker's utterances into the requested target language continuously, across turns, until the user tells it to stop, switch languages, or end the session.

Access requires a paid ChatGPT subscription. The upgraded Voice mode with live translate is available to ChatGPT Plus (~$20/mo, per OpenAI's consumer pricing page), Teams, Enterprise, and Edu users; access is initiated via the Voice icon in the message composer (as documented at chatgpt.com/features/voice and confirmed by Tom's Guide and 9to5Mac's launch coverage). The live-translate feature is not surfaced on the free tier in our checks as of June 10, 2026.

What the interface gives you, and what it does not. The user experience is a conversational Voice session — natural for a one-on-one cross-language exchange or a small in-person conversation. It does not include a dedicated translator UI with a source/target language picker, a two-column source-and-translated transcript pair you can read while listening, a session export, a meeting-memo, or outbound phone-call dialing. The model handles voice activity and turn-taking internally; the user has no explicit control over endpoint timing, glossary, or prompt style.

Underlying model and behavior. ChatGPT Voice's live translate is built on OpenAI's Realtime model family. Launch coverage of the May 7, 2026 release (Tom's Guide, 9to5Mac, Slator) indicates the consumer Voice surface uses the same Realtime infrastructure that hosts gpt-realtime-translate, with consumer-app-layer voice activity detection, conversation state, and UI rendering on top. OpenAI's public model documentation does not describe a separate model card for the consumer Voice translate variant as of June 10, 2026.

3. gpt-realtime-translate — The Dedicated API Model

gpt-realtime-translate is OpenAI's first purpose-built translation model, released on May 7, 2026 inside the Realtime API. It is distinct from the DIY Whisper + GPT-4o-mini route in that the streaming speech-to-speech transformation happens in a single model rather than across two independently-prompted API calls.

Specifications. Per OpenAI's developer cookbook: 70+ input languages auto-detected, 13 output languages. Pricing $0.034 per minute of input audio. Returns translated audio plus text transcripts of both the source speech and the translated output — a transcript surface that the consumer ChatGPT Voice mode does not expose. No speaker attribution and no voice selection. Spoken output cannot be revised after it is emitted.

Training and behavior. OpenAI states the model was "trained on thousands of hours of professional interpreter audio, which helps it remain translation-only and wait for enough context before producing speech." In OpenAI's own evaluation, the model delivered 12.5% lower Word Error Rates than any other model tested on Hindi, Tamil, and Telugu — the documented Indic-language strength of the release.

Translation-mode constraints. According to the OpenAI cookbook, the translation-mode API call is a constrained surface compared to general Realtime API usage. Text input is not supported in translation mode, and tool use and system instructions are disabled — input is audio, output is audio plus transcripts, and the model behaves as a dedicated interpreter rather than a general voice assistant.

4. Whisper + GPT-4o-mini — The DIY Pipeline

The Whisper + GPT-4o-mini route remains available and continues to be the right choice for developers who need behaviors the dedicated translation model does not provide: arbitrary output languages outside the 13-language ceiling, fine-grained prompt and glossary control, custom chunking strategies, or integration with other Realtime API capabilities like tool use.

Specifications. Whisper-large supports 99 input languages for speech-to-text (OpenAI speech-to-text guide) at $0.006 per minute of audio (OpenAI pricing page). GPT-4o-mini handles the translation step with per-token pricing (also on the OpenAI pricing page). The two services are independent network calls; total per-minute cost depends on transcript length but is typically lower than gpt-realtime-translate for English-target use, and higher engineering effort.

What the developer supplies. Production real-time voice translation on top of Whisper + GPT-4o-mini requires the following components, none of which OpenAI ships:

  • Voice activity detection (VAD). Whisper's API surfaces transcription on completed audio chunks but does not segment continuous speech into utterance boundaries; the developer supplies a separate VAD to decide when to send each chunk. Without it, there is no signal for when an utterance ends.
  • Endpoint logic. Decide whether to wait for more audio (lower latency, more revisions) or commit early (higher latency, fewer revisions). The trade-off defines the user experience.
  • Hallucination filtering. Whisper is widely reported to hallucinate English filler text on short clips — common artifacts include "Thanks for watching!" and "Subscribe!", attributed to YouTube content in its training corpus; see the openai/whisper GitHub discussion of hallucinations on short clips. Production deployments require filtering these.
  • Streaming UI primitives. A gated-commit overlay so displayed text does not retract, accumulation of partial chunks, scroll behavior, and the source-vs-translated display.
  • Telephony integration for phone-call use (Twilio, Telnyx, or similar), including bidirectional audio bridging and per-jurisdiction call-recording disclosure compliance.
  • Cost monitoring + rate-limit handling. At sustained usage, per-minute cost can exceed a flat subscription, and per-account rate limits require backoff strategies.

5. How They Perform on Independent Measurement

What we measured (and what we did not)

The numbers below are for the raw gpt-realtime-translate Realtime API endpoint, accessed programmatically via the Python SDK, with the same energy-VAD utterance boundaries applied uniformly to every API-tier system in the LiveLingo benchmark. We did not measure the ChatGPT Voice consumer app separately. ChatGPT Voice is built on the same Realtime infrastructure but the consumer surface adds its own client-side VAD, conversation state, UI rendering, and may apply server-side smoothing we have no programmatic access to. A ChatGPT Voice user may see different perceived latency, lag drift, and code-switching behavior than the API-tier numbers report. Where this section cites specific behaviors (drift, code-switch silence), treat them as the developer-experience floor on the Realtime API endpoint, not the ChatGPT-Voice consumer ceiling. The Whisper + GPT-4o-mini DIY pipeline numbers are similarly API-tier — they reflect what a developer experiences after assembling a naive baseline pipeline, not a hand-tuned production system.

Reproducibility

Every number in this section reproduces from the same three 120-second VOA public-domain audio clips, the same Realtime API endpoint, and the same Python harness used for the original four-system benchmark. The audio (audio.zip), raw per-utterance JSON (openai-realtime-results.json), and methodology are published at livelingo.io/research/benchmark-2026.

gpt-realtime-translate — measured behavior

Fastest first-audio of any system tested. Median 711 ms from start of speech to first translated audio across all 120 evaluated sessions (p10–p90: 485–1,012 ms). For context, Gemini 3.5 Live Translate measured ~2.9 s on the same metric — gpt-realtime-translate is roughly four times faster to first output. Speed is this model's genuine strength.

Comprehension fidelity composite: 4.53 / 5. Scored by two independent frontier LLM judges (GPT-4o, Gemini 2.5 Flash) using the same rubric and judge prompts as the original four-system benchmark, across 120 utterances and four language pairs (en→es, en→zh-CN, en→ja, en→de). This was the lowest score of the six systems measured. Head-to-head against LiveLingo at the cell level: 4 wins, 80 ties, 36 losses. Recurring error classes: extraneous phrases prepended at utterance starts, meaning inversions (e.g. "I was stressed about work" rendered as a wish to be stressed), and proper names replaced with common nouns.

Six-system comparison on the LiveLingo 2026 benchmark — 120 utterances, four language pairs, 2-judge composite (GPT-4o + Gemini 2.5 Flash). The latency column reports a single apples-to-apples metric: time from speaker end-of-utterance to translation arrival (committed transcript for streaming-text systems; spoken-translation arrival for audio-to-audio systems). The separate "speed to first translated output" metric — different baseline, starts at speech onset — is in the column to its right for audio-to-audio systems only, since transcript systems do not emit interim translations. Raw data: livelingo.io/research/benchmark-2026.
SystemComprehension (0–5)Utterance-end → translation arrivalSpeed to first output (audio-to-audio only)Output surface
LiveLingo4.961,518 msStreaming text + audio
Gemini 3.5 Live Translate4.93~3,100 ms (drifts up to 13.9 s)2,947 msAudio (text sidecar)
Google Cloud STT v2 + Translate v34.77~26,736 msTranscript
Azure Speech Translation4.65~4,755 msTranscript
Whisper + GPT-4o-mini (DIY)4.632,720 msTranscript
OpenAI gpt-realtime-translate4.53~3,800 ms (drifts up to 20.3 s)711 msAudio + transcript

Lag drift on continuous speech. Speed-to-first-output is excellent, but on extended audio the translated voice falls progressively behind the speaker as untranslated backlog accumulates. Measuring from each source-utterance end to the arrival of the translated speech for that utterance: median 3.8 s, drifting as far as 20.3 s behind on the dense pt→en VOA clip. This is the trade-off the audio-to-audio architecture creates — speech output is naturally bounded by the speaking rate of the synthesized voice, so the model cannot "catch up" faster than human pace.

Code-switched speech failure. Per OpenAI's developer documentation, the model may skip speech that is already in the output language. On the zh→en VOA clip in the LiveLingo benchmark, this surfaced as silence at the 86-second mark, when the source switched into English speech — the model went silent and did not pass the English content through to the translated output. Gemini 3.5 Live Translate exhibits the same gap on the same clip; this is a class issue for audio-to-audio dedicated translation models (see callout below). Pipelines that surface a streaming text transcript can pass code-switched content through to the displayed transcript instead of dropping it.

Output surfaces. Translated audio plus text transcripts of both source and output — closer to a transcript-first product surface than Gemini 3.5 Live Translate's audio-only API. No speaker attribution. No voice selection. Spoken output cannot be revised after it is emitted.

Audio-to-audio is a class with shared limitations

The behaviors in this section are not unique to gpt-realtime-translate. Google's Gemini 3.5 Live Translate, and any other current speech-to-speech audio-to-audio translation model, inherits the same class of trade-offs: (1) output-pace lag drift on continuous speech, because translated audio is bounded by speaking rate and cannot catch up faster than human pace; (2) code-switch silence, because the model is configured to skip speech already in the output language; (3) no in-line speaker attribution in the synthesized audio; (4) irreversible mid-utterance commits, because spoken audio cannot be retracted the way displayed text can. Systems that surface a streaming text transcript — including OpenAI's DIY Whisper + GPT-4o-mini route and streaming-transcript translation products like LiveLingo — avoid (2), (3), and (4) at the cost of either two-model latency overhead or a different output modality. Treat this as a category insight, not a critique of one model.

Whisper + GPT-4o-mini DIY pipeline — measured behavior

On the same three 120-second VOA clips, a naive baseline Whisper-large + GPT-4o-mini pipeline measured a median Final Transcript Latency of 2,720 ms (95% CI 1,880–3,396, n=28), and emitted ≈22 Normalized Erasures per 120-second clip (token revisions across partial chunks). Comprehension fidelity composite was 4.63 / 5 across the same four language pairs.

Notably: the DIY pipeline scored higher comprehension than the dedicated gpt-realtime-translate model (4.63 vs 4.53). The dedicated model is faster to first output and easier to integrate, but on this benchmark the older two-model pipeline reads source meaning slightly more accurately. The differences are within ~0.10 on a 5-point scale and reflect different design priorities — speed and operational simplicity for the dedicated model, transcript-accuracy and prompt control for the pipeline.

6. What OpenAI's Own Documentation Discloses

Statements drawn directly from OpenAI's May 7, 2026 announcement and developer documentation:

  • Training corpus. "Trained on thousands of hours of professional interpreter audio, which helps it remain translation-only and wait for enough context before producing speech." (Source: OpenAI announcement.)
  • Language coverage. 70+ input languages into 13 output languages. (Source: OpenAI Cookbook.)
  • Indic-language strength. "12.5% lower Word Error Rates than any other model tested" on Hindi, Tamil, and Telugu in OpenAI's own evaluation. (Source: OpenAI announcement.)
  • Code-switching behavior. OpenAI's documentation states the model may skip speech already in the output language — a design choice that produces silence on code-switched audio.
  • Mode constraints. In translation mode, text input is not supported and tool use plus system instructions are disabled. The translation-mode call is a constrained surface compared to the general Realtime API.
  • Output format (developer). Audio is sent and received in raw PCM with chunked streaming. Refer to the Realtime API guide for the exact format and chunk-size guidance.
  • Pricing. $0.034 per minute of input audio for gpt-realtime-translate. $0.006 per minute audio for Whisper. GPT-4o-mini per-token. ChatGPT Plus is approximately $20/mo and is the minimum paid tier for ChatGPT Voice live translate access.
  • Documented launch users. Deutsche Telekom (multilingual customer support) and Vimeo (real-time translation of product education videos). (Source: OpenAI announcement.)

7. When to Choose Which Surface — and When Another Tool Fits

Choose ChatGPT Voice live translate if

  • You already pay for ChatGPT Plus (or Teams, Enterprise, Edu) and don't want to add another subscription.
  • Your use case is a one-on-one or small in-person conversation rather than a multi-party meeting that needs displayed transcripts.
  • You accept a conversational-mode interface rather than a dedicated translator UI with source/target language pickers and a saved transcript.
  • You are comfortable with the model handling voice activity and turn-taking internally, without explicit user control.

Choose gpt-realtime-translate (Realtime API) if

  • You are building a developer application where time-to-first-translated-audio matters more than comprehension margin.
  • Your output language list fits inside 13 languages.
  • You serve Indic-language audiences (Hindi, Tamil, Telugu) where OpenAI's own evaluation reports 12.5% WER reduction over alternatives.
  • You can build the consumer-facing layer (UI, telephony, error handling, code-switch fallbacks) on top of OpenAI's API.
  • You accept the speed-vs-comprehension trade-off (4.53/5 comprehension vs 4.63 for the DIY pipeline on the same benchmark) in exchange for one API call instead of two.

Choose Whisper + GPT-4o-mini DIY if

  • You need arbitrary output languages outside the 13-language ceiling.
  • You need full prompt and glossary control for specialized vocabulary or style constraints.
  • You have engineering capacity for VAD, endpoint detection, hallucination filtering, streaming UI, and telephony.
  • You want lower per-minute audio cost ($0.006 Whisper) and can accept per-token GPT-4o-mini pricing.
  • You want to integrate translation with the broader Realtime API capability surface (tool use, system instructions) that the dedicated translation mode does not expose.

Where a different tool may fit better

OpenAI's three surfaces cover most live-translation use cases, but each lives inside a specific shape: ChatGPT Voice is a chatbot with translation, gpt-realtime-translate is a developer API, and Whisper + GPT-4o-mini is a set of building blocks. A dedicated translator-app surface — with streaming text + audio output you can read while listening, per-speaker attribution, gated-commit displayed transcripts that never retract, translated outbound phone calls, and a free tier outside a subscription gate — is a different product category. LiveLingo (publishing this guide) sits there. Honest trade-off: LiveLingo's audio output runs through the host platform's default text-to-speech engine, so the spoken voice is less expressive than gpt-realtime-translate's; ChatGPT Voice's conversational interface can feel more natural than a dedicated translator UI for casual back-and-forth. Side-by-side specs: /compare/chatgpt-translation. Benchmark numbers: /research/benchmark-2026.

8. Frequently Asked Questions

What live translation does OpenAI offer in 2026?

OpenAI ships live translation across three surfaces as of mid-2026. ChatGPT Voice includes a live translate mode for paid subscribers (Plus, Teams, Enterprise, Edu). gpt-realtime-translate is a dedicated streaming speech-to-speech translation model in the Realtime API, released May 7, 2026, priced at $0.034 per minute of input audio with 70+ input languages and 13 output languages. A DIY pipeline of Whisper-large (speech-to-text) and GPT-4o-mini (translation) remains available for developers who want arbitrary language pairs and full control of the stack.

How does ChatGPT Voice live translate work?

Tap the Voice icon in the ChatGPT app message composer, then ask the assistant to translate — e.g. "translate between English and Japanese." The model keeps translating across turns until told to stop or switch languages. Available to paid ChatGPT subscribers (Plus ~$20/mo, Teams, Enterprise, or Edu). It is a conversational voice surface, not a dedicated translator UI with source/target language selectors, source-and-translated transcript pairs, or call-dialing.

What is gpt-realtime-translate?

OpenAI's dedicated streaming speech-to-speech translation model in the Realtime API, released on May 7, 2026. Trained on thousands of hours of professional interpreter audio. 70+ input languages → 13 output languages. Priced at $0.034 per minute of input audio. Returns translated audio plus text transcripts of both source and output. Documented enterprise users at launch include Deutsche Telekom and Vimeo.

Can you still build a live translator with Whisper and GPT-4o-mini?

Yes. The DIY pipeline (Whisper-large $0.006/min audio, 99 source languages; GPT-4o-mini per-token) remains the most flexible OpenAI route — it supports arbitrary language pairs and gives full control over chunking, prompting, and output format. The trade-off is engineering cost: Whisper has no native sentence-boundary detection, so the developer must build VAD, endpoint logic, hallucination filtering, streaming UI, and telephony.

What are gpt-realtime-translate's measured latency and comprehension?

In the LiveLingo Research benchmark addendum (June 10, 2026), gpt-realtime-translate had the fastest first-audio latency of any system tested — median 711 ms from start of speech to first translated audio. Comprehension fidelity composite was 4.53 / 5, the lowest of the six systems measured. On continuous speech, translated voice fell behind the speaker — median 3.8 s, drifting up to 20.3 s on dense audio. Recurring errors: extraneous insertions, meaning inversions, proper-name substitutions. Source: livelingo.io/research/benchmark-2026.

Do these numbers reflect the ChatGPT Voice user experience?

No. The measured numbers are for the raw gpt-realtime-translate Realtime API call. ChatGPT Voice is built on the same Realtime infrastructure but the consumer app adds its own client-side VAD, conversation state, UI rendering, and may apply server-side smoothing not measured separately. A ChatGPT Voice user may see different perceived latency, lag drift, and code-switching behavior than the API-tier numbers report. Treat the published benchmark as the developer-experience floor on the Realtime API endpoint, not the ChatGPT-Voice user ceiling.

How does OpenAI handle code-switching?

Per OpenAI's developer documentation, gpt-realtime-translate may skip speech already in the output language. In the LiveLingo benchmark this surfaced as silence on the zh→en VOA clip at the 86-second mark when the source switched into English. Gemini 3.5 Live Translate exhibits the same gap on the same clip. Streaming text-transcript systems that pass target-language speech through to the displayed transcript do not have this gap.

When should you choose which OpenAI surface?

ChatGPT Voice live translate if you already pay for ChatGPT Plus or higher and accept a conversational interface. gpt-realtime-translate if you build a developer application where speed-to-first-audio matters more than displayed-text stability, your output language list fits inside 13, and you can build the consumer surface on top. Whisper + GPT-4o-mini DIY if you need arbitrary output languages, full prompt and glossary control, lower per-minute cost, and engineering capacity to build VAD, endpoint detection, hallucination filtering, streaming UI, and telephony.

9. Sources

Pricing, availability, launch users, and consumer-tier access details verified against the primary sources above on June 10, 2026. OpenAI may change tiers, pricing, language coverage, and model behavior; consult the linked sources for current state before relying on any specific number.

OpenAI Live Translation (2026): ChatGPT Voice, gpt-realtime-translate, and Whisper+GPT Compared | LiveLingo