
1. What Gemini 3.5 Live Translate Is
Gemini 3.5 Live Translate is a streaming speech-to-speech translation model that Google announced on June 9, 2026. Two characteristics set it apart from earlier translation products.
First, it is audio-to-audio rather than the older chained pipeline of speech-to-text, text translation, and text-to-speech. The model accepts streamed source audio in 100-millisecond chunks and produces translated speech as output. Text transcripts are available, but only as a sidecar of the spoken output: there is no streaming text mode and no speaker attribution in the translated audio.
Second, the generated voice is designed to preserve speaker prosody. Google's announcement describes output that retains the speaker's intonation, pacing, and pitch. In practice this produces a translated voice that sounds substantially more natural than a generic text-to-speech engine reading a translation aloud — a real advantage over speech-translation systems whose audio output runs through a standard TTS layer.
The model is built on Gemini 3 Pro. According to the Gemini 3.5 Audio model card published by Google DeepMind, it accepts audio input with up to a 128K-token context window and produces audio + text output up to 64K tokens. It auto-detects over 70 languages, including rapid language switches between speakers, though that detection has documented weaknesses (covered in Section 4).
The launch covers three product surfaces in parallel: developer access via the Gemini Live API and Google AI Studio (public preview from June 9, 2026); consumer access through the Google Translate app on Android and iOS, rolling out globally starting that day, with a new "listening mode" on Android; and enterprise access through Google Meet in private preview for select Google Workspace customers, where it expands Meet's translation coverage from 5 languages to 70+ and supports over 2,000 source/target combinations within a single meeting (per Google's launch announcement).
2. How It Works: Audio-to-Audio Architecture and Prosody Preservation
Three architectural choices distinguish Gemini 3.5 Live Translate from earlier streaming-translation systems.
Speech-to-speech, not speech-to-text-to-speech
Traditional pipelines run audio through a streaming speech-to-text model, feed the transcript to a machine-translation model, then synthesize the translation through a separate text-to-speech model. Each stage adds latency and passes its errors on to the next. Gemini 3.5 Live Translate folds these steps into one audio model. The trade-off: the output is permanent audio, not editable text. Once a word is spoken, it cannot be revised mid-utterance.
Continuous streaming, not turn-based
Google's announcement frames the model as one that "balances the trade-off between waiting for context to improve quality and translating immediately to stay in sync with the speaker." Earlier consumer products like Google Translate's previous Conversation mode were turn-based: tap, speak, wait for the system to finalize and emit the translation, then let the other party tap. Gemini 3.5 Live Translate emits translated speech continuously while the source speaker is still talking, with Google describing a lag of "a few seconds."
Prosody transfer
The model is designed to carry the source speaker's vocal characteristics — intonation, pacing, emphasis, pitch — into the translated audio. This is the main technical reason the output sounds natural rather than robotic. It is also the source of the voice-consistency limitations Google's model card discloses (Section 4).
On the developer surface, each session uses raw 16-bit PCM audio at 16 kHz mono as input and produces 24 kHz mono PCM audio as output, sent in 100-millisecond chunks (per MarkTechPost's launch coverage of Google's developer documentation; see Google AI Studio for the canonical reference). All generated audio carries Google's SynthID watermark — an imperceptible signature woven into the waveform that allows downstream systems to identify the audio as machine-generated.
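The input format above implies a fixed chunk size: 16,000 samples per second × 0.1 s × 2 bytes per sample = 3,200 bytes per 100 ms chunk. The following is a minimal sketch of slicing a raw PCM buffer into chunks of that size. The function name `iter_pcm_chunks` and the choice to zero-pad the final partial chunk are assumptions made for illustration; they are not taken from Google's documentation, and no network call is made.

```python
from typing import Iterator

SAMPLE_RATE_HZ = 16_000   # input sample rate stated in the text
BYTES_PER_SAMPLE = 2      # 16-bit PCM
CHANNELS = 1              # mono
CHUNK_MS = 100            # chunk duration stated in the text

# 16,000 samples/s * 0.1 s * 2 bytes * 1 channel = 3,200 bytes
CHUNK_BYTES = SAMPLE_RATE_HZ * CHUNK_MS // 1000 * BYTES_PER_SAMPLE * CHANNELS


def iter_pcm_chunks(pcm: bytes, pad_last: bool = True) -> Iterator[bytes]:
    """Yield 100 ms chunks from a raw 16-bit, 16 kHz, mono PCM buffer."""
    for start in range(0, len(pcm), CHUNK_BYTES):
        chunk = pcm[start:start + CHUNK_BYTES]
        if len(chunk) < CHUNK_BYTES and pad_last:
            # Assumption: pad a trailing partial chunk with silence (zero samples).
            chunk += b"\x00" * (CHUNK_BYTES - len(chunk))
        yield chunk


one_second_of_silence = b"\x00" * (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)
chunks = list(iter_pcm_chunks(one_second_of_silence))
print(len(chunks), len(chunks[0]))  # prints "10 3200"
```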

3. Where Gemini 3.5 Live Translate Is Strongest
Five product strengths show up immediately when comparing Gemini 3.5 Live Translate to its peers.
Natural-sounding translated speech. The prosody-preserving voice is the clearest advantage over speech-translation systems whose audio output goes through a generic TTS engine. If you have used a voice-translation app whose translated audio sounds like a flat narrator reading a string of words, the contrast is immediate. Gemini 3.5 Live Translate is materially better here, and the difference is audible on the first sentence.
Audio-to-audio simplicity. Building a speech-translation application has traditionally meant chaining a streaming STT model (Whisper-large, Google Cloud Speech-to-Text, Azure Speech), a translation model, and a TTS engine — and managing the partial-emit semantics of each. Gemini 3.5 Live Translate replaces that chain with one API call, simplifying both the application code and the failure surface.
Auto language detection at scale. The model auto-detects 70+ languages, so users do not need to set a language pair in advance. Google's positioning emphasizes use cases like multi-party meetings where speakers switch languages mid-conversation.
Distribution. The model is built directly into the Google Translate consumer app and Google Meet. For end users, the install and discovery cost is near zero because they already have the app. For Meet customers, translation arrives as a feature toggle inside a workflow that is already in use.
Watermarked output. SynthID watermarking makes the generated speech identifiable as AI-generated for downstream compliance use cases, which is useful in regulated industries that need to track AI-generated content.
4. What Google's Own Model Card Admits as Limitations
The Gemini 3.5 Audio model card published by Google DeepMind documents specific known limitations of Gemini 3.5 Live Translate. Quoting the card directly:
Language detection
"Language detection can struggle with non-native accents, similar languages, or rapid language switches." Practical implication: if a speaker has a strong accent, or the source language is close to a related language (Portuguese vs. Spanish, Norwegian vs. Swedish), or the conversation switches languages quickly, the detector may pick the wrong source language and translate accordingly.
Voice consistency in multi-speaker sessions
"Voices can be inconsistent, and voices may shift after long pauses, change gender, or get stuck on one voice during rapid multi-speaker sessions." This is the most practically significant limitation for many use cases. In a meeting with several speakers taking rapid turns, the model may produce all translated output in one voice — losing the speaker attribution that listeners rely on to follow the conversation.
Noise filtering
"Designed to filter out background noise, but not all background audio may be ignored." Real-world environments will still leak through under some conditions.
Translation-mode constraints (developer API)
Per MarkTechPost's launch coverage, which cites Google's developer documentation: "text input is not supported in translation mode" and the model "drops tool use and system instructions in this mode." For developers, the translation API call is a constrained surface — you cannot send text, you cannot use the broader Gemini tool ecosystem, and you cannot inject system prompts. Translation in, translation out.
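An application can check these constraints before it opens a session. The sketch below encodes the three restrictions quoted above as a pre-flight check. `TranslationRequest` and `translation_mode_problems` are hypothetical names invented for this illustration; they are not types or functions from Google's SDK, and the real API may report these conditions differently (for example, by silently ignoring the fields).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TranslationRequest:
    """Hypothetical request shape for illustration; not a Google SDK type."""
    audio_chunks: List[bytes] = field(default_factory=list)
    text_input: Optional[str] = None
    tools: List[str] = field(default_factory=list)
    system_instruction: Optional[str] = None


def translation_mode_problems(req: TranslationRequest) -> List[str]:
    """List the ways a request conflicts with the translation-mode constraints."""
    problems = []
    if req.text_input is not None:
        problems.append("text input is not supported in translation mode")
    if req.tools:
        problems.append("tool use is dropped in translation mode")
    if req.system_instruction is not None:
        problems.append("system instructions are dropped in translation mode")
    if not req.audio_chunks:
        problems.append("no audio supplied; translation mode is audio-in only")
    return problems


ok = TranslationRequest(audio_chunks=[b"\x00" * 3200])
bad = TranslationRequest(text_input="hello", tools=["search"],
                         system_instruction="Be formal.")
print(translation_mode_problems(ok))
print(translation_mode_problems(bad))
```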
5. Independent Measurements From the LiveLingo 2026 Benchmark
What we measured (and what we did not)
The numbers below are for the raw Gemini Live API endpoint, accessed programmatically with the same energy-VAD utterance boundaries applied uniformly to every API-tier system in the LiveLingo benchmark. We did not measure the Google Translate consumer app or the Google Meet integration separately. Both are built on the same Gemini 3.5 Live Translate model, but the consumer and Meet surfaces add their own client-side VAD, conversation state, and UI rendering, and may apply server-side smoothing we have no programmatic access to. A Google Translate user or a Meet participant may therefore see different perceived latency, code-switching behavior, and voice consistency than the API-tier numbers report. Where this section cites specific behaviors (multi-speaker drift, code-switch silence), treat them as the developer-experience floor on the Live API endpoint, not the consumer ceiling.
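For readers unfamiliar with energy-based voice activity detection, here is a generic sketch of the technique: split the audio into short frames, compute each frame's RMS energy, and close an utterance once enough consecutive frames fall below a threshold. This is not the benchmark harness. The 20 ms frame size, the threshold of 500, and the 10-frame silence window are placeholder assumptions chosen for illustration, and the input is assumed to be little-endian 16-bit mono PCM at 16 kHz.

```python
import math
import struct
from typing import List, Sequence, Tuple

SAMPLE_RATE_HZ = 16_000
FRAME_MS = 20                                      # assumed frame length
FRAME_SAMPLES = SAMPLE_RATE_HZ * FRAME_MS // 1000  # 320 samples per frame


def decode_pcm16le(pcm: bytes) -> Tuple[int, ...]:
    """Decode little-endian 16-bit PCM bytes into integer samples."""
    n = len(pcm) // 2
    return struct.unpack("<%dh" % n, pcm[: n * 2])


def frame_rms(frame: Sequence[int]) -> float:
    """Root-mean-square amplitude of one frame."""
    return math.sqrt(sum(s * s for s in frame) / len(frame)) if frame else 0.0


def energy_vad(pcm: bytes, threshold: float = 500.0,
               min_silence_frames: int = 10) -> List[Tuple[int, int]]:
    """Return (start_ms, end_ms) spans whose frame energy exceeds the threshold."""
    samples = decode_pcm16le(pcm)
    n_frames = len(samples) // FRAME_SAMPLES
    spans: List[Tuple[int, int]] = []
    start = None   # index of the first voiced frame of the open utterance
    silence = 0    # consecutive silent frames seen inside the open utterance
    for i in range(n_frames):
        frame = samples[i * FRAME_SAMPLES:(i + 1) * FRAME_SAMPLES]
        if frame_rms(frame) >= threshold:
            if start is None:
                start = i
            silence = 0
        elif start is not None:
            silence += 1
            if silence >= min_silence_frames:
                end = i - silence + 1  # index of the first silent frame
                spans.append((start * FRAME_MS, end * FRAME_MS))
                start, silence = None, 0
    if start is not None:
        # Audio ended while an utterance was still open.
        spans.append((start * FRAME_MS, (n_frames - silence) * FRAME_MS))
    return spans


# Synthetic test signal: 200 ms silence, 300 ms loud signal, 400 ms silence.
demo = struct.pack("<14400h", *([0] * 3200 + [10000] * 4800 + [0] * 6400))
print(energy_vad(demo))  # prints [(200, 500)]
```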
Reproducibility
Every number in this section reproduces from the same three 120-second VOA public-domain audio clips, the same Gemini Live API endpoint, and the same Python harness used for the original four-system benchmark. The audio (audio.zip), raw per-utterance JSON (gemini-live-results.json), and methodology are published at livelingo.io/research/benchmark-2026.
Conflict caveat: LiveLingo Research evaluated a direct commercial competitor on the day Google released it. We have a financial interest in the comparison's framing. Treat this section as one data point alongside Google's own announcement and third-party launch coverage; do not treat it as the definitive third-party benchmark.
With those caveats in mind: LiveLingo Research evaluated Gemini 3.5 Live Translate on its launch day (June 9, 2026) using the same protocol as the original benchmark of Google Cloud STT v2 + Translation v3, Azure Speech Translation, and Whisper-large + GPT-4o-mini. The full addendum (including the later June 10, 2026 OpenAI gpt-realtime-translate addendum) is published at livelingo.io/research/benchmark-2026#comprehension-gemini-live; the headline numbers are below.
Comprehension fidelity composite: 4.93 / 5 across 120 utterances and four language pairs (en→es, en→zh-CN, en→ja, en→de). This is the highest score among the systems LiveLingo was compared against, and second overall behind LiveLingo's own 4.96 (see the table below).
First-audio latency: median 2,947 ms from start of speech to first translated audio (p10–p90: 2,859–3,104 ms). This is a constant ~3-second speaking delay, consistent with Google's "a few seconds behind" framing.
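A median with a p10–p90 range can be produced from per-utterance latencies along the following lines. This is a sketch only: the function name `summarize_latency` is hypothetical, the "inclusive" interpolation method is an assumption (the text above does not say which percentile method the benchmark used), and the input values are synthetic placeholders, not benchmark data.

```python
import statistics
from typing import Dict, Sequence


def summarize_latency(samples_ms: Sequence[float]) -> Dict[str, float]:
    """Return the median and the 10th/90th percentiles of latency samples."""
    # quantiles(n=10) returns the 9 decile cut points: index 0 is p10, index 8 is p90.
    deciles = statistics.quantiles(samples_ms, n=10, method="inclusive")
    return {
        "median_ms": float(statistics.median(samples_ms)),
        "p10_ms": float(deciles[0]),
        "p90_ms": float(deciles[8]),
    }


# Synthetic placeholder input (0..100 ms), NOT measurements from the benchmark.
synthetic = list(range(101))
print(summarize_latency(synthetic))
```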
| System | Comprehension (0–5) | Utterance-end → translation arrival | Speed to first output (audio-to-audio only) | Output surface |
|---|---|---|---|---|
| LiveLingo | 4.96 | 1,518 ms | — | Streaming text + audio |
| Gemini 3.5 Live Translate | 4.93 | ~3,100 ms (drifts up to 13.9 s) | 2,947 ms | Audio (text sidecar) |
| Google Cloud STT v2 + Translate v3 | 4.77 | ~26,736 ms | — | Transcript |
| Azure Speech Translation | 4.65 | ~4,755 ms | — | Transcript |
| Whisper + GPT-4o-mini (DIY) | 4.63 | 2,720 ms | — | Transcript |
| OpenAI gpt-realtime-translate | 4.53 | ~3,800 ms (drifts up to 20.3 s) | 711 ms | Audio + transcript |
Output is translated speech only. The API has no streaming text mode and no per-speaker attribution. Text transcripts are available as a sidecar to the spoken output. Spoken output cannot be revised after it is emitted.
Code-switched audio. On a Mandarin news clip that switches to English street interviews at 86 seconds, the LiveLingo benchmark recorded that translation output stops at the switch in every run: speech already in the output language is neither translated nor transcribed, so the final 34 seconds of content (~28% of the clip) silently disappear for the listener with no error surfaced. OpenAI's gpt-realtime-translate shows the same behavior on the same clip, and OpenAI documents skipping output-language speech as intended; this is a structural limit of current speech-to-speech translators on mixed-language audio.
Factual inversion on late-resolving syntax. On a Mandarin business-speech clip, a sentence describing a 15% sales increase rendered in English as a goal to increase sales by 15%. This is the error class that irreversible mid-sentence audio commitment produces when the source language postpones the meaning-carrying element (the polarity, the time reference, the subject) until late in the sentence.
These are independent measurements, not Google's own numbers; methodology and raw per-utterance data are in the published addendum.
6. How to Access Gemini 3.5 Live Translate
Consumer — Google Translate app
Update the Google Translate app to its latest version on Android or iOS. Live Translate mode is rolling out globally starting June 9, 2026 — availability depends on the store rollout schedule in your region. On Android, a new "listening mode" lets you hear translated speech directly through your device's earpiece.
Developer — Gemini Live API + Google AI Studio
The model is available in public preview through the Gemini Live API and through Google AI Studio. Per the launch coverage, the integration constraints are specific: audio input only (no text input in translation mode), no tool use or system instructions, raw 16-bit PCM 16 kHz mono input chunked at 100 ms, 24 kHz PCM output. Current quotas and pricing are on Google's Gemini API pricing page; Google AI Studio is the developer console for testing and key management.
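Because the output arrives as raw PCM rather than a playable file, a client usually needs to wrap it in a container before saving or playing it. The sketch below collects output chunks into an in-memory WAV file with Python's standard `wave` module. The 24 kHz mono format comes from the text above; the 16-bit sample width for output is an assumption (the coverage cited here states the bit depth only for input), and `pcm_chunks_to_wav` is a hypothetical helper, not part of any Google SDK.

```python
import io
import wave
from typing import Iterable

OUTPUT_SAMPLE_RATE_HZ = 24_000  # output rate stated in the text
OUTPUT_SAMPLE_WIDTH = 2         # assumption: 16-bit output samples
OUTPUT_CHANNELS = 1             # mono


def pcm_chunks_to_wav(chunks: Iterable[bytes]) -> bytes:
    """Wrap raw PCM chunks in a WAV container and return the file bytes."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(OUTPUT_CHANNELS)
        wav.setsampwidth(OUTPUT_SAMPLE_WIDTH)
        wav.setframerate(OUTPUT_SAMPLE_RATE_HZ)
        for chunk in chunks:
            wav.writeframes(chunk)
    return buf.getvalue()


# Ten 100 ms chunks of silence: 24,000 samples/s * 0.1 s * 2 bytes = 4,800 bytes each.
silent_chunks = [b"\x00" * 4800 for _ in range(10)]
wav_bytes = pcm_chunks_to_wav(silent_chunks)

with wave.open(io.BytesIO(wav_bytes), "rb") as check:
    print(check.getframerate(), check.getnframes())  # prints "24000 24000"
```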
Enterprise — Google Meet
Gemini 3.5 Live Translate is in private preview for select Google Workspace customers as of June 9, 2026. Where enabled, it expands Meet's translation coverage from 5 languages to 70+ languages and supports 2,000+ source/target combinations within a single meeting. Availability is rolling, not universal.
7. When to Use Gemini 3.5 — and When Another Tool Fits Better
When Gemini 3.5 Live Translate is the right choice
- You want translated speech, not translated text. The natural-voice output is the product's biggest advantage.
- You are already in the Google Translate app or Google Meet. Integration is zero-cost to discover and use.
- Your conversations are one-to-one, or have clear turn-taking with pauses between speakers. The voice-consistency limitations Google's model card discloses are less likely to surface in these contexts.
- You are building a developer application where simplifying the STT → MT → TTS chain into a single API matters more than fine-grained control over each stage.
- You can live without speaker attribution in the audio output, and without streaming text transcripts.
When you might prefer a different tool
- You need streaming text alongside or instead of audio. Streaming text is what most production interfaces show on screen during live captioning, conference translation, and accessibility scenarios. Gemini 3.5 Live Translate's text is sidecar-only.
- You need per-speaker attribution in the translated output. The model card's disclosure that voices may "get stuck on one voice during rapid multi-speaker sessions" makes this a real risk for meetings.
- You translate conversations where stability matters more than expressiveness. Audio output cannot be revised mid-utterance, so on languages with late-resolving syntax (Mandarin polarity at the sentence end, Japanese verb at the sentence end), an early commitment can invert the meaning. The benchmark addendum documents one such case.
- You need translated phone calls — dialing a PSTN number with translation running on the line. The Gemini Live API is a building block for developers, not a phone-call provider.
An honest concession. LiveLingo (publishing this guide) sits in the "different tool" category for several of those dimensions — streaming text + audio output, per-speaker attribution, gated-commit displayed transcripts that never retract, translated outbound phone calls. LiveLingo's audio output runs through the host platform's default text-to-speech engine (iOS native on Apple devices), which sounds materially less natural than Gemini 3.5 Live Translate's generated voice. That is a real advantage Google has shipped today. Side-by-side specs: /compare/google-translate. Benchmark numbers: /research/benchmark-2026. OpenAI's comparable surfaces: /guides/openai-live-translation.
8. Frequently Asked Questions
What is Gemini 3.5 Live Translate?
Gemini 3.5 Live Translate is a streaming speech-to-speech translation model released by Google on June 9, 2026. It is built on Gemini 3 Pro, generates translated audio that preserves the speaker's intonation, pacing, and pitch, and auto-detects 70+ languages. It is available to developers via the Gemini Live API and Google AI Studio (public preview), to consumers via the Google Translate app on Android and iOS, and to select Google Workspace customers via Google Meet (private preview).
What languages does Gemini 3.5 Live Translate support?
Over 70 languages, auto-detected. In Google Meet specifically, this expands previous coverage from 5 languages to 70+ languages and supports more than 2,000 source/target combinations within a single meeting.
How much does Gemini 3.5 Live Translate cost?
For consumers, the Google Translate app is free. Developer access via the Gemini Live API and Google AI Studio is billed under Google's Gemini API pricing; check the Gemini API pricing page for current rates and quotas. Enterprise access via Google Meet is gated to select Google Workspace customers in private preview as of June 9, 2026.
How does Gemini 3.5 Live Translate handle multiple speakers?
Per the Gemini 3.5 Audio model card published by Google DeepMind: "Voices can be inconsistent, and voices may shift after long pauses, change gender, or get stuck on one voice during rapid multi-speaker sessions." Practically: one-to-one conversations and turn-taking discussions with clear pauses work well; rapid multi-speaker scenarios are a documented weakness. There is no per-speaker attribution in the translated audio output.
Does Gemini 3.5 Live Translate output text?
The primary output is translated speech. Text transcripts are available, but only as a sidecar of the spoken output — there is no streaming text mode, and the translation-mode API does not accept text input.
What is Gemini 3.5 Live Translate's measured latency?
Google describes the system as staying "a few seconds behind the speaker." Independent measurement by LiveLingo Research on launch day recorded a median first-audio latency of 2,947 ms (p10–p90: 2,859–3,104 ms) across 120 test utterances — a roughly 3-second constant speaking delay. Source: livelingo.io/research/benchmark-2026.
When was Gemini 3.5 Live Translate released?
Google announced and began rolling out Gemini 3.5 Live Translate on June 9, 2026, across the Gemini Live API and Google AI Studio (developer public preview), the Google Translate app on Android and iOS (global rollout starting that day), and Google Meet (private preview for select Workspace customers).
9. Sources
- Google. Fluid, natural voice translation with Gemini 3.5 Live Translate. Google blog, June 9, 2026. blog.google
- Google DeepMind. Gemini 3.5 Audio (Live Translate) — Model Card. deepmind.google
- Google. Gemini API pricing. ai.google.dev/gemini-api/docs/pricing
- Google. Google AI Studio (developer console). aistudio.google.com
- MarkTechPost. Google Releases Gemini 3.5 Live Translate, a Streaming Speech-to-Speech Audio Model Covering 70+ Languages, June 9, 2026. marktechpost.com
- Thurrott. New Gemini 3.5 Live Translate Model Provides Near Real-time Translation in Over 70 Languages. thurrott.com
- Android Headlines. Google Drops Gemini 3.5 Live Translate for Real-Time Conversations, June 9, 2026. androidheadlines.com
- StartupHub.ai. Google Rolls Out Gemini 3.5 Live Translate, 2026. startuphub.ai
- LiveLingo Research. Real-Time Voice Translation Benchmark 2026 — Gemini 3.5 Live Translate addendum, June 9, 2026. livelingo.io/research/benchmark-2026
- LiveLingo. OpenAI Live Translation (2026): ChatGPT Voice, gpt-realtime-translate, and Whisper+GPT Compared. livelingo.io/guides/openai-live-translation
Release date, language coverage, model card disclosures, and consumer/enterprise rollout details verified against the Google blog, Google DeepMind model card, and Gemini API documentation linked above on June 10, 2026. Google may change tiers, regional rollout, Workspace access, and model behavior; consult the linked sources for current state before relying on any specific number.