Which real-time voice translation system has the highest comprehension accuracy in 2026?

LiveLingo's sentence-final translation scored 4.96/5 on a comprehension fidelity composite across 120 utterances and four language pairs (en→es 4.95, en→zh-CN 4.95, en→ja 4.98, en→de 4.97), ahead of Google Cloud Translation v3 (4.77), Azure Speech Translation (4.65), and the Whisper-large + GPT-4o-mini DIY pipeline (4.63). The composite is the arithmetic mean of three independent frontier LLM judges (GPT-4o-2024-11-20, Gemini 2.5 Flash, Claude Sonnet 4.6) scoring each translation 0–5 on a comprehension rubric that penalizes wrong-word substitution, dropped entities, and language-script mixing. LiveLingo placed first or tied for first in 114 of 120 cells (95%) under a pre-registered 0.05 tie threshold. Source: LiveLingo Real-Time Voice Translation Benchmark 2026, Section 2 Comprehension Fidelity.

How does OpenAI's gpt-realtime-translate compare to LiveLingo?

gpt-realtime-translate, released by OpenAI on May 7, 2026 in the Realtime API, is a dedicated streaming speech-to-speech translation model ($0.034 per minute of input audio, 70+ input languages into 13 output languages) — OpenAI's first purpose-built translation product, distinct from DIY Whisper + GPT pipelines. Evaluated June 10, 2026 on the identical 120-utterance protocol, it scored 4.53/5 — the lowest of the six systems measured (LiveLingo 4.96, Gemini 3.5 Live Translate 4.93, Google Cloud 4.77, Azure 4.65, Whisper+GPT-4o-mini 4.63). Head-to-head against LiveLingo it won 4 cells, tied 80, and lost 36; recurring errors were extraneous phrases inserted at utterance starts, meaning inversions, and proper names replaced with common nouns. Its strength is speed: median 711 ms from start of speech to first translated audio, the fastest first output of any system tested. On continuous speech, however, its translated voice falls progressively behind the speaker — median 3.8 s from utterance end to translated-speech arrival on 120-second VOA news clips, drifting up to 20.3 s behind on dense pt→en audio — versus LiveLingo's 1.5-second committed transcript on the same clips. It returns text transcripts of both source and output but has no speaker attribution and no voice selection, and like Gemini 3.5 Live Translate it goes silent when the source code-switches into the output language, dropping that content entirely; LiveLingo passes it through to the transcript. Source: livelingo.io/research/benchmark-2026#comprehension-openai-realtime.

How is translation quality evaluated using LLM-as-judge methodology in 2026?

This benchmark uses three frontier LLMs (GPT-4o, Gemini 2.5 Flash, Claude Sonnet 4.6) from three competing labs (OpenAI, Google, Anthropic) as independent judges. Each judge scores every translation 0–5 on a comprehension fidelity rubric, and the composite is the arithmetic mean of the three scores. Drawing judges from competing labs is a standard mitigation for single-judge bias (cf. Chiang et al. 2024 Chatbot Arena; Zheng et al. 2023 MT-Bench). The 0.05 tie threshold treats differences of 0.05 or less as within sampling noise at n=30 per language pair. Per-cell raw scores from all three judges are published for replication at /research/benchmark-2026/comprehension-results.json.

Research · Published 2026-06-04 · Updated 2026-06-09 · By Ron Villomo

Real-Time Voice Translation Benchmark 2026: Latency, Stability, and Comprehension

Name: Real-Time Voice Translation Benchmark 2026: Latency, Stability, and Comprehension
Creator: LiveLingo Research
Published: 2026-06-04
License: https://creativecommons.org/licenses/by/4.0/

Real-time voice translation benchmark comparing LiveLingo, Google Cloud, Azure, and Whisper+GPT-4o-mini. Section 1: LiveLingo's median Final Transcript Latency is 1.5 s with zero Normalized Erasure, vs 2.7 s (Whisper+GPT-4o-mini), 4.8 s (Azure), and 27 s (Google Cloud). Section 2: comprehension fidelity composite 4.96 / 5 (LiveLingo) vs 4.77 (Google), 4.65 (Azure), 4.63 (Whisper+GPT-4o-mini), n=120 across en→es/zh-CN/ja/de, scored by three independent frontier LLM judges.

Key findings — n=30 utterances across 3 conversational clips

LiveLingo's median Final Transcript Latency is 1.5 seconds after the speaker stops talking (range 1.3–1.8 s across es→en and pt→en clips).
LiveLingo emits zero Normalized Erasures per 120-second clip — no displayed translated token was ever retracted or revised.
Median Final Transcript Latency, competing systems on the same audio: Whisper-large + GPT-4o-mini pipeline 2.7 s; Azure Speech Translation 4.8 s; Google Cloud STT v2 (latest_long) + Translate v3 27 s. (95% bootstrap CIs in the headline table below.)
Normalized Erasures per 120-s clip, competing systems: Whisper-large + GPT-4o-mini ≈22; Azure ≈121; Google Cloud ≈353. That is roughly 0.2, 1, and 3 erasures per second of audio respectively, including outright hallucinations that retract within seconds.
LiveLingo's 1.5 s median falls below the 2–3 s human simultaneous-interpreter ear-voice span (Lee 2002; Chmiel et al. 2017) and well below the 4 s comprehension-degradation threshold (Karakanta et al. 2021, MT Summit).

Headline results

Lower is better for both metrics · 95% bootstrap confidence intervals shown · Raw data at results.json

Median Final Transcript Latency (TTF) and Normalized Erasure (NE) per system, measured on 3 VOA YouTube clips at 120 s each (n=30 utterances).
System	Median TTF ↓	95% CI (bootstrap)	NE per 120 s ↓
LiveLingo (production WebSocket, build 2026-05; n=27)	1518 ms	[1096, 1852] ms	0
Whisper-large + GPT-4o-mini (naive STT+LLM pipeline; n=28)	2720 ms	[1880, 3396] ms	22
Azure Speech Translation (TranslationRecognizer streaming; n=30)	4755 ms	[3620, 9507] ms	121
Google Cloud STT v2 + Translate v3 (latest_long model, default for long-form; n=30)	26736 ms	[20296, 51586] ms	353
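The 95% intervals in the table are bootstrap intervals on the median. This page does not spell out the exact procedure, so the following is only a minimal sketch, assuming a percentile bootstrap with 10,000 resamples (both are assumptions). The latencies in it are made up for illustration and are not the benchmark's per-utterance data.

```python
import random
import statistics


def bootstrap_median_ci(values, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the median of `values`."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(values)
    # Resample with replacement, take the median of each resample, then sort.
    medians = sorted(
        statistics.median(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lo_index = int((alpha / 2) * n_resamples)
    hi_index = int((1 - alpha / 2) * n_resamples) - 1
    return medians[lo_index], medians[hi_index]


# Illustrative latencies in milliseconds (NOT the benchmark's raw data).
latencies_ms = [1100, 1250, 1300, 1400, 1500, 1520, 1600, 1700, 1850, 2400]
low, high = bootstrap_median_ci(latencies_ms)
print(f"median={statistics.median(latencies_ms)} ms, 95% CI=[{low}, {high}] ms")
```

With only 27 to 30 utterances per system the resampled medians are coarse, which is one reason the intervals in the table are wide.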

Note (June 9, 2026): Google's 27-second figure measures its STT-then-translate chain, whose latest_long model holds translations until batch-style finalization rather than streaming them. Google's newly announced Gemini 3.5 Live Translate (developer-preview API) replaces that chain with direct speech-to-speech and removes the bottleneck: measured on the same three clips on its launch day, it spoke translations at a constant ~3.0-second delay (first translated audio 2.8–3.0 s on every clip). The consumer Google Translate app does not yet ship it but may adopt this streaming stack. Both systems translate speech to speech; they differ in voice and delay. Gemini 3.5 Live Translate speaks at a constant ~3-second delay in a voice designed to preserve the speaker's intonation and pacing (per Google's documentation, which also notes replication can drift in multi-speaker audio). LiveLingo speaks the committed translation through the device voice as soon as the text locks in — about 1.5 seconds — alongside a never-revised written transcript. Choose by priority: for the fastest accurate translation with a readable transcript, LiveLingo leads; if a natural, human-sounding translated voice matters most, Gemini 3.5 Live Translate is the strongest option we tested. One coverage note from the same runs: at 86 seconds the zh→en clip code-switches to street interviews spoken in English, the output language, and Gemini stops producing output from that point, leaving the final 34 seconds of content untranslated and untranscribed for the listener. Speech-to-speech translators skip speech already in the output language; LiveLingo passes it through to the transcript and rendered all three clips in full. Details in the Gemini 3.5 Live Translate addendum.

What is the lowest-latency real-time voice translation API as of June 2026?

LiveLingo has the lowest Final Transcript Latency in this benchmark — across all six systems measured — at a median of 1518 ms (95% bootstrap CI 1096–1852, n=27) after the speaker stops talking, measured across utterance turns in three 120-second conversational VOA YouTube clips (zh→en, es→en, pt→en). On the same audio, a Whisper-large + GPT-4o-mini pipeline measured 2720 ms (CI 1880–3396, n=28); Azure Speech Translation 4755 ms (CI 3620–9507, n=30); Google Cloud STT v2 (latest_long) + Translation v3 26736 ms (CI 20296–51586, n=30).

The two audio-to-audio addendum systems, measured on the same utterance-end-to-translation-arrival metric: Gemini 3.5 Live Translate median ~3.1 s (worst 13.9 s); OpenAI gpt-realtime-translate median ~3.8 s (drifting up to 20.3 s on dense audio). These models also emit a faster "first translated audio" measured from speech onset — Gemini ~2.9 s, OpenAI 711 ms — but that is a different metric (different baseline) and is reported separately in each addendum below, not in this TTF figure. Source: this page; raw runs at Reproducibility below.

What is Normalized Erasure in streaming speech translation?

Normalized Erasure (NE) was defined by Arivazhagan, Cherry, Macherey & Foster (IWSLT 2020) as the ratio of tokens deleted across partial-translation revisions to the length of the final translation in tokens. NE = 0 means no displayed token was ever revised. NE is the standard stability metric for re-translation systems in the streaming speech-translation literature.

On the three benchmark clips, LiveLingo emits zero Normalized Erasures per 120-second clip. Whisper-large + GPT-4o-mini emits ≈22; Azure ≈121; Google Cloud STT v2 (latest_long) + Translate v3 ≈353 (≈3 erasures per second of audio).

Concrete example: how a translation system "erases" content

Source (es): "primero que nada hay muchos rumores..."

Google Cloud Translate v3 (interim emits, all retracted within 3 s):
  t=  634 ms: "first"
  t=  851 ms: "first of all"
  t= 1245 ms: "first that nothing"               ← retraction (wrong)
  t= 1453 ms: "first that there is nothing"      ← still wrong
  t= 1705 ms: "first of all there is nothing"    ← negation hallucinated
  t= 2835 ms: "First of all, there are many rumors"  ← finally stable

Azure Speech Translation (interim emits):
  t=  944 ms: "First"
  t= 4355 ms: "...rumors in the United States"   ← hallucinated location
  t= 5887 ms: "...for Venezuelans who are at the border"  ← retracts
  t= 6870 ms: flips back to "United States"      ← still unstable

LiveLingo (gated commit, monotonic):
  t= 2163 ms: "First of all"                     ← stable, never retracts
  t= 4852 ms: + "there are many rumors for Venezuelans that"
  t= 6579 ms: + "are at the border at this moment"

How low does translation latency need to be to feel real-time?

Three published thresholds bound the real-time-translation problem:

1.0 second — Card, Moran & Newell The Psychology of Human-Computer Interaction (Erlbaum 1983) and Newell Unified Theories of Cognition (Harvard University Press 1990) established 1 second as the threshold above which a system response breaks "uninterrupted flow of thought." Restated in Card, Robertson & Mackinlay (CHI '91) and Robertson, Card & Mackinlay (Communications of the ACM 36(4), April 1993).
2–3 seconds — professional simultaneous interpreters target an ear-voice span of 2–3 seconds. Lee (2002) Meta 47(4) and Chmiel et al. (2017) Applied Psycholinguistics 38(5) quantified this from professional-interpreter corpora. Pöchhacker Introducing Interpreting Studies (Routledge 2004) is the standard secondary reference.
4 seconds — Karakanta et al. (2021), MT Summit showed that for live subtitles, comprehension degrades significantly beyond 4 seconds of delay.

LiveLingo's 1.5-second median Final Transcript Latency falls below the human-interpreter ear-voice span (2–3 s) and well below the 4-second comprehension-degradation threshold. It remains above the 1.0-second flow-of-thought threshold.

Why does Google Cloud's default streaming config measure 27 seconds median latency?

Google Cloud Speech-to-Text v2's latest_long model — Google's documented recommendation for long-form audio in the model selection guide — emits is_final transcript events only 3–4 times across 120 seconds of continuous speech because the model is optimized for batch-style finalization, not per-utterance streaming. Since the translation chain (STT v2 + Translation v3) only commits a stable translation when STT emits is_final, the apparent latency from end-of-utterance to final translation can exceed 30 seconds. Switching to latest_short or chirp_2 commits more frequently but with different stability tradeoffs. The measured 27-second median (95% CI 20–52 s) reflects the default-recommended product configuration for long-form audio, not a peculiar mis-configuration.
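The arithmetic behind this effect can be sketched in a few lines. The sketch assumes a translation is committed only at the next is_final event at or after each utterance end. All timings below are hypothetical and chosen only to show how a sparse is_final cadence inflates commit latency; they are not measurements from this benchmark.

```python
import bisect
import statistics


def commit_latencies_s(utterance_ends_s, is_final_times_s):
    """Delay from each utterance end to the next is_final event at or after it."""
    finals = sorted(is_final_times_s)
    latencies = []
    for end in utterance_ends_s:
        i = bisect.bisect_left(finals, end)
        if i < len(finals):  # utterances with no later is_final are skipped
            latencies.append(finals[i] - end)
    return latencies


# Hypothetical timings in seconds: ten utterance ends, only four is_final events.
utterance_ends = [5, 16, 27, 38, 49, 60, 71, 82, 93, 104]
is_final_events = [32, 61, 95, 120]

latencies = commit_latencies_s(utterance_ends, is_final_events)
print(latencies)                     # per-utterance wait for the next is_final
print(statistics.median(latencies))  # median wait across utterances
```

An utterance that ends just after an is_final event waits almost the full gap until the next one, so the fewer finalization events there are, the longer the worst-case wait.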

Update (June 9, 2026): Google announced Gemini 3.5 Live Translate, a speech-to-speech model that replaces the STT-then-translate chain and removes this finalization bottleneck. It is currently a public-preview API for developers and has not yet rolled out inside the consumer Google Translate app. Measured on this benchmark's audio on launch day, it speaks translations with a constant ~3-second delay (median 2.9 s to first translated speech) — a large improvement over the 27-second STT-chain figure above, and still roughly double LiveLingo's 1.5-second median final-transcript latency. Full results in the Gemini 3.5 Live Translate addendum.

Section 2

Comprehension Fidelity

How accurately each system conveys source meaning, scored by three independent frontier LLM judges across 120 utterances and four language pairs (en→es, en→zh-CN, en→ja, en→de). 30 sentences per pair, evenly weighted across five conversational domains (restaurant, hotel, family, business, daily-life).

Placement summary — n=120 cells · 3-judge composite · tie threshold 0.05

LiveLingo placed first or tied for first on 114 of 120 cells (95%) by 3-judge composite score (GPT-4o-2024-11-20, Gemini 2.5 Flash, Claude Sonnet 4.6 — three frontier models from three competing labs).
Per-language composite means (0–5 scale): en→es 4.95; en→zh-CN 4.95; en→ja 4.98; en→de 4.97. All margins over runner-up exceed the pre-registered 0.05 tie threshold.
Real concessions: Azure ties LiveLingo on the hotel domain (5.00 each) and beats LiveLingo on the family domain (5.00 vs 4.94). LiveLingo trails on 6 of 120 cells outright; 4 of those 6 are food-vocabulary translation in restaurant contexts.
Other systems' composites: Google Cloud Translation v3 4.77; Azure Speech Translation 4.65; Whisper-large + GPT-4o-mini 4.63.
Combined with the Section 1 latency benchmark, LiveLingo is the only system in the top tier on both axes — top-quartile comprehension and category-leading sentence-final commit latency.

Per-language composite

Higher is better · 3-judge composite mean (GPT-4o + Gemini 2.5 Flash + Claude Sonnet 4.6) · per-cell raw data

3-judge composite mean per language pair (n=30 cells per pair). Delta = LiveLingo minus runner-up; positive = LiveLingo wins. All deltas exceed the 0.05 tie threshold.
Pair	LiveLingo	Whisper+GPT-4o-mini	Azure	Google	Δ runner-up
en → es	4.95	4.78	4.65	4.72	+0.17
en → zh-CN	4.95	4.57	4.58	4.73	+0.22
en → ja	4.98	4.50	4.65	4.80	+0.18
en → de	4.97	4.66	4.70	4.82	+0.15
Overall (n=120)	4.96	4.63	4.65	4.77	+0.20

Per-domain composite — including the cells where competitors win

Aggregated across all 4 language pairs (n=24 cells per domain × system). Within 0.05 of the top counts as tied.

Domain	LiveLingo	Whisper+GPT-4o-mini	Azure	Google	Verdict
Restaurant	4.88	4.73	4.44	4.65	LiveLingo wins
Hotel	5.00	4.67	5.00	4.75	Tie with Azure
Family	4.94	4.90	5.00	4.98	Azure and Google tie; LiveLingo 0.06 behind Azure
Business	5.00	4.13	4.06	4.54	LiveLingo wins
Daily	5.00	4.63	4.73	4.92	LiveLingo wins

Six cells where LiveLingo loses

Listed for transparency. Of LiveLingo's 6 non-top placements in 120 cells, 4 are restaurant / food-vocabulary translations. The other 2 are family/emotional content in Japanese and Spanish.

Cell	Domain	Language	LiveLingo	Top	Winner(s)
rest-01	Restaurant	en→zh-CN	4.50	5.00	Whisper+GPT, Azure, Google (all tied)
rest-02	Restaurant	en→de	4.00	5.00	Azure
rest-03	Restaurant	en→zh-CN	4.00	5.00	Whisper+GPT, Azure, Google (all tied)
rest-01	Restaurant	en→es	4.50	5.00	Whisper+GPT
fam-02	Family	en→ja	4.50	5.00	Whisper+GPT, Azure
fam-06	Family	en→es	4.00	5.00	Whisper+GPT, Azure, Google (all tied)

Methodology

Audio. 30 English source utterances, 16 kHz mono WAV (download audio.zip, 3.2 MB), authored across five conversational domains. Each utterance tested in all four target languages.
Systems compared via developer APIs. LiveLingo sentence-final translation; Whisper-large + GPT-4o-mini (DIY pipeline); Azure Speech Translation; Google Cloud Speech v2 + Translation v3. All systems were tested on their cloud tier (the consumer-facing Microsoft Translator and Google Translate apps also require a cloud connection for voice translation in their free versions).
Three independent judges. GPT-4o-2024-11-20, Gemini 2.5 Flash (non-thinking), Claude Sonnet 4.6. Each judge scored every translation 0–5 on a comprehension fidelity rubric (penalize wrong-word substitution, dropped entities, punctuation that changes sentence type, language-script mixing; do not penalize stylistic register or valid synonyms). Composite = arithmetic mean of the three judge scores per cell.
Tie threshold. Two systems whose mean composite scores differ by ≤0.05 on a given slice are reported as tied, treating that margin as within sampling noise at n=30 per language pair.
Raw per-cell data. comprehension-results.json — every translation, every judge score, every per-cell composite.
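The composite and tie rules above can be sketched as follows. This is a minimal illustration, not the benchmark's own scoring script: the function names and the small epsilon guard are assumptions, and the judge scores in the last example are hypothetical. The hotel-domain figures are taken from the per-domain table above.

```python
from statistics import mean

TIE_THRESHOLD = 0.05
EPSILON = 1e-9  # guards the threshold comparison against float rounding


def composite(judge_scores):
    """Arithmetic mean of the per-judge 0-5 scores for one translation."""
    return mean(judge_scores)


def first_or_tied(scores_by_system):
    """Systems whose score is within the tie threshold of the top score."""
    top = max(scores_by_system.values())
    return {
        name
        for name, score in scores_by_system.items()
        if top - score <= TIE_THRESHOLD + EPSILON
    }


# Hotel-domain composites from the per-domain table.
hotel = {"LiveLingo": 5.00, "Whisper+GPT-4o-mini": 4.67, "Azure": 5.00, "Google": 4.75}
print(sorted(first_or_tied(hotel)))  # ['Azure', 'LiveLingo']

# Hypothetical judge scores for a single cell (illustrative only).
print(round(composite([5, 4, 5]), 2))  # 4.67
```

Applying the threshold relative to the top score, rather than pairwise between neighbours, keeps the "first or tied for first" count well defined when three or more systems sit close together.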

Limitations

n=30 per language pair is enough for directional claims with a 0.05 tie threshold but not for fine-grained sub-domain analysis. Statistical confidence intervals would need a larger sample.
Authored source text. Sentences were drafted across five domains to reflect typical conversational patterns. The audio is published so any reader can verify that the sentence mix matches realistic use.
LLM judges are proprietary models. Composite scores reduce single-judge bias but cannot eliminate it. We used three frontier models from three competing labs (OpenAI, Google, Anthropic) as the strongest available mitigation; inter-judge agreement was high but not perfect.
Offline modes are out of scope — voice translation in Microsoft Translator and Google Translate apps requires cloud connection regardless of mode. See the methodology disclosure in the Section 1 Methodology box.

Addendum (June 9, 2026): Gemini 3.5 Live Translate

Evaluated on its launch day. Kept separate from the tables above, which report the original pre-registered four-system study.

On June 9, 2026 Google announced Gemini 3.5 Live Translate, a streaming speech-to-speech translation model. It is available to developers as a public-preview API and has not yet rolled out inside the consumer Google Translate app. We evaluated it the same day on the identical protocol: the same 120 published utterances, the same four language pairs, the same comprehension rubric and judge prompts (scored by the GPT-4o and Gemini 2.5 Flash judges; on the original four systems this two-judge composite matches the published three-judge composite to two decimals).

Comprehension composite: 4.93 / 5 — the strongest result yet from a competing system, well ahead of the Google Cloud STT v2 + Translation v3 stack (4.77), Azure (4.65), and Whisper+GPT-4o-mini (4.63). LiveLingo's 4.96 remains the highest composite on the benchmark.
Latency: a constant ~3-second speaking delay. Median 2,947 ms from start of speech to first translated audio across all 120 sessions (p10–p90: 2,859–3,104 ms) — dramatically faster than Google's STT-chain configuration measured in Section 1, and roughly double LiveLingo's 1,518 ms median final-transcript latency (LiveLingo's spoken rendering starts as the text commits, so its speech-to-speech delay is also ~1.5 s).
Voice quality is its differentiator. Both systems translate speech to speech; Gemini 3.5 Live Translate speaks in a voice designed to preserve the speaker's intonation and pacing (per Google's documentation, which also notes replication can shift after pauses or pick the wrong voice in rapid multi-speaker audio) — the strongest option we tested for natural, human-sounding spoken translation. LiveLingo speaks through the device voice and pairs it with a committed, never-revised written transcript. The Gemini API has no streaming text mode and no speaker attribution; text transcripts are available only as a sidecar of the spoken output, and spoken output cannot be revised after it is said, so early commitments are permanent.
Long-form behavior. This benchmark's utterances are 4–7 second sentences. On 40-second continuous business speech outside the benchmark format, we measured factual inversions in zh→en — a reported 15% sales increase rendered as a goal to increase sales by 15% — the error class that irreversible mid-sentence commitment produces on late-resolving sentence structures.
Code-switched speech on the Section 1 clips. Run on the same three 120-second VOA clips as the latency benchmark, it held its ~3-second lag throughout on es→en and pt→en (rolling lag stayed within 0.8–3.5 s across the full two minutes, and output speech volume matched the source). On the zh→en clip the source code-switches at 86 seconds: the final 34 seconds (~28% of the clip) are street interviews spoken in English, the session's output language. Gemini's translation output stops at that point in every run: speech already in the output language is not translated, re-spoken, or transcribed, so that content silently disappears for the listener with no error surfaced. This is a structural behavior of current speech-to-speech translators, not a Gemini-specific defect; OpenAI's gpt-realtime-translate (tested June 10, 2026) goes silent at the same 86-second mark on the same clip, and OpenAI's documentation states it may skip speech already in the output language. Real conversations code-switch constantly. LiveLingo passes target-language speech straight through to the transcript, so all three clips render in full.

Raw per-cell data: gemini-live-results.json — every translation, judge score, and per-session latency.

Addendum (June 10, 2026): OpenAI gpt-realtime-translate

Kept separate from the tables above, which report the original pre-registered four-system study.

On May 7, 2026 OpenAI released gpt-realtime-translate in its Realtime API: a dedicated streaming speech-to-speech translation model ($0.034 per minute of input audio, 70+ input languages into 13 output languages). It is OpenAI's first purpose-built translation product, distinct from the DIY Whisper + GPT-4o-mini pipeline measured in the main study. We evaluated it on June 10, 2026 on the identical protocol: the same 120 published utterances, the same four language pairs, the same comprehension rubric and judge prompts (GPT-4o and Gemini 2.5 Flash judges; note the GPT-4o judge is OpenAI's own model).

Comprehension composite: 4.53 / 5 — the lowest of the six systems measured on this benchmark (LiveLingo 4.96, Gemini 3.5 Live Translate 4.93, Google Cloud 4.77, Azure 4.65, Whisper+GPT-4o-mini 4.63). Head-to-head against LiveLingo at the cell level: 4 wins, 80 ties, 36 losses. The recurring error classes: extraneous phrases prepended at utterance starts ("I'm not really hungry" opened with "Una relación."), meaning inversions ("I was stressed about work" rendered as a wish to be stressed), and proper names replaced with common nouns ("Marcus" became "the market" in Chinese).
Fastest first audio of any system tested. Median 711 ms from start of speech to first translated audio across all 120 sessions (p10–p90: 485–1,012 ms), four times faster to first output than Gemini 3.5 Live Translate's ~2.9 s. Speed is this model's genuine strength.
But it falls behind on continuous speech. On the Section 1 VOA clips, measuring from each source-utterance end (the same energy-VAD boundaries as the published latency table) to the arrival of the translated speech for that utterance: median 3.8 s (n=25), drifting as far as 20.3 s behind the speaker on the dense pt→en clip as untranslated backlog accumulated. Gemini 3.5 Live Translate on the same method: median 3.1 s, worst 13.9 s. LiveLingo's committed transcript on the same clips: median 1.5 s, because translated text does not need to be spoken aloud at speech pace to be delivered.
Output surfaces. It returns translated audio plus text transcripts of both the source speech and the translated output — closer to LiveLingo's transcript-first surface than Gemini's audio-only API. There is no speaker attribution, no voice selection, and spoken output cannot be revised after it is said.
Code-switched speech: same gap as Gemini. On the zh→en VOA clip it goes silent at the 86-second switch to English speech, exactly as Gemini does (detailed in the addendum above); OpenAI's documentation states it may skip speech already in the output language. LiveLingo passes target-language speech through to the transcript and rendered all three clips in full.

Raw per-cell data: openai-realtime-results.json — every translation, judge score, and per-session latency. Same source audio as all other systems (audio.zip above).

Test set

Three 120-second conversational clips drawn from Voice of America (VOA) news interviews and panel discussions. VOA productions are works of the U.S. federal government and are in the public domain (17 U.S.C. § 105), which allows redistribution and re-evaluation without licensing constraints. Audio characteristics mirror live conferencing and broadcast use rather than read-speech corpora like LibriSpeech.

zh→en: 时事大家谈: Biden-Xi summit panel discussion (Voice of America Mandarin; YouTube, 90–210 s; 120 s)
es→en: Exclusiva VOA: programa migratorio para venezolanos (Voz de América; YouTube, 0–120 s; 120 s)
pt→en: Grande Entrevista: Adalberto Costa Jr (UNITA, Angola) (VOA Português; YouTube, 30–150 s; 120 s)

Methodology

Final Transcript Latency (TTF)

For a speaker turn ending at wall-clock time t_end, TTF is the time at which the system emits the final, non-revised translation of that turn, minus t_end. t_end is detected by an energy-based VAD on the source audio (30-ms frames; ≥500 ms of silence ends an utterance), applied uniformly to all systems. This is operationally equivalent to the "Final Transcript Latency" terminology used by production STT vendors (Gladia 2024; Speechmatics 2024) and is NOT the same as Average Lagging (Ma et al. 2019, STACL), which measures per-token streaming lag during emission. We measure to the final emission rather than the first because end-user perception of "the translation arrived" maps to the point at which displayed text stops mutating; the headline figure is the median across turns.
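A minimal sketch of the boundary detection and the latency subtraction follows. It assumes per-frame energies have already been computed and that a single fixed energy threshold separates speech from silence. The threshold value, the function names, and the example numbers are assumptions; the page specifies only the 30-ms frame and the 500-ms silence rule.

```python
import math


def utterance_end_times_ms(frame_energies, threshold, frame_ms=30, min_silence_ms=500):
    """End time of each utterance: the end of the last speech frame before
    at least `min_silence_ms` of consecutive below-threshold frames."""
    frames_needed = math.ceil(min_silence_ms / frame_ms)  # 17 frames at 30 ms
    ends = []
    in_speech = False
    silent_run = 0
    last_speech_end_ms = 0
    for index, energy in enumerate(frame_energies):
        if energy >= threshold:
            in_speech = True
            silent_run = 0
            last_speech_end_ms = (index + 1) * frame_ms
        elif in_speech:
            silent_run += 1
            if silent_run >= frames_needed:
                ends.append(last_speech_end_ms)
                in_speech = False
                silent_run = 0
    if in_speech:  # audio ended while an utterance was still open
        ends.append(last_speech_end_ms)
    return ends


def final_transcript_latency_ms(t_end_ms, final_emit_ms):
    """TTF for one turn: time of the final, non-revised emission minus t_end."""
    return final_emit_ms - t_end_ms


# Synthetic energies: 10 speech frames, 20 silent, 5 speech, 5 silent.
energies = [1.0] * 10 + [0.0] * 20 + [1.0] * 5 + [0.0] * 5
print(utterance_end_times_ms(energies, threshold=0.5))  # [300, 1050]
print(final_transcript_latency_ms(300, 1818))           # 1518
```

Because the same boundaries are applied to every system, any bias in where the VAD places t_end shifts all systems' latencies by the same amount.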

Normalized Erasure (NE)

We adopt the Normalized Erasure metric of Arivazhagan, Cherry, Macherey & Foster (IWSLT 2020): sum of tokens deleted across all partial-translation revisions, divided by the length of the final translation in tokens. Operationally, we walk the sequence of partial translation emits in time order; for each adjacent pair (T_(i-1), T_i), we count every token of T_(i-1) that falls outside the longest common prefix it shares with T_i as an erasure. The published numbers are raw erasure counts per 120-s clip (the user-facing alias of NE); the normalized form follows directly from dividing by final-translation length.
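The counting rule can be sketched as below, using the Google interim emits quoted in the concrete example earlier on this page. The tokenization (lowercased word tokens, punctuation dropped) is an assumption of this sketch, and the benchmark's own tokenizer may differ, so the counts are illustrative rather than a reproduction of the published figures.

```python
import re


def tokenize(text):
    """Lowercase word tokens; punctuation is dropped (an assumption of this sketch)."""
    return re.findall(r"\w+", text.lower())


def common_prefix_len(a, b):
    """Number of leading tokens that two token lists share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def erasure_count(emits):
    """Tokens of each partial that fall outside the prefix shared with the next one."""
    tokens = [tokenize(e) for e in emits]
    return sum(
        len(prev) - common_prefix_len(prev, cur)
        for prev, cur in zip(tokens, tokens[1:])
    )


def normalized_erasure(emits):
    """Erased tokens divided by the token length of the final translation."""
    return erasure_count(emits) / len(tokenize(emits[-1]))


google_emits = [
    "first",
    "first of all",
    "first that nothing",
    "first that there is nothing",
    "first of all there is nothing",
    "First of all, there are many rumors",
]
print(erasure_count(google_emits), round(normalized_erasure(google_emits), 2))  # 9 1.29

# A monotonic, append-only sequence never erases anything.
append_only = [
    "First of all",
    "First of all there are many rumors for Venezuelans that",
]
print(erasure_count(append_only))  # 0
```

An NE above 1 means more tokens were shown and withdrawn than survive in the final translation.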

System configurations

LiveLingo: production WebSocket endpoint at livelingo.io/app, gated-commit streaming with utterance-boundary detection on the inbound audio stream. Build 2026-05-28.
Google Cloud: google-cloud-speech 2.27.0 streaming v2 with model=latest_long (vendor-recommended for long-form audio); google-cloud-translate v3.16.0 text-translation called on each STT result.
Azure Speech Translation: azure-cognitiveservices-speech 1.40.0, TranslationRecognizer streaming endpoint.
Whisper + GPT-4o-mini: OpenAI whisper-1 for STT + gpt-4o-mini for translation, chunked at 5-second audio windows (a baseline reflecting what a naive developer pipeline would build).

All client→vendor traffic originated from a single workstation in us-east, real-time-paced at 100-ms audio chunks for LiveLingo (matching production webapp behaviour) and 200-ms chunks for Google and Azure (matching their SDK examples). Test date: 2026-06-04.
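Real-time pacing of this kind can be sketched as a generator that releases each chunk no earlier than its scheduled wall-clock time. The sketch assumes 16-bit PCM at 16 kHz mono (the bit depth is an assumption; the page states only the sample rate and channel count), and the function names are hypothetical rather than taken from the runner scripts.

```python
import time
from typing import Callable, Iterator

SAMPLE_RATE_HZ = 16_000
BYTES_PER_SAMPLE = 2  # assumption: 16-bit PCM


def chunk_size_bytes(chunk_ms: int) -> int:
    """Bytes of mono PCM audio covered by one chunk of `chunk_ms` milliseconds."""
    return SAMPLE_RATE_HZ * chunk_ms // 1000 * BYTES_PER_SAMPLE


def paced_chunks(
    pcm: bytes,
    chunk_ms: int,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[bytes]:
    """Yield fixed-size chunks, each no earlier than its real-time schedule."""
    size = chunk_size_bytes(chunk_ms)
    start = clock()
    for index, offset in enumerate(range(0, len(pcm), size)):
        due = start + index * chunk_ms / 1000
        delay = due - clock()
        if delay > 0:  # scheduling against the start time avoids cumulative drift
            sleep(delay)
        yield pcm[offset:offset + size]


print(chunk_size_bytes(100), chunk_size_bytes(200))  # 3200 6400
```

A caller would iterate over `paced_chunks(audio_bytes, 100)` and send each chunk to the vendor's streaming endpoint. The clock and sleep functions are injectable so the pacing logic can be checked without waiting in real time.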

Reproducibility

The benchmark is replicable end-to-end. The audio clips above are public-domain VOA broadcasts, downloadable by anyone via the YouTube URLs listed in the test-set table. The four runner scripts — run_livelingo.py, run_google_cloud.py, run_azure.py, run_whisper_gpt.py — accept identical CLI arguments (--in WAV --source LANG --target LANG --out JSON) and emit a uniform per-run JSON with monotonic-ns timestamps for every partial emission. Per-emit JSONs are available on request; a packaged GitHub release with pinned dependencies is planned.
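For readers building a compatible runner, the shared command-line interface and timestamped emission log might look like the sketch below. Only the four flags come from this page. The JSON field names (`t_ns`, `text`, `is_final`), the helper names, and the file names are hypothetical and are not the published schema.

```python
import argparse
import json
import time


def build_parser():
    """Parser for the four flags the runner scripts are described as sharing."""
    parser = argparse.ArgumentParser(description="Hypothetical benchmark runner CLI")
    # "in" is a Python keyword, so the value is stored under a different name.
    parser.add_argument("--in", dest="wav_path", required=True, metavar="WAV")
    parser.add_argument("--source", required=True, metavar="LANG")
    parser.add_argument("--target", required=True, metavar="LANG")
    parser.add_argument("--out", required=True, metavar="JSON")
    return parser


def record_emit(log, text, is_final):
    """Append one partial emission with a monotonic nanosecond timestamp."""
    log.append({"t_ns": time.monotonic_ns(), "text": text, "is_final": is_final})


args = build_parser().parse_args(
    ["--in", "clip.wav", "--source", "es", "--target", "en", "--out", "run.json"]
)
emits = []
record_emit(emits, "First of all", is_final=False)
record_emit(emits, "First of all there are many rumors", is_final=True)
print(json.dumps({"source": args.source, "target": args.target, "emits": emits}, indent=2))
```

A monotonic clock is the natural choice here because latency is a difference between two timestamps, and wall-clock adjustments during a run would otherwise corrupt it.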

To verify a number on this page yourself: download a clip from the test-set YouTube link with yt-dlp, convert to 16-kHz mono PCM, and stream it to each vendor at real-time pace. Per-run cost is under $0.50 across the four systems.
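The download and conversion steps can be sketched as two command lines, built here in Python without being executed. The video URL and file names are placeholders, not values from this page. The start offsets (90 s, 0 s, 30 s) come from the test-set list, and 16-bit samples are an assumption.

```python
import shlex


def download_command(url, out_template="raw.%(ext)s"):
    """yt-dlp invocation that extracts the audio track as WAV."""
    return ["yt-dlp", "-x", "--audio-format", "wav", "-o", out_template, url]


def trim_and_convert_command(src, dst, start_s, duration_s=120):
    """ffmpeg invocation: cut the clip window, then write 16-kHz mono 16-bit PCM."""
    return [
        "ffmpeg",
        "-ss", str(start_s),    # window start within the source video
        "-t", str(duration_s),  # window length in seconds
        "-i", src,
        "-ac", "1",             # mono
        "-ar", "16000",         # 16 kHz sample rate
        "-c:a", "pcm_s16le",    # assumption: 16-bit little-endian PCM
        dst,
    ]


# Placeholder URL; substitute the link from the test-set list.
placeholder_url = "https://www.youtube.com/watch?v=VIDEO_ID"
print(shlex.join(download_command(placeholder_url)))
print(shlex.join(trim_and_convert_command("raw.wav", "zh_en.wav", start_s=90)))
```

The printed commands can be pasted into a shell, or the argument lists passed to `subprocess.run`, once yt-dlp and ffmpeg are installed.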

Download the data

All results are published under a CC-BY 4.0 licence — reuse them freely with attribution:

results.json — Section 1 latency/stability summary per system.
results.csv — Section 1 per-utterance metrics (latency, erasure, chrF, COMET).
comprehension-results.json — Section 2 per-cell translations and raw judge scores (n=120 × 4 systems).
comprehension-results.csv — Section 2 as a flat CSV, one row per (utterance, system).
audio.zip — the 30 English source audio files (16 kHz mono WAV, 3.2 MB).
gemini-live-results.json / openai-realtime-results.json — speech-to-speech addenda runs.

Limitations

Sample size. n = 27–30 utterances per system across 3 clips (6 minutes of audio total). Sufficient for the order-of-magnitude differences reported (1.5 s vs 27 s) but not for small differences between systems of similar quality; 95% bootstrap CIs are reported per row. A larger study across MuST-C tst-COMMON is planned.
Language coverage. Three pairs (zh→en, es→en, pt→en); Spanish and Portuguese benefit from high training-data volume across all systems and may not generalize to low-resource languages.
Domain. News interview / panel format only. Spontaneous multi-party speech, code-switching, heavy accent, and background music are out of scope for v1.
Google model selection. Google Cloud was tested with latest_long, the vendor-recommended model for long-form audio. latest_short and chirp_2 may yield different stability/latency tradeoffs and are on the v2 roadmap.
Single network region. All API calls made from us-east. Users in other regions may see different absolute latencies; relative ordering should be stable but is not independently verified.
Vendor opacity. Closed-source systems update server-side without notice. Numbers are valid as of 2026-06-04 and will be re-run on a published schedule.
Selection bias disclosure. LiveLingo conducted and published this benchmark. All raw run JSONs are available; competing vendors are invited to suggest configuration adjustments.

Citations

Arivazhagan, N., Cherry, C., Macherey, W., & Foster, G. (2020). Re-translation versus Streaming for Simultaneous Translation. Proceedings of IWSLT 2020. aclanthology.org/2020.iwslt-1.27.
Card, S. K., Moran, T. P., & Newell, A. (1983). The Psychology of Human-Computer Interaction. Lawrence Erlbaum Associates.
Card, S. K., Robertson, G. G., & Mackinlay, J. D. (1991). The Information Visualizer: An Information Workspace. Proceedings of CHI '91, 181–186. doi.org/10.1145/108844.108874.
Chmiel, A., Szarkowska, A., Koržinek, D., et al. (2017). Ear–voice span and pauses in interpreting trainees. Applied Psycholinguistics, 38(5), 1185–1208.
Karakanta, A., et al. (2021). Towards the evaluation of automatic simultaneous speech translation from a communicative perspective. Proceedings of IWSLT 2021 / MT Summit. aclanthology.org/2021.iwslt-1.29.
Lee, T.-H. (2002). Ear voice span in English into Korean simultaneous interpretation. Meta, 47(4), 596–606. erudit.org/en/journals/meta/2002-v47-n4-meta688/008036ar.
Ma, M., Huang, L., et al. (2019). STACL: Simultaneous Translation with Implicit Anticipation and Controllable Latency. Proceedings of ACL 2019. arxiv.org/abs/1810.08398.
Ma, X., Dousti, M. J., Wang, C., Gu, J., & Pino, J. (2020). SimulEval: An Evaluation Toolkit for Simultaneous Translation. Proceedings of EMNLP 2020 (Demo). arxiv.org/abs/2007.16193.
Newell, A. (1990). Unified Theories of Cognition. Harvard University Press.
Nielsen, J. (1993). Usability Engineering. Morgan Kaufmann / Academic Press. (See § 5.5 "Response Times".)
Papi, S., Negri, M., & Turchi, M. (2022). Over-Generation Cannot Be Rewarded: Length-Adaptive Average Lagging. Proceedings of AutoSimTrans @ NAACL 2022. arxiv.org/abs/2206.05807.
Pöchhacker, F. (2004). Introducing Interpreting Studies. Routledge.

Cite this benchmark

LiveLingo Research (2026). Real-Time Voice Translation Benchmark 2026: Latency, Stability, and Comprehension. https://www.livelingo.io/research/benchmark-2026 https://doi.org/10.5281/zenodo.21250032

@misc{livelingo2026benchmark2026,
  author       = {{LiveLingo Research}},
  title        = {Real-Time Voice Translation Benchmark 2026: Latency, Stability, and Comprehension},
  year         = {2026},
  howpublished = {\url{https://www.livelingo.io/research/benchmark-2026}},
  doi          = {10.5281/zenodo.21250032},
  note         = {Dataset (CC-BY 4.0): https://www.livelingo.io/research/benchmark-2026/comprehension-results.json},
}
