Research · By Ron Villomo

Real-Time Voice Translation Benchmark 2026: Latency and Stability

Real-time voice translation benchmark comparing LiveLingo, Google Cloud, Azure, and Whisper + GPT-4o-mini on conversational audio. LiveLingo's median Final Transcript Latency is 1.5 seconds (95% CI 1.1–1.9, n=27) with zero Normalized Erasure. On the same audio, Whisper-large + GPT-4o-mini, Azure, and Google Cloud measure 2.7 s, 4.8 s, and 27 s, with 22, 121, and 353 erasures per 120-second clip respectively.

Key findings — n=30 utterances across 3 conversational clips

  1. LiveLingo's median Final Transcript Latency is 1.5 seconds after the speaker stops talking (range 1.3–1.8 s across es→en and pt→en clips).
  2. LiveLingo emits zero Normalized Erasures per 120-second clip — no displayed translated token was ever retracted or revised.
  3. Median Final Transcript Latency, competing systems on the same audio: Whisper-large + GPT-4o-mini pipeline 2.7 s; Azure Speech Translation 4.8 s; Google Cloud STT v2 (latest_long) + Translate v3 27 s. (95% bootstrap CIs in the headline table below.)
  4. Normalized Erasures per 120-s clip, competing systems: Whisper-large + GPT-4o-mini ≈22; Azure ≈121; Google Cloud ≈353. That is roughly 0.2, 1, and 3 erased tokens per second of audio respectively, and the retracted output includes outright hallucinations that are withdrawn within seconds.
  5. LiveLingo's 1.5 s median is shorter than the 2–3 s human simultaneous-interpreter ear-voice span (Lee 2002; Chmiel et al. 2017) and well below the 4 s comprehension-degradation threshold (Karakanta et al. 2021, MT Summit).

Headline results

Lower is better for both metrics · 95% bootstrap confidence intervals shown · Raw data at results.json

Median Final Transcript Latency (TTF) and Normalized Erasure (NE) per system, measured on 3 VOA YouTube clips at 120 s each (n=30 utterances).
System                              Configuration                                    Median TTF  NE per 120 s
LiveLingo                           gpt-realtime backend, 2026-05 (n=27)             1518 ms     0
Whisper-large + GPT-4o-mini         naive STT+LLM pipeline (n=28)                    2720 ms     22
Azure Speech Translation            TranslationRecognizer streaming (n=30)           4755 ms     121
Google Cloud STT v2 + Translate v3  latest_long model (default for long-form, n=30)  26736 ms    353
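The 95% bootstrap confidence intervals reported for the medians can be approximated with a percentile bootstrap. The following is a minimal sketch, assuming a plain percentile method with 10,000 resamples; the function name, the resample count, and the sample latencies are illustrative assumptions, not the benchmark's actual code or data.

```python
import random
import statistics


def bootstrap_median_ci(values, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the median of `values`.

    Returns (median, ci_low, ci_high). The resample count and the percentile
    method are assumptions; the page does not state which variant was used.
    """
    if not values:
        raise ValueError("need at least one measurement")
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the median of each resample.
    medians = sorted(
        statistics.median(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return statistics.median(values), medians[lo_idx], medians[hi_idx]


# Hypothetical per-utterance latencies in ms (illustrative, NOT benchmark data).
sample_ttf_ms = [1310, 1420, 1480, 1500, 1518, 1530, 1610, 1700, 1790]
median, ci_low, ci_high = bootstrap_median_ci(sample_ttf_ms)
print(f"median={median} ms, 95% CI=({ci_low}, {ci_high}) ms")
```

With only about 30 utterances per system, the interval is wide enough that small gaps between systems should not be over-read.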

What is the lowest-latency real-time voice translation API as of June 2026?

LiveLingo has the lowest Final Transcript Latency in this benchmark at a median of 1.5 seconds (range 1.3–1.8 s) after the speaker stops talking, measured across 30 utterance turns in three 120-second conversational VOA YouTube clips (zh→en, es→en, pt→en). On the same audio, median latency was 2.7 seconds for a Whisper-large + GPT-4o-mini pipeline, 4.8 seconds for Azure Speech Translation, and 27 seconds for Google Cloud STT v2 (latest_long) + Translate v3 (24–50 seconds observed). Source: the headline table on this page; raw runs are described under Reproducibility below.

What is Normalized Erasure in streaming speech translation?

Normalized Erasure (NE) was defined by Arivazhagan, Cherry, Macherey & Foster (IWSLT 2020) as the ratio of tokens deleted across partial-translation revisions to the length of the final translation in tokens. NE = 0 means no displayed token was ever revised. NE is the standard stability metric for re-translation systems in the streaming speech-translation literature.

On the three benchmark clips, LiveLingo emits zero Normalized Erasures per 120-second clip. Whisper-large + GPT-4o-mini emits ≈22; Azure emits ≈121; Google Cloud STT v2 (latest_long) + Translate v3 emits ≈353 (≈3 erasures per second of audio).

Concrete example: how a translation system "erases" content
Source (es): "primero que nada hay muchos rumores..."

Google Cloud Translate v3 (interim emits, all retracted within 3 s):
  t=  634 ms: "first"
  t=  851 ms: "first of all"
  t= 1245 ms: "first that nothing"               ← retraction (wrong)
  t= 1453 ms: "first that there is nothing"      ← still wrong
  t= 1705 ms: "first of all there is nothing"    ← negation hallucinated
  t= 2835 ms: "First of all, there are many rumors"  ← finally stable

Azure Speech Translation (interim emits):
  t=  944 ms: "First"
  t= 4355 ms: "...rumors in the United States"   ← hallucinated location
  t= 5887 ms: "...for Venezuelans who are at the border"  ← retracts
  t= 6870 ms: flips back to "United States"      ← still unstable

LiveLingo (gated commit, monotonic):
  t= 2163 ms: "First of all"                     ← stable, never retracts
  t= 4852 ms: + "there are many rumors for Venezuelans that"
  t= 6579 ms: + "are at the border at this moment"

How low does translation latency need to be to feel real-time?

Three peer-reviewed thresholds bound the real-time-translation problem:

  • 1.0 second — Card, Moran & Newell The Psychology of Human-Computer Interaction (Erlbaum 1983) and Newell Unified Theories of Cognition (Harvard University Press 1990) established 1 second as the threshold above which a system response breaks "uninterrupted flow of thought." Restated in Card, Robertson & Mackinlay (CHI '91) and Robertson, Card & Mackinlay (Communications of the ACM 36(4), April 1993).
  • 2–3 seconds — professional simultaneous interpreters target an ear-voice span of 2–3 seconds. Lee (2002) Meta 47(4) and Chmiel et al. (2017) Applied Psycholinguistics 38(5) quantified this from professional-interpreter corpora. Pöchhacker Introducing Interpreting Studies (Routledge 2004) is the standard secondary reference.
  • 4 seconds — Karakanta et al. (2021), MT Summit showed that for live subtitles, comprehension degrades significantly beyond 4 seconds of delay.

LiveLingo's 1.5-second median Final Transcript Latency is shorter than the human-interpreter ear-voice span (2–3 s) and well below the 4-second comprehension-degradation threshold.

Why does Google Cloud's default streaming config measure 24–50 seconds?

Google Cloud Speech-to-Text v2's latest_long model — Google's documented recommendation for long-form audio in the model selection guide — emits is_final transcript events only 3–4 times across 120 seconds of continuous speech because the model is optimized for batch-style finalization, not per-utterance streaming. Since the translation chain (STT v2 + Translation v3) only commits a stable translation when STT emits is_final, the apparent latency from end-of-utterance to final translation can exceed 30 seconds. Switching to latest_short or chirp_2 commits more frequently but with different stability tradeoffs. The measured 24–50 s latency reflects the default-recommended product configuration for long-form audio, not a peculiar mis-configuration.
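A toy timeline illustrates why sparse is_final events inflate end-of-utterance latency. The sketch below is an assumption-laden illustration: the timestamps are hypothetical (not benchmark data), and the helper `latency_to_next_final` is invented here to model a chain that commits a translation only at the first final event at or after each utterance end.

```python
import bisect
import statistics


def latency_to_next_final(utterance_ends_s, final_events_s):
    """Seconds from each utterance end to the first final event at or after it.

    Returns None for an utterance that no later final event covers.
    """
    finals = sorted(final_events_s)
    latencies = []
    for t_end in utterance_ends_s:
        i = bisect.bisect_left(finals, t_end)
        latencies.append(finals[i] - t_end if i < len(finals) else None)
    return latencies


# Hypothetical timeline for one 120-s clip (illustrative only).
utterance_ends = [10.0, 22.0, 35.0, 48.0, 61.0, 75.0, 88.0, 101.0, 115.0]

# Case A: only three final events across the clip, as with batch-style finalization.
sparse_finals = [40.0, 80.0, 120.0]
# Case B: one final event 1 s after every utterance end.
dense_finals = [t + 1.0 for t in utterance_ends]

sparse = latency_to_next_final(utterance_ends, sparse_finals)
dense = latency_to_next_final(utterance_ends, dense_finals)
print("sparse finals, median latency (s):", statistics.median(sparse))
print("dense finals, median latency (s):", statistics.median(dense))
```

In this toy model the latency depends almost entirely on how often the recognizer finalizes, not on how fast translation itself runs.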

Test set

Three 120-second conversational clips drawn from Voice of America (VOA) news interviews and panel discussions. VOA productions are works of the U.S. federal government and are in the public domain (17 U.S.C. § 105), which allows redistribution and re-evaluation without licensing constraints. Audio characteristics mirror live conferencing and broadcast use rather than read-speech corpora like LibriSpeech.

zh→en · Voice of America Mandarin
es→en · Voz de América
pt→en · VOA Português

Methodology

Final Transcript Latency (TTF)

For a speaker turn ending at wall-clock time t_end, TTF is the time at which the system emits the final, non-revised translation of that turn, minus t_end. t_end is detected by an energy-based VAD on the source audio (30-ms frames, ≥500-ms silence ends an utterance), applied uniformly to all systems. This is operationally equivalent to the "Final Transcript Latency" terminology used by production STT vendors (Gladia 2024; Speechmatics 2024) and is NOT the same as Average Lagging (Ma et al. 2019, STACL), which measures per-token streaming lag during emission. We report the median of this final-emission latency, rather than time to first emission, because end-user perception of "the translation arrived" maps to the point at which displayed text stops mutating.
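The end-of-utterance detection and the TTF subtraction can be sketched as follows. This is a minimal illustration under stated assumptions: the 30-ms frame and 500-ms silence values come from the text, while the RMS threshold of 0.01, the float sample format, and the function names are assumptions made for this example.

```python
import math


def utterance_end_times(samples, sample_rate, frame_ms=30,
                        min_silence_ms=500, rms_threshold=0.01):
    """Return end-of-utterance times in seconds from an energy-based VAD.

    samples: mono audio as floats in [-1.0, 1.0].
    rms_threshold: assumed energy floor; the text does not specify one.
    """
    frame_len = sample_rate * frame_ms // 1000
    silence_needed = math.ceil(min_silence_ms / frame_ms)  # 17 frames at 30 ms
    ends, in_speech, silent_run, last_voiced_end = [], False, 0, 0.0
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        if rms >= rms_threshold:
            in_speech, silent_run = True, 0
            last_voiced_end = (start + frame_len) / sample_rate
        elif in_speech:
            silent_run += 1
            if silent_run >= silence_needed:
                # Enough silence: the utterance ended at the last voiced frame.
                ends.append(last_voiced_end)
                in_speech, silent_run = False, 0
    if in_speech:  # clip ended before 500 ms of silence accumulated
        ends.append(last_voiced_end)
    return ends


def final_transcript_latency(t_end, t_final):
    """TTF in seconds: time of the final, non-revised emit minus t_end."""
    return t_final - t_end


# Synthetic check: 900 ms "speech", 600 ms silence, 600 ms "speech", 300 ms silence.
RATE = 16_000


def tone(ms):
    return [0.5 if i % 2 == 0 else -0.5 for i in range(RATE * ms // 1000)]


def silence(ms):
    return [0.0] * (RATE * ms // 1000)


audio = tone(900) + silence(600) + tone(600) + silence(300)
ends = utterance_end_times(audio, RATE)
print("utterance ends (s):", ends)
print("TTF (s):", final_transcript_latency(ends[0], 2.418))
```

Because the same detector is applied to the source audio for every system, any bias in t_end shifts all systems equally and leaves their differences intact.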

Normalized Erasure (NE)

We adopt the Normalized Erasure metric of Arivazhagan, Cherry, Macherey & Foster (IWSLT 2020): sum of tokens deleted across all partial-translation revisions, divided by the length of the final translation in tokens. Operationally, we walk the sequence of partial translation emits in time order; for each adjacent pair (T_(i-1), T_i), every token of T_(i-1) that lies beyond the longest common prefix of T_(i-1) and T_i counts as an erasure. The published numbers are raw erasure counts per 120-s clip (the user-facing alias of NE); the normalized form follows directly from dividing by final-translation length.
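The erasure count can be sketched in a few lines. The example below is a minimal sketch, assuming whitespace tokenization after lowercasing and stripping punctuation (the page does not state the tokenizer used); the function names are illustrative. The input is the sequence of Google interim emits from the concrete example earlier on this page.

```python
import re


def tokenize(text):
    """Lowercase, strip punctuation, split on whitespace (assumed normalization)."""
    return re.sub(r"[^\w\s]", "", text.lower()).split()


def common_prefix_len(a, b):
    """Number of leading tokens shared by token lists a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def erasure_count(partials):
    """Total tokens retracted across consecutive partial translations."""
    total, prev = 0, []
    for text in partials:
        cur = tokenize(text)
        # Tokens of the previous emit beyond the shared prefix were erased.
        total += len(prev) - common_prefix_len(prev, cur)
        prev = cur
    return total


def normalized_erasure(partials):
    """Erasure count divided by the token length of the final translation."""
    final_len = len(tokenize(partials[-1])) if partials else 0
    return erasure_count(partials) / final_len if final_len else 0.0


google_emits = [
    "first",
    "first of all",
    "first that nothing",
    "first that there is nothing",
    "first of all there is nothing",
    "First of all, there are many rumors",
]
print("erasures:", erasure_count(google_emits))          # 9 under this tokenization
print("NE:", round(normalized_erasure(google_emits), 2))  # 9 / 7 tokens
```

A system that only ever appends to its displayed text shares the full previous emit as a prefix at every step, so its count stays at zero.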

System configurations

  • LiveLingo: production WebSocket endpoint wss://api.livelingo.io/ws, gpt-realtime translation backend, FIRE_THRESHOLDS gate (word.first.no_punct=5) + Soniox VAD utterance_end flush. Build 2026-05-28.
  • Google Cloud: google-cloud-speech 2.27.0 streaming v2 with model=latest_long (vendor-recommended for long-form audio); google-cloud-translate v3.16.0 text-translation called on each STT result.
  • Azure Speech Translation: azure-cognitiveservices-speech 1.40.0, TranslationRecognizer streaming endpoint.
  • Whisper + GPT-4o-mini: OpenAI whisper-1 for STT + gpt-4o-mini for translation, chunked at 5-second audio windows (a baseline reflecting what a naive developer pipeline would build).

All client→vendor traffic originated from a single workstation in us-east, real-time-paced at 100-ms audio chunks for LiveLingo (matching production webapp behaviour) and 200-ms chunks for Google and Azure (matching their SDK examples). Test date: 2026-06-04.
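Real-time pacing means each audio chunk is sent no earlier than it would have been captured live. The generator below is a sketch of that idea, assuming 16-kHz mono 16-bit PCM (the bit depth is an assumption) and a hypothetical `paced_chunks` helper; the injectable `sleep` and `clock` parameters exist only to make the pacing testable.

```python
import time


def paced_chunks(pcm, sample_rate=16_000, bytes_per_sample=2, chunk_ms=100,
                 sleep=time.sleep, clock=time.monotonic):
    """Yield fixed-duration PCM chunks, each released at its real-time offset.

    Chunk i is released at start + i * chunk_ms, so sending is scheduled
    against the start time and small delays do not accumulate as drift.
    """
    chunk_bytes = sample_rate * bytes_per_sample * chunk_ms // 1000
    start = clock()
    for i, offset in enumerate(range(0, len(pcm), chunk_bytes)):
        delay = start + i * chunk_ms / 1000 - clock()
        if delay > 0:
            sleep(delay)
        yield pcm[offset:offset + chunk_bytes]


# One second of silence; a no-op sleep keeps the demo instant.
one_second = bytes(16_000 * 2)
for chunk_ms in (100, 200):
    chunks = list(paced_chunks(one_second, chunk_ms=chunk_ms, sleep=lambda s: None))
    print(f"{chunk_ms} ms chunks: {len(chunks)} x {len(chunks[0])} bytes")
```

In a real runner the yielded chunk would be written to the vendor's streaming connection, with the default `time.sleep` left in place.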

Reproducibility

The benchmark is replicable end-to-end. The audio clips above are public-domain VOA broadcasts, downloadable by anyone via the YouTube URLs listed in the test-set table. The four runner scripts — run_livelingo.py, run_google_cloud.py, run_azure.py, run_whisper_gpt.py — accept identical CLI arguments (--in WAV --source LANG --target LANG --out JSON) and emit a uniform per-run JSON with monotonic-ns timestamps for every partial emission. Per-emit JSONs are available on request; a packaged GitHub release with pinned dependencies is planned.
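A runner with this interface might be structured as in the sketch below. The four CLI flags come from the text; everything else is an assumption made for illustration, including the `EmitLog` class, the JSON field names (`system`, `emits`, `t_ns`, `text`, `is_final`), and the choice to store timestamps relative to the start of the run.

```python
import argparse
import json
import time


def build_parser():
    """CLI shared by all runners: --in WAV --source LANG --target LANG --out JSON."""
    p = argparse.ArgumentParser(description="Benchmark runner sketch")
    # `in` is a Python keyword, so the value is stored under a different name.
    p.add_argument("--in", dest="in_wav", required=True, help="input WAV path")
    p.add_argument("--source", required=True, help="source language code")
    p.add_argument("--target", required=True, help="target language code")
    p.add_argument("--out", required=True, help="output JSON path")
    return p


class EmitLog:
    """Collects partial emissions with monotonic nanosecond timestamps."""

    def __init__(self, system, source, target, clock_ns=time.monotonic_ns):
        self._clock_ns = clock_ns
        self._t0 = clock_ns()
        # Hypothetical schema; the actual per-run JSON layout is not published.
        self.record = {"system": system, "source": source,
                       "target": target, "emits": []}

    def add(self, text, is_final=False):
        self.record["emits"].append(
            {"t_ns": self._clock_ns() - self._t0, "text": text, "is_final": is_final}
        )

    def to_json(self):
        return json.dumps(self.record, ensure_ascii=False, indent=2)


args = build_parser().parse_args(
    ["--in", "clip.wav", "--source", "es", "--target", "en", "--out", "run.json"]
)
log = EmitLog("demo-system", args.source, args.target)
log.add("First")
log.add("First of all", is_final=True)
print(log.to_json())
```

A monotonic clock is used rather than wall-clock time because it cannot jump backwards during a run, which keeps intervals between emits trustworthy.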

To verify a number on this page yourself: download a clip from the test-set YouTube link with yt-dlp, convert to 16-kHz mono PCM, and stream it to each vendor at real-time pace. Per-run cost is under $0.50 across the four systems.
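The preparation steps can be scripted. The sketch below only builds and prints the command lines rather than running them; it assumes yt-dlp and ffmpeg are installed, that 16-bit samples are wanted (the text specifies only 16-kHz mono PCM), and it uses placeholder file names and a placeholder URL, since the clip URLs are not reproduced here.

```python
import shlex


def download_cmd(url, out_stem):
    """yt-dlp command that extracts the audio track as WAV."""
    return ["yt-dlp", "-x", "--audio-format", "wav",
            "-o", f"{out_stem}.%(ext)s", url]


def convert_cmd(src, dst):
    """ffmpeg command that converts to 16-kHz mono 16-bit PCM WAV."""
    return ["ffmpeg", "-y", "-i", src,
            "-ac", "1", "-ar", "16000", "-c:a", "pcm_s16le", dst]


def runner_cmd(script, wav, source, target, out_json):
    """Invocation of one of the runner scripts with the shared CLI."""
    return ["python", script, "--in", wav,
            "--source", source, "--target", target, "--out", out_json]


CLIP_URL = "<clip URL from the test-set table>"  # placeholder, not a real URL

steps = [
    download_cmd(CLIP_URL, "clip_es"),
    convert_cmd("clip_es.wav", "clip_es_16k.wav"),
    runner_cmd("run_azure.py", "clip_es_16k.wav", "es", "en", "azure_es.json"),
]
for cmd in steps:
    print(shlex.join(cmd))
```

To execute a step instead of printing it, each list can be passed to `subprocess.run(cmd, check=True)`.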

Limitations

  1. Sample size. n = 30 utterances across 3 clips (6 minutes of audio). Sufficient for order-of-magnitude differences (1.5 s vs 27 s) but not for small differences between systems of similar quality. A larger study across MuST-C tst-COMMON is planned.
  2. Language coverage. Three pairs (zh→en, es→en, pt→en); Spanish and Portuguese benefit from high training-data volume across all systems and may not generalize to low-resource languages.
  3. Domain. News interview / panel format only. Spontaneous multi-party speech, code-switching, heavy accent, and background music are out of scope for v1.
  4. Google model selection. Google Cloud was tested with latest_long, the vendor-recommended model for long-form audio. latest_short and chirp_2 may yield different stability/latency tradeoffs and are planned for v2 of this benchmark.
  5. Single network region. All API calls made from us-east. Users in other regions may see different absolute latencies; relative ordering should be stable but is not independently verified.
  6. Vendor opacity. Closed-source systems update server-side without notice. Numbers are valid as of 2026-06-04 and will be re-run on a published schedule.
  7. Selection bias disclosure. LiveLingo conducted and published this benchmark. All raw run JSONs are available; competing vendors are invited to suggest configuration adjustments.

Citations

  1. Arivazhagan, N., Cherry, C., Macherey, W., & Foster, G. (2020). Re-translation versus Streaming for Simultaneous Translation. Proceedings of IWSLT 2020. aclanthology.org/2020.iwslt-1.27.
  2. Card, S. K., Moran, T. P., & Newell, A. (1983). The Psychology of Human-Computer Interaction. Lawrence Erlbaum Associates.
  3. Card, S. K., Robertson, G. G., & Mackinlay, J. D. (1991). The Information Visualizer: An Information Workspace. Proceedings of CHI '91, 181–186. doi.org/10.1145/108844.108874.
  4. Chmiel, A., Szarkowska, A., Koržinek, D., et al. (2017). Ear–voice span and pauses in interpreting trainees. Applied Psycholinguistics, 38(5), 1185–1208.
  5. Karakanta, A., et al. (2021). Towards the evaluation of automatic simultaneous speech translation from a communicative perspective. Proceedings of IWSLT 2021 / MT Summit. aclanthology.org/2021.iwslt-1.29.
  6. Lee, T.-H. (2002). Ear voice span in English into Korean simultaneous interpretation. Meta, 47(4), 596–606. erudit.org/en/journals/meta/2002-v47-n4-meta688/008036ar.
  7. Ma, M., Huang, L., et al. (2019). STACL: Simultaneous Translation with Implicit Anticipation and Controllable Latency. Proceedings of ACL 2019. arxiv.org/abs/1810.08398.
  8. Ma, X., Dousti, M. J., Wang, C., Gu, J., & Pino, J. (2020). SimulEval: An Evaluation Toolkit for Simultaneous Translation. Proceedings of EMNLP 2020 (Demo). arxiv.org/abs/2007.16193.
  9. Newell, A. (1990). Unified Theories of Cognition. Harvard University Press.
  10. Nielsen, J. (1993). Usability Engineering. Morgan Kaufmann / Academic Press. (See § 5.5 "Response Times".)
  11. Papi, S., Negri, M., & Turchi, M. (2022). Over-Generation Cannot Be Rewarded: Length-Adaptive Average Lagging. Proceedings of AutoSimTrans @ NAACL 2022. arxiv.org/abs/2206.05807.
  12. Pöchhacker, F. (2004). Introducing Interpreting Studies. Routledge.

Published by LiveLingo. Methodology and raw outputs available on request.
