WHO Report: AI Fails All Simultaneous Interpretation Tests.

DATE

25.06.25

Conference and meeting destinations in the Dominican Republic, such as Punta Cana and Santo Domingo, have reported significant implementation and technology failures when using remote interpretation. Most larger hotels generate their own electricity, so the constant power outages and generator changeovers do not affect a conference whose services and equipment are all on-site. Remote interpretation is different: with several 20-30 second power interruptions per day, systems have to be reset and reconfigured every single time. Other implementation risks include electrical frequency fluctuations, intermittent broadband, and a lack of qualified, well-trained technicians on the ground to represent interpretation platforms whose vendors supply interpreters from 6,000 miles away. You won’t have time to fix surprises. Only organizers with a very high tolerance for the risk of failure use remote language interpretation for on-site conferences and conventions in destinations where professional interpreters are available.

It becomes a hundred times worse when organizers try to implement remote interpretation with artificial intelligence in high-stakes negotiations, business forums, diplomatic conferences and summits, and professional technical conventions. Imagine spending millions of dollars bringing hundreds of cardiovascular surgeons from all over the world to Latin America, only to have them sit in a conference room and listen all day to unintelligible garble provided by AI that, even at its best, lacks meaning, inflection, and intonation. Only organizers, professional associations, multilateral organizations, and corporations with an extremely high tolerance for reputational damage can afford this luxury.

A recent report by the World Health Organization’s (WHO) Simultaneous Interpretation Service has revealed that artificial intelligence systems used for live interpretation deliver overwhelmingly poor results. According to the findings, 98.89% of the evaluated AI simultaneous interpretations fell below the minimum required quality threshold of 75%, with an average score of just 46%. Moreover, every single interpretation reviewed contained critical errors that could potentially damage the organization’s reputation or lead to diplomatic incidents.

The report was published in May 2025. It analyzed 18 spoken interventions delivered in the six official WHO languages – Arabic, Chinese, English, French, Russian, and Spanish – during the World Health Assembly.

The speeches were assessed using the same criteria applied in European Master’s programs in conference interpretation (such as EMCI) and in official exams for international institutions. These criteria include:

Accuracy and fidelity to the original message
Natural reformulation
Terminological precision
Analytical and summarizing skills
Oral production (fluency, pronunciation, intonation)
Coherence and cohesion
Stress management and self-control
Time lag

Out of 90 AI-generated interpretations, 89 fell below the required threshold, with scores as low as 5% in some cases. Based on these results, the WHO concludes that AI interpretation platforms don’t work for meetings with external participants.
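The figures quoted above can be cross-checked with a few lines of arithmetic. This is only a sketch: the assumption that each of the 18 speeches was interpreted into the five other official languages is an inference from the numbers in this article, not something the report is quoted as stating, and the variable names are illustrative.

```python
# Figures quoted in this article
speeches = 18            # spoken interventions analyzed
official_languages = 6   # Arabic, Chinese, English, French, Russian, Spanish
below_threshold = 89     # interpretations that scored under 75%

# Assumption (not stated in the article): each speech was interpreted
# into the five official languages other than its own.
target_languages = official_languages - 1

total_interpretations = speeches * target_languages
share_below = round(below_threshold / total_interpretations * 100, 2)

print(f"Total interpretations: {total_interpretations}")
print(f"Share below threshold: {share_below}%")
```

Under that assumption the totals line up: 18 speeches give 90 interpretations, and 89 of 90 is the 98.89% cited earlier.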

If you wish to read the full WHO report, please search online for “WHO report on AI simultaneous interpretation”.

Most Common AI Interpretation Failures

The report highlights several recurring issues found in standard elements of international conference speeches:

Language Identification and Code Switching

AI systems often struggled to detect the original language or switch correctly between languages. This led to missing entire sentences or, in some cases, repeating the speaker’s words in the same language but inaccurately, causing confusion and disruption.

Speech Speed and Décalage (Interpretation Lag)

Although AI handled fast-paced speeches, it often omitted content to keep up. Additionally, the lag between the original speech and the translated output was excessive. While human interpreters usually maintain a lag of 1–5 seconds, the AI systems required delays of up to 23 seconds, breaking the flow of real-time comprehension.

Errors with Names and Place Names

AI made some of the most serious mistakes when translating proper names and geographical terms. Notable errors include:

“Brunei Darussalam” rendered as “the brown Russell”
“Haiti” interpreted as “Heidi”
“Director Moeti” mistranslated as “our African” or “Mr. Moeti”
The country name “Greece” translated as the personal name “Chris”

A particularly severe mistake occurred when a Spanish speech referred to Hamas and the AI rendered it in Arabic as “United States,” an error that could have led to a serious diplomatic crisis.

Numbers and Dates

AI frequently misinterpreted numerical data, especially large numbers or date-related references. This kind of inaccuracy is unacceptable in negotiations or conferences where data plays a critical role.

Technical Terminology

The systems repeatedly failed with specialized terms. In one case, “polio transmission” was translated from Arabic as “transport,” and “hepatitis” was rendered as “Ebola” from French to Arabic.

Voice Quality and Expression

While not formally graded, the report noted that the AI-generated voice was “extremely monotonous and lacked expressiveness,” making it difficult for listeners to stay engaged, particularly in sessions lasting longer than 30 minutes.

Conclusion

Given the range and severity of the issues identified, the WHO firmly advises against using AI for simultaneous interpretation in international settings where message integrity, precise terminology, and diplomatic sensitivity are crucial. The risk of communication breakdown and reputational harm is simply too high.

The recommendation is to limit the use of such technology strictly to internal organizational meetings, where occasional translation inaccuracies would not result in public or diplomatic consequences.
