QUICK REVIEW

[Paper Review] Better to Ask in English: Cross-Lingual Evaluation of Large Language Models for Healthcare Queries

Yiqiao Jin, Mohit Chandra|arXiv (Cornell University)|Oct 19, 2023

Artificial Intelligence in Healthcare and Education | Cited by 10

One-Line Summary

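This paper proposes XLingEval, a cross-lingual evaluation framework, and XLingHealth, a multilingual healthcare benchmark for evaluating LLMs in English, Spanish, Chinese, and Hindi, and reveals substantial differences across languages in correctness, consistency, and verifiability.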

ABSTRACT

Large language models (LLMs) are transforming the ways the general public accesses and consumes information. Their influence is particularly pronounced in pivotal sectors like healthcare, where lay individuals are increasingly appropriating LLMs as conversational agents for everyday queries. While LLMs demonstrate impressive language understanding and generation proficiencies, concerns regarding their safety remain paramount in these high-stake domains. Moreover, the development of LLMs is disproportionately focused on English. It remains unclear how these LLMs perform in the context of non-English languages, a gap that is critical for ensuring equity in the real-world use of these systems. This paper provides a framework to investigate the effectiveness of LLMs as multi-lingual dialogue systems for healthcare queries. Our empirically-derived framework XlingEval focuses on three fundamental criteria for evaluating LLM responses to naturalistic human-authored health-related questions: correctness, consistency, and verifiability. Through extensive experiments on four major global languages, including English, Spanish, Chinese, and Hindi, spanning three expert-annotated large health Q&A datasets, and through an amalgamation of algorithmic and human-evaluation strategies, we found a pronounced disparity in LLM responses across these languages, indicating a need for enhanced cross-lingual capabilities. We further propose XlingHealth, a cross-lingual benchmark for examining the multilingual capabilities of LLMs in the healthcare context. Our findings underscore the pressing need to bolster the cross-lingual capacities of these models, and to provide an equitable information ecosystem accessible to all.

Motivation and Objectives

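Promote equitable access to healthcare information by evaluating LLMs in non-English languages in a high-stakes domain.
Propose a multilingual evaluation framework (XLingEval) focused on correctness, consistency, and verifiability.
Create a multilingual healthcare benchmark (XLingHealth) spanning four widely spoken languages.
Evaluate cross-lingual performance and generalization across multiple LLMs using real-world healthcare QA datasets.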

Proposed Method

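Define three primary evaluation criteria for healthcare queries: correctness, consistency, and verifiability.
Develop XLingEval, with automated and human evaluation components, to compare LLM outputs against expert ground truth across languages.
Build XLingHealth by translating English healthcare QA datasets (HealthQA, LiveQA, MedicationQA) into Hindi, Chinese, and Spanish with the help of medical experts.
Run multilingual experiments with GPT-3.5 and MedAlpaca-30b and analyze language disparities across datasets and languages.
Apply statistical tests such as ANOVA, Tukey HSD, and t-tests to determine whether cross-lingual performance differences are significant.
Use multiple similarity metrics (n-gram, BERTScore, sentence embeddings) and topic models (LDA, HDP) to evaluate consistency at the surface, semantic, and topic levels.
Evaluate verifiability by treating the model as a detector of correct and incorrect claims across the datasets.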
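
As a rough illustration of two of the criteria listed above, the sketch below scores surface-level consistency as the mean pairwise Jaccard overlap of word n-grams across repeated responses, and scores verifiability as ordinary binary-classification metrics over correct/incorrect claim labels. This is a sketch under stated assumptions, not the paper's implementation: the function names, the choice of Jaccard overlap, and the whitespace tokenizer are all assumptions made for illustration, and whitespace tokenization in particular would not segment Chinese text properly.

```python
from itertools import combinations


def ngrams(text, n=2):
    """Return the set of word n-grams in text (whitespace tokenization, an assumption)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard overlap of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def surface_consistency(responses, n=2):
    """Mean pairwise n-gram overlap across repeated responses to one question."""
    sets = [ngrams(r, n) for r in responses]
    pairs = list(combinations(sets, 2))
    if not pairs:
        return 1.0  # a single response is trivially consistent with itself
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def verification_scores(gold, predicted):
    """Accuracy, precision, recall, and F1 for boolean claim labels.

    gold[i] is True when claim i is actually correct; predicted[i] is True
    when the model judged claim i to be correct.
    """
    tp = sum(1 for g, p in zip(gold, predicted) if g and p)
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)
    tn = sum(1 for g, p in zip(gold, predicted) if not g and not p)
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

Computing these scores separately for each language and then comparing them would be one way to surface the kind of cross-lingual gap the review describes. Semantic-level and topic-level consistency would need embedding models or topic models and are not covered by this sketch.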

Figure 1. We present XLingEval, a comprehensive framework for assessing cross-lingual behaviors of LLMs for high risk domains such as healthcare. We present XLingHealth, a cross-lingual benchmark for healthcare queries.

Experimental Results

Research Questions

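RQ1: How do LLMs perform on healthcare queries in English, Spanish, Chinese, and Hindi?
RQ2: Do correctness, consistency, and verifiability show cross-lingual disparities in healthcare Q&A?
RQ3: Can the XLingEval framework reliably detect gaps between languages and guide improvements in cross-lingual access to healthcare information?
RQ4: Can a multilingual benchmark such as XLingHealth generalize to other domains and other models?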

Key Findings

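Correctness differs markedly across the four languages: GPT-3.5 gives more incorrect answers to non-English queries than to English ones.
GPT-3.5 is 5.82 times more likely to give an incorrect answer to a non-English query than to an English one.
On some metrics, performance drops relative to English by up to 50.5% for Hindi and 28.3% for Chinese.
Verifiability is notably weaker in Chinese and Hindi, while English and Spanish score comparatively higher (HealthQA: English vs. Chinese/Hindi).
MedAlpaca-30b shows a different pattern of language disparity from GPT-3.5, highlighting that cross-lingual behavior is model-dependent.
ANOVA shows statistically significant language differences across metrics and models; English and Spanish often perform similarly, while other language pairs show larger gaps.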
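
To make the quantities behind these findings concrete, the sketch below shows how an error-rate ratio (the kind of figure behind a statement like "N times more likely to be wrong"), a relative performance drop, and a one-way ANOVA F statistic could be computed. The function names are hypothetical, the review does not state how the paper derived its figures, and any numbers passed in are the caller's own; in practice a library routine such as `scipy.stats.f_oneway` would normally replace the hand-written F statistic and would also return a p-value.

```python
def error_rate_ratio(wrong_other, total_other, wrong_en, total_en):
    """Ratio of the error rate in another language to the error rate in English."""
    return (wrong_other / total_other) / (wrong_en / total_en)


def relative_drop(score_en, score_other):
    """Fractional performance drop relative to the English score."""
    return (score_en - score_other) / score_en


def one_way_anova_f(groups):
    """F statistic of a one-way ANOVA over per-language score lists.

    groups is a list of lists, one list of scores per language.
    Assumes at least two groups and some within-group variance.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    means = [sum(g) / len(g) for g in groups]
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Variation of group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Variation of individual scores around their own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))
```

A large F value indicates that the between-language variation is large compared with the within-language variation; a post-hoc test such as Tukey HSD would then be needed to say which language pairs differ.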

Figure 2. Evaluation pipelines for correctness, consistency, and verifiability criteria in the XLingEval framework.


This review was generated by AI and checked by a human editor.