QUICK REVIEW

[Paper Review] How Much Do LLMs Hallucinate across Languages? On Realistic Multilingual Estimation of LLM Hallucination

Siraj Ul Islam, Anne Lauscher|arXiv (Cornell University)|Feb 18, 2025

Text Readability and Simplification | Cited by 4

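One-line summary

This paper trains a multilingual hallucination detection model, builds a 30-language evaluation suite (mFAVA), and estimates in-the-wild hallucination rates for 11 open-source LLMs. It finds that smaller models hallucinate more and that hallucination rates do not correlate with how well-resourced a language is.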

ABSTRACT

In the age of misinformation, hallucination - the tendency of Large Language Models (LLMs) to generate non-factual or unfaithful responses - represents the main risk for their global utility. Despite LLMs becoming increasingly multilingual, the vast majority of research on detecting and quantifying LLM hallucination are (a) English-centric and (b) focus on machine translation (MT) and summarization, tasks that are less common in realistic settings than open information seeking. In contrast, we aim to quantify the extent of LLM hallucination across languages in knowledge-intensive long-form question answering (LFQA). To this end, we train a multilingual hallucination detection model and conduct a large-scale study across 30 languages and 6 open-source LLM families. We start from an English hallucination detection dataset and rely on MT to translate-train a detection model. We also manually annotate gold data for five high-resource languages; we then demonstrate, for these languages, that the estimates of hallucination rates are similar between silver (LLM-generated) and gold test sets, validating the use of silver data for estimating hallucination rates for other languages. For the final rates estimation, we build open-domain QA dataset for 30 languages with LLM-generated prompts and Wikipedia articles as references. Our analysis shows that LLMs, in absolute terms, hallucinate more tokens in high-resource languages due to longer responses, but that the actual hallucination rates (i.e., normalized for length) seems uncorrelated with the sizes of languages' digital footprints. We also find that smaller LLMs hallucinate more, and significantly, LLMs with broader language support display higher hallucination rates.

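Motivation and objectives

Motivate the need for multilingual hallucination evaluation that goes beyond English-centric tasks.
Develop a multilingual hallucination detection (HD) model via translate-train from English data.
Build the mFAVA evaluation data, Silver across 30 languages and Gold for five of them, to validate HD performance.
Propose a protocol for estimating LLM hallucination rates across languages.
Analyze how model size, language coverage, and output length relate to hallucination rates.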

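Proposed method

Translate-train: translate the English FAVA training data into 30 languages and train a multilingual HD model on it.
Hallucination evaluation data: create mFAVA-Silver by prompting GPT-4 to inject hallucinations into answers to knowledge-seeking questions; collect human-annotated mFAVA-Gold for five high-resource languages.
Model architecture: fine-tune QLoRA adapters on top of a frozen Llama-3-8B base model (or a comparable model) to build monolingual and multilingual HD models for the Binary and Category tasks.
Evaluation: measure token-level precision and recall across languages and compare HD performance on the Silver and Gold portions.
Hallucination rate estimation: compute HR_est,l = (P_l * H_det,l) / (R_l * N_l) * 100, where H_det,l is the number of detected hallucinated tokens, N_l the total number of tokens, P_l the detector's precision, and R_l its recall for language l.
Knowledge-intensive dataset: build a 30-language corpus from Wikipedia references and LLM-generated answers, covering 11 instruction-tuned open-source LLMs.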
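The estimation step can be illustrated with a small Python sketch. It assumes binary per-token labels (1 = hallucinated) and is not the authors' code; the function names `token_precision_recall` and `estimated_hallucination_rate` are hypothetical. One way to read the formula: P_l * H_det,l approximates how many detected tokens are truly hallucinated, and dividing by R_l scales that count up to account for hallucinated tokens the detector missed.

```python
from typing import Sequence, Tuple


def token_precision_recall(gold: Sequence[int], pred: Sequence[int]) -> Tuple[float, float]:
    """Token-level precision and recall for binary labels (1 = hallucinated)."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    # Fall back to 0.0 when a denominator is empty (no predictions / no gold positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def estimated_hallucination_rate(h_det: int, n_tokens: int,
                                 precision: float, recall: float) -> float:
    """HR_est,l = (P_l * H_det,l) / (R_l * N_l) * 100, returned in percent."""
    if n_tokens <= 0:
        raise ValueError("n_tokens must be positive")
    if recall <= 0:
        raise ValueError("recall must be positive to correct for missed tokens")
    return (precision * h_det) / (recall * n_tokens) * 100


# Illustrative, made-up inputs (not values from the paper).
p, r = token_precision_recall([0, 1, 1, 0, 1, 0, 0, 0], [0, 1, 0, 0, 1, 1, 0, 0])
rate = estimated_hallucination_rate(h_det=120, n_tokens=1000, precision=0.8, recall=0.6)
```

With these made-up inputs the raw detection rate would be 12%, and the precision/recall correction moves the estimate to roughly 16%, because the assumed detector misses more hallucinated tokens than it falsely flags.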

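Experimental results

Research questions

RQ1: How well does the multilingual HD model detect hallucinations in each language?
RQ2: How do in-the-wild hallucination rates vary across languages and LLM families?
RQ3: Are Silver (GPT-4-generated) annotations a reliable proxy for Gold human annotations when estimating hallucination rates across languages?
RQ4: How do model size and language coverage affect in-the-wild hallucination rates?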

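Key findings

Averaged over the 11 LLMs, hallucination rates across languages range from 7% to 12%.
Across the families studied, smaller models hallucinate more than larger ones.
Models that claim to support more languages tend to have higher hallucination rates.
Longer outputs contain more hallucinated tokens, but the per-token hallucination rate does not correlate with length.
Multilingual HD models outperform monolingual ones, most clearly on fine-grained category detection.
For the five languages with Gold data, HR_est derived from Silver and from Gold correlate strongly (r = 0.83), which supports applying Silver-based estimates to the other languages.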
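The Silver-versus-Gold check can be sketched as a Pearson correlation over per-language estimates. This is a minimal illustration, assuming Pearson's r is the statistic reported; the helper `pearson_r` is hypothetical and the input values are placeholders, not numbers from the paper.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Placeholder HR_est values (percent) for five languages; NOT from the paper.
silver_estimates = [8.0, 10.5, 9.2, 12.1, 7.4]
gold_estimates = [7.5, 11.0, 9.0, 11.5, 8.1]
r = pearson_r(silver_estimates, gold_estimates)
```

With only five languages the coefficient rests on very few points, so it is better read as supporting evidence for the Silver data than as a precise measurement.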

This review was generated by AI and checked by a human editor.