QUICK REVIEW

[Paper Review] Quantifying Hallucinations in Language Models on Medical Textbooks

Brandon Colelough, Davis Bartels|arXiv (Cornell University)|Feb 12, 2026

Topic Modeling | Citations: 0

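One-Sentence Summary

The paper introduces a text-grounded benchmark that measures hallucinations in medical QA by tying model outputs to statements in authoritative textbooks, then evaluates hallucination rates across models and against clinician preferences.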

ABSTRACT

Hallucinations, the tendency for large language models to provide responses with factually incorrect and unsupported claims, is a serious problem within natural language processing for which we do not yet have an effective solution to mitigate against. Existing benchmarks for medical QA rarely evaluate this behavior against a fixed evidence source. We ask how often hallucinations occur on textbook-grounded QA and how responses to medical QA prompts vary across models. We conduct two experiments: the first experiment to determine the prevalence of hallucinations for a prominent open source large language model (LLaMA-70B-Instruct) in medical QA given novel prompts, and the second experiment to determine the prevalence of hallucinations and clinician preference to model responses. We observed, in experiment one, with the passages provided, LLaMA-70B-Instruct hallucinated in 19.7% of answers (95% CI 18.6 to 20.7) even though 98.8% of prompt responses received maximal plausibility, and observed in experiment two, across models, lower hallucination rates aligned with higher usefulness scores (ρ = -0.71, p = 0.058). Clinicians produced high agreement (quadratic weighted κ = 0.92) and (τ_b = 0.06 to 0.18, κ = 0.57 to 0.61) for experiments 1 and 2, respectively.

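Motivation and Objectives

Motivate and quantify hallucinations in medical QA using textbook-grounded prompts linked to authoritative excerpts.
Develop and deploy a contamination-resistant benchmark (NameAnonymized) with expert validation.
Measure the baseline hallucination rate of LLaMA-70B-Instruct on novel prompts.
Evaluate eight models on hallucination frequency and clinician-rated usefulness.
Analyze how well model hallucination rates align with clinician preferences.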

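Proposed Method

Build a corpus from published medical textbooks that satisfy four quality heuristics.
Automatically generate diverse QA pairs in seven formats using LLaMA-70B-Instruct.
Have medically trained annotators manually verify question-answer pairs against their source passages.
Apply the benchmark zero-shot to eight language models and collect clinician rankings.
Compute hallucination rate, plausibility, answerability, and inter-annotator agreement.
Analyze the correlation between hallucination frequency and clinician-rated usefulness.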
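
The abstract reports inter-annotator agreement as a quadratic weighted κ. As a minimal sketch of how such a score can be computed for two annotators on an ordinal scale, the Python below implements Cohen's kappa with quadratic weights. The function name and the toy ratings are hypothetical, chosen for illustration only, and this is not the authors' evaluation code.

```python
def quadratic_weighted_kappa(rater_a, rater_b, n_categories):
    """Cohen's kappa with quadratic weights for ordinal labels 0..n_categories-1."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    if n_categories < 2:
        raise ValueError("need at least two categories")

    n = len(rater_a)
    # Observed confusion matrix between the two raters.
    observed = [[0] * n_categories for _ in range(n_categories)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1

    row_totals = [sum(row) for row in observed]
    col_totals = [sum(observed[i][j] for i in range(n_categories))
                  for j in range(n_categories)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n_categories):
        for j in range(n_categories):
            # Quadratic penalty: larger disagreements cost more.
            weight = (i - j) ** 2 / (n_categories - 1) ** 2
            expected = row_totals[i] * col_totals[j] / n
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected

    if weighted_expected == 0:
        raise ValueError("kappa is undefined when all ratings fall in one category")
    return 1.0 - weighted_observed / weighted_expected


# Hypothetical ratings on a 3-point scale (not data from the paper).
annotator_1 = [0, 1, 2, 1, 2, 2]
annotator_2 = [0, 1, 2, 2, 2, 1]
print(round(quadratic_weighted_kappa(annotator_1, annotator_2, 3), 3))
```

Quadratic weights penalize a two-step disagreement four times as heavily as a one-step disagreement, which suits graded scales such as plausibility ratings.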

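Experimental Results

Research Questions

RQ1: How often does LLaMA-70B-Instruct hallucinate on textbook-derived medical QA prompts?
RQ2: How do hallucination rates vary across models, and how severe are the hallucinations?
RQ3: How do clinicians rank model responses, and how do these rankings align with hallucination metrics?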

Key Findings

Experiment	Metric	Value	Notes
Experiment 1	Hallucination rate	19.7% (95% CI 18.6% to 20.7%)	LLaMA-70B-Instruct baseline
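
A rate with a 95% confidence interval, like the one in the table, is typically derived from a count of hallucinated answers over the total number of answers. The review does not state which interval method the authors used, so the sketch below assumes a normal-approximation (Wald) interval. The function name and the counts are hypothetical and are not taken from the paper.

```python
import math


def hallucination_rate_ci(n_hallucinated, n_total, z=1.96):
    """Return (rate, lower, upper) using a normal-approximation (Wald) interval.

    z=1.96 corresponds to a two-sided 95% interval.
    """
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    if not 0 <= n_hallucinated <= n_total:
        raise ValueError("n_hallucinated must lie between 0 and n_total")
    rate = n_hallucinated / n_total
    half_width = z * math.sqrt(rate * (1.0 - rate) / n_total)
    # Clamp to [0, 1] because the Wald interval can overshoot near the edges.
    return rate, max(0.0, rate - half_width), min(1.0, rate + half_width)


# Hypothetical counts for illustration only.
rate, lower, upper = hallucination_rate_ci(n_hallucinated=197, n_total=1000)
print(f"{rate:.1%} (95% CI {lower:.1%} to {upper:.1%})")
```

The interval narrows as the number of evaluated answers grows, so a tight interval like the one reported suggests a large evaluation set.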

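Baseline hallucination rate for LLaMA-70B-Instruct: 19.7% (95% CI 18.6% to 20.7%).
Plausibility stayed high despite the hallucinations: 98.8% of responses received the maximum plausibility rating.
Across models, lower hallucination rates were associated with higher usefulness scores (ρ = -0.71, p = 0.058); the p-value falls just short of the conventional 0.05 threshold.
Annotator agreement was generally high: plausibility κ ≈ 0.92, relevance κ ≈ 0.94.
Larger models generally hallucinated less (27.1% at 1B versus 9.3% at 70B), but hallucinations and poor answers still occurred in every model.
Clinician rankings showed no strong systematic bias across models (Kendall's τ of roughly 0.06 to 0.18 across the eight models).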
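
The reported ρ is a rank correlation between per-model hallucination rates and usefulness scores. As a minimal sketch, assuming Spearman's ρ computed as the Pearson correlation of tie-averaged ranks, the Python below shows the calculation. The helper names and the four data points are hypothetical and are not the paper's measurements.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks of x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x = sum(rx) / len(rx)
    mean_y = sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one variable is constant")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-model values for illustration only.
hallucination_rates = [0.27, 0.20, 0.15, 0.09]
usefulness_scores = [2.1, 2.8, 2.6, 3.5]
print(round(spearman_rho(hallucination_rates, usefulness_scores), 2))
```

A negative ρ means models that hallucinate less tend to be rated more useful. With only eight models, even a strong correlation can miss conventional significance, which is consistent with the reported p = 0.058.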

This review was generated by AI and checked by a human editor.