QUICK REVIEW

[Paper Review] Detecting and Evaluating Medical Hallucinations in Large Vision Language Models

Jiawei Chen, Dingkang Yang|arXiv (Cornell University)|Jun 14, 2024

Machine Learning in Healthcare | Cited by 6

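ONE-LINE SUMMARY

This paper proposes Med-HallMark, a medical multimodal hallucination benchmark, together with the MediHall Score and MediHallDetector, to detect, classify, and evaluate hallucinations of LVLMs in medical tasks.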

ABSTRACT

Large Vision Language Models (LVLMs) are increasingly integral to healthcare applications, including medical visual question answering and imaging report generation. While these models inherit the robust capabilities of foundational Large Language Models (LLMs), they also inherit susceptibility to hallucinations-a significant concern in high-stakes medical contexts where the margin for error is minimal. However, currently, there are no dedicated methods or benchmarks for hallucination detection and evaluation in the medical field. To bridge this gap, we introduce Med-HallMark, the first benchmark specifically designed for hallucination detection and evaluation within the medical multimodal domain. This benchmark provides multi-tasking hallucination support, multifaceted hallucination data, and hierarchical hallucination categorization. Furthermore, we propose the MediHall Score, a new medical evaluative metric designed to assess LVLMs' hallucinations through a hierarchical scoring system that considers the severity and type of hallucination, thereby enabling a granular assessment of potential clinical impacts. We also present MediHallDetector, a novel Medical LVLM engineered for precise hallucination detection, which employs multitask training for hallucination detection. Through extensive experimental evaluations, we establish baselines for popular LVLMs using our benchmark. The findings indicate that MediHall Score provides a more nuanced understanding of hallucination impacts compared to traditional metrics and demonstrate the enhanced performance of MediHallDetector. We hope this work can significantly improve the reliability of LVLMs in medical applications. All resources of this work will be released soon.

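Motivation and Objectives

Advance the reliability of medical LVLMs by addressing hallucinations in medical vision-language outputs.
Provide a domain-specific benchmark for detecting and evaluating medical hallucinations.
Develop a hierarchical categorization and a metric that measure the clinical impact of hallucinations.
Build a detector model that detects and classifies hallucinations in medical LVLM outputs.
Provide baselines and insights across medical VQA and imaging report generation tasks.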

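Proposed Method

Introduces Med-HallMark, which offers multi-task hallucination support, multifaceted hallucination data, and hierarchical hallucination categorization.
Defines a five-level taxonomy of medical hallucinations (Catastrophic, Critical, Attribute, Prompt-induced, Minor), plus a separate category for Correct statements.
Proposes the MediHall Score, which assigns a numerical score to each hallucination type and aggregates these scores for the Med-VQA and IRG (imaging report generation) tasks.
Develops MediHallDetector, a multimodal detector built on LLaVA with a two-layer classifier, fine-tuned on medical image-text data and Med-HallMark data.
Trains MediHallDetector with a single-stage supervised fine-tuning (SFT) run over the combined data sources.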
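
The hierarchical scoring idea can be illustrated with a minimal Python sketch. It assumes that each Med-VQA answer, and each sentence of a generated report, has already been labeled with one of the six categories. The numeric weights, the names `Level`, `ASSUMED_WEIGHTS`, `vqa_score`, and `irg_score`, and the choice of plain averaging are illustrative assumptions, not values or formulas confirmed by this review.

```python
from enum import Enum
from statistics import mean


class Level(Enum):
    """The five hallucination levels plus the Correct category."""
    CATASTROPHIC = "catastrophic"
    CRITICAL = "critical"
    ATTRIBUTE = "attribute"
    PROMPT_INDUCED = "prompt_induced"
    MINOR = "minor"
    CORRECT = "correct"


# ASSUMPTION: evenly spaced placeholder weights, where a more severe
# hallucination earns a lower score. Not taken from this review.
ASSUMED_WEIGHTS = {
    Level.CATASTROPHIC: 0.0,
    Level.CRITICAL: 0.2,
    Level.ATTRIBUTE: 0.4,
    Level.PROMPT_INDUCED: 0.6,
    Level.MINOR: 0.8,
    Level.CORRECT: 1.0,
}


def vqa_score(answer_levels):
    """Average the weights of the labels given to Med-VQA answers."""
    return mean(ASSUMED_WEIGHTS[level] for level in answer_levels)


def irg_score(reports):
    """Average per-report scores, where each report score is the
    mean weight over that report's labeled sentences."""
    return mean(
        mean(ASSUMED_WEIGHTS[level] for level in sentences)
        for sentences in reports
    )
```

For example, `vqa_score([Level.CORRECT, Level.CATASTROPHIC])` averages one fully correct answer with one catastrophic one. Under these placeholder weights, a single severe error lowers the score more than several minor ones, which is the behavior a severity-aware metric is meant to capture.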

Figure 1: Illustration of statistical information and construction content of Med-HallMark. We show separately (a) multi-task hallucination support, (b) multifaceted hallucination data, and (c) hierarchical hallucination categorization.

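Experimental Results

Research Questions

RQ1: How can medical hallucinations in LVLM outputs be reliably detected and classified across medical VQA and imaging report generation tasks?
RQ2: Can a domain-specific benchmark and scoring metric reflect the clinical impact of hallucinations better than traditional NLP metrics?
RQ3: Does a specialized detector improve the accuracy and consistency of hallucination-type judgments compared with evaluation by general-purpose LLMs?
RQ4: What baselines do current medical LVLMs achieve on the proposed Med-HallMark benchmark, and how do the MediHall Score and MediHallDetector perform by comparison?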

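Key Findings

Med-HallMark provides a comprehensive benchmark for medical LVLMs, with multi-task support, multifaceted data, and hierarchical hallucination categorization.
The MediHall Score offers a nuanced, hierarchy-based assessment of hallucinations and improves on traditional metrics in reflecting clinical impact.
Compared with GPT-3.5 and GPT-4 baselines, MediHallDetector shows better detection performance on hallucination levels as judged by humans, and more consistent evaluations.
Across Med-VQA and IRG tasks, traditional metrics often fail to capture factuality or the degree of hallucination, whereas the MediHall Score aligns better with hallucination severity.
MediHallDetector outperforms other LLM-based evaluation methods, showing higher agreement with human preferences and relatively fast inference.
Ablation studies show that mixing diverse task data in a single SFT phase yields the best MediHallDetector performance.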
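
The consistency and human-agreement findings above can be made concrete with a small sketch. It shows one plausible way to quantify them: the share of items that receive the same label in every evaluation round, and the share of evaluator labels that match human labels. The function names `full_agreement_rate` and `match_rate` are hypothetical, and the paper's exact agreement measure may differ.

```python
def full_agreement_rate(rounds):
    """Fraction of items labeled identically in every evaluation round.

    `rounds` is a list of label lists, one list per round, all of the
    same length and in the same item order.
    """
    if not rounds or not rounds[0]:
        raise ValueError("need at least one round with at least one item")
    if len({len(labels) for labels in rounds}) != 1:
        raise ValueError("all rounds must label the same number of items")
    per_item = list(zip(*rounds))  # regroup labels item by item
    stable = sum(len(set(labels)) == 1 for labels in per_item)
    return stable / len(per_item)


def match_rate(predicted, reference):
    """Fraction of items where the evaluator label equals the human label."""
    if not reference or len(predicted) != len(reference):
        raise ValueError("inputs must be non-empty and equally long")
    hits = sum(p == r for p, r in zip(predicted, reference))
    return hits / len(reference)
```

A caller would pass, for instance, three lists of hallucination-level labels produced by the same evaluator on the same items. Exact-match rates like these are simple to read but do not correct for chance agreement, so a chance-corrected statistic may be preferable when label distributions are skewed.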

Figure 2: Visualization of MediHalldetector related information. (a) Model structure, SFT process and inference objective of MediHalldetector. (b) Examples of questions, LVLM answers and $GT$ for different types of tasks. (c) Comparison of three rounds of evaluation agreement and average inference t


This review was generated by AI and checked by a human editor.