QUICK REVIEW

[Paper Review] Detecting and Evaluating Medical Hallucinations in Large Vision Language Models

Jiawei Chen, Dingkang Yang|arXiv (Cornell University)|Jun 14, 2024

Machine Learning in Healthcare|6 citations

TL;DR

The paper introduces Med-HallMark, a medical multimodal hallucination benchmark, plus MediHall Score and MediHallDetector to detect, categorize, and evaluate hallucinations in LVLMs for medical tasks.

ABSTRACT

Large Vision Language Models (LVLMs) are increasingly integral to healthcare applications, including medical visual question answering and imaging report generation. While these models inherit the robust capabilities of foundational Large Language Models (LLMs), they also inherit susceptibility to hallucinations, a significant concern in high-stakes medical contexts where the margin for error is minimal. However, there are currently no dedicated methods or benchmarks for hallucination detection and evaluation in the medical field. To bridge this gap, we introduce Med-HallMark, the first benchmark specifically designed for hallucination detection and evaluation within the medical multimodal domain. This benchmark provides multi-tasking hallucination support, multifaceted hallucination data, and hierarchical hallucination categorization. Furthermore, we propose the MediHall Score, a new medical evaluative metric designed to assess LVLMs' hallucinations through a hierarchical scoring system that considers the severity and type of hallucination, thereby enabling a granular assessment of potential clinical impacts. We also present MediHallDetector, a novel Medical LVLM engineered for precise hallucination detection, which employs multitask training for hallucination detection. Through extensive experimental evaluations, we establish baselines for popular LVLMs using our benchmark. The findings indicate that MediHall Score provides a more nuanced understanding of hallucination impacts compared to traditional metrics and demonstrate the enhanced performance of MediHallDetector. We hope this work can significantly improve the reliability of LVLMs in medical applications. All resources of this work will be released soon.

Motivation & Objective

Advance reliable medical LVLMs by addressing hallucinations in medical visual-language outputs.
Provide a domain-specific benchmark for detection and evaluation of medical hallucinations.
Develop a hierarchical categorization and metrics to measure clinical impact of hallucinations.
Create a detector model to detect and classify hallucinations in medical LVLM outputs.
Offer baselines and insights across medical VQA and imaging report generation tasks.

Proposed method

Introduce Med-HallMark with multi-task hallucination support, multifaceted data, and hierarchical categorization.
Define a five-level medical hallucination taxonomy: Catastrophic, Critical, Attribute, Prompt-induced, Minor, plus Correct statements.
Propose MediHall Score that assigns numeric scores per hallucination type and aggregates for Med-VQA and IRG tasks.
Develop MediHallDetector, a multimodal detector built on LLaVA with a dual-layer classifier and fine-tuning on medical image-text data plus Med-HallMark data.
Use single-stage supervised fine-tuning with combined data sources to train MediHallDetector.
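
The hierarchical scoring idea behind MediHall Score can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the numeric value attached to each category, the plain averaging, and the names LEVEL_SCORES, vqa_score, and irg_score are all assumptions made for this example. The review only states that each hallucination type receives a numeric score and that scores are aggregated for Med-VQA and IRG.

```python
from statistics import mean

# Assumed, illustrative score per category: more severe hallucinations
# score lower, correct statements score highest. These values are
# placeholders for the sketch, not figures confirmed by this review.
LEVEL_SCORES = {
    "catastrophic": 0.0,
    "critical": 0.2,
    "attribute": 0.4,
    "prompt-induced": 0.6,
    "minor": 0.8,
    "correct": 1.0,
}


def vqa_score(answer_labels):
    """Average score over Med-VQA answers, one category label per answer."""
    return mean(LEVEL_SCORES[label] for label in answer_labels)


def irg_score(reports):
    """Average over reports of the mean sentence-level score in each report.

    `reports` is a list of reports; each report is a list of category
    labels, one per sentence.
    """
    return mean(
        mean(LEVEL_SCORES[label] for label in sentence_labels)
        for sentence_labels in reports
    )


if __name__ == "__main__":
    print(vqa_score(["correct", "minor", "catastrophic"]))
    print(irg_score([["correct", "correct"], ["minor", "critical"]]))
```

Scoring reports sentence by sentence, as assumed here, keeps one severe error visible in the report's score instead of letting it disappear into a single pass/fail label.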

Figure 1: Illustration of statistical information and construction content of Med-HallMark. We show separately (a) multi-task hallucination support, (b) multifaceted hallucination data, and (c) hierarchical hallucination categorization.

Experimental results

Research questions

RQ1: How can we reliably detect and categorize medical hallucinations in LVLM outputs across medical VQA and imaging report generation tasks?
RQ2: Can a domain-specific benchmark and scoring metric better reflect the clinical impact of hallucinations than traditional NLP metrics?
RQ3: Does a specialized detector improve accuracy and consistency in identifying hallucination types compared to generic LLM assessments?
RQ4: What baselines do contemporary medical LVLMs achieve on the proposed Med-HallMark benchmark, and how do MediHall Score and MediHallDetector perform relative to them?

Key findings

Med-HallMark provides a comprehensive benchmark with multi-task support, multifaceted data, and hierarchical hallucination categorization for medical LVLMs.
MediHall Score offers a nuanced, hierarchy-based evaluation of hallucinations, improving upon traditional metrics in reflecting clinical impact.
MediHallDetector outperforms GPT-3.5 and GPT-4 baselines in both detection performance and evaluation consistency, measured against human-preferred hallucination levels.
Across Med-VQA and IRG tasks, traditional metrics often fail to capture factual correctness or degree of hallucination, whereas MediHall Score aligns better with hallucination severity.
MediHallDetector achieves higher agreement with human preferences and faster inference times than the LLM-based evaluation approaches it is compared against.
Ablation studies show that mixing diverse task data in a single SFT phase yields the best performance for MediHallDetector.
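
The agreement and consistency claims above can be made concrete with two simple measures. This is a hedged sketch under stated assumptions: the review does not say how agreement was computed, so exact label matching against human labels and identical labels across repeated evaluation rounds are illustrative choices. The function names exact_match_rate and cross_round_consistency are hypothetical.

```python
def exact_match_rate(predicted, reference):
    """Fraction of samples where the evaluator's label equals the human label."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        raise ValueError("at least one sample is required")
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


def cross_round_consistency(rounds):
    """Fraction of samples given the same label in every evaluation round.

    `rounds` is a list of label lists, one list per round, aligned by sample.
    """
    if not rounds or not rounds[0]:
        raise ValueError("at least one round with one sample is required")
    n_samples = len(rounds[0])
    if any(len(labels) != n_samples for labels in rounds):
        raise ValueError("all rounds must label the same number of samples")
    # zip(*rounds) groups the labels that each sample received across rounds.
    stable = sum(len(set(sample_labels)) == 1 for sample_labels in zip(*rounds))
    return stable / n_samples


if __name__ == "__main__":
    human = ["minor", "correct", "attribute", "correct"]
    detector = ["minor", "correct", "critical", "correct"]
    print(exact_match_rate(detector, human))
```

A chance-corrected statistic such as Cohen's kappa would be a reasonable alternative when category frequencies are very uneven.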

Figure 2: Visualization of MediHallDetector related information. (a) Model structure, SFT process and inference objective of MediHallDetector. (b) Examples of questions, LVLM answers and ground truth (GT) for different types of tasks. (c) Comparison of three rounds of evaluation agreement and average inference t

This review was created by AI and reviewed by human editors.