[Paper Review] Detecting and Evaluating Medical Hallucinations in Large Vision Language Models
The paper introduces Med-HallMark, a medical multimodal hallucination benchmark, plus MediHall Score and MediHallDetector to detect, categorize, and evaluate hallucinations in LVLMs for medical tasks.
Large Vision Language Models (LVLMs) are increasingly integral to healthcare applications, including medical visual question answering and imaging report generation. While these models inherit the robust capabilities of foundational Large Language Models (LLMs), they also inherit susceptibility to hallucinations, a significant concern in high-stakes medical contexts where the margin for error is minimal. However, there are currently no dedicated methods or benchmarks for hallucination detection and evaluation in the medical field. To bridge this gap, we introduce Med-HallMark, the first benchmark specifically designed for hallucination detection and evaluation within the medical multimodal domain. This benchmark provides multi-tasking hallucination support, multifaceted hallucination data, and hierarchical hallucination categorization. Furthermore, we propose the MediHall Score, a new medical evaluative metric designed to assess LVLMs' hallucinations through a hierarchical scoring system that considers the severity and type of hallucination, thereby enabling a granular assessment of potential clinical impacts. We also present MediHallDetector, a novel Medical LVLM engineered for precise hallucination detection, which employs multitask training for hallucination detection. Through extensive experimental evaluations, we establish baselines for popular LVLMs using our benchmark. The findings indicate that MediHall Score provides a more nuanced understanding of hallucination impacts compared to traditional metrics and demonstrate the enhanced performance of MediHallDetector. We hope this work can significantly improve the reliability of LVLMs in medical applications. All resources of this work will be released soon.
Motivation & Objective
- Advance reliable medical LVLMs by addressing hallucinations in medical visual-language outputs.
- Provide a domain-specific benchmark for detection and evaluation of medical hallucinations.
- Develop a hierarchical categorization and a metric that measure the clinical impact of hallucinations.
- Build a model that detects and classifies hallucinations in medical LVLM outputs.
- Offer baselines and insights across medical VQA and imaging report generation tasks.
Proposed method
- Introduce Med-HallMark with multi-task hallucination support, multifaceted data, and hierarchical categorization.
- Define a five-level medical hallucination taxonomy: Catastrophic, Critical, Attribute, Prompt-induced, Minor, plus Correct statements.
- Propose the MediHall Score, which assigns a numeric score to each hallucination type and aggregates these scores for the Med-VQA and imaging report generation (IRG) tasks.
- Develop MediHallDetector, a multimodal detector built on LLaVA with a dual-layer classifier and fine-tuning on medical image-text data plus Med-HallMark data.
- Use single-stage supervised fine-tuning with combined data sources to train MediHallDetector.
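To make the hierarchical scoring idea concrete, here is a minimal sketch of how a severity-weighted score could be computed. It is not the paper's implementation: the numeric weights, the label strings, and the function names `vqa_score` and `irg_score` are assumptions chosen for illustration, and the aggregation (a mean per answer for Med-VQA, a mean of per-sentence means for IRG) is one plausible reading of "aggregates for Med-VQA and IRG tasks".

```python
from statistics import mean

# Illustrative weights only (an assumption, not values confirmed by this review):
# more severe hallucination types receive lower scores.
SEVERITY_SCORES = {
    "catastrophic": 0.0,
    "critical": 0.2,
    "attribute": 0.4,
    "prompt_induced": 0.6,
    "minor": 0.8,
    "correct": 1.0,
}


def vqa_score(labels):
    """Average score over a list holding one label per Med-VQA answer."""
    return mean(SEVERITY_SCORES[label] for label in labels)


def irg_score(reports):
    """Average over reports of the mean per-sentence score.

    `reports` is a list of reports; each report is a list of sentence labels.
    """
    return mean(
        mean(SEVERITY_SCORES[label] for label in sentences)
        for sentences in reports
    )


if __name__ == "__main__":
    print(vqa_score(["correct", "catastrophic"]))
    print(irg_score([["correct", "correct"], ["minor", "critical"]]))
```

Under this sketch a single catastrophic answer pulls the score down far more than a minor one, which is the property that surface-overlap metrics lack.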

Experimental results
Research questions
- RQ1: How can we reliably detect and categorize medical hallucinations in LVLM outputs across medical VQA and imaging report generation tasks?
- RQ2: Can a domain-specific benchmark and scoring metric better reflect the clinical impact of hallucinations than traditional NLP metrics?
- RQ3: Does a specialized detector improve accuracy and consistency in identifying hallucination types compared to generic LLM assessments?
- RQ4: What baselines do contemporary medical LVLMs achieve on the proposed Med-HallMark benchmark, and how do MediHall Score and MediHallDetector perform relative to them?
Key findings
- Med-HallMark provides a comprehensive benchmark with multi-task support, multifaceted data, and hierarchical hallucination categorization for medical LVLMs.
- MediHall Score offers a nuanced, hierarchy-based evaluation of hallucinations, improving upon traditional metrics in reflecting clinical impact.
- MediHallDetector demonstrates superior detection performance and evaluation consistency compared to GPT-3.5 and GPT-4 baselines, with hallucination-level judgments that better match human preferences.
- Across Med-VQA and IRG tasks, traditional metrics often fail to capture factual correctness or degree of hallucination, whereas MediHall Score aligns better with hallucination severity.
- MediHallDetector achieves higher agreement with human preferences and faster inference times than the LLM-based evaluation approaches it is compared against.
- Ablation studies show that mixing diverse task data in a single SFT phase yields the best performance for MediHallDetector.
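As a rough illustration of what "agreement with human preferences" can mean in practice, the sketch below computes the fraction of items on which a detector's hallucination label matches a human annotator's label. The function name `agreement_rate` and the label strings are hypothetical, and the paper may use a different agreement measure.

```python
def agreement_rate(predicted, human):
    """Fraction of items where the predicted label equals the human label."""
    if len(predicted) != len(human):
        raise ValueError("label lists must have the same length")
    if not human:
        raise ValueError("label lists must not be empty")
    matches = sum(p == h for p, h in zip(predicted, human))
    return matches / len(human)


if __name__ == "__main__":
    detector_labels = ["minor", "correct", "critical", "correct"]
    human_labels = ["minor", "correct", "attribute", "correct"]
    print(agreement_rate(detector_labels, human_labels))
```

Exact-match agreement treats every disagreement equally; a severity-aware variant could weight a catastrophic-versus-correct mismatch more heavily than a minor-versus-correct one.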

This review was created by AI and reviewed by human editors.