QUICK REVIEW

[Paper Review] Evaluating Object Hallucination in Large Vision-Language Models

Yifan Li, Yifan Du|arXiv (Cornell University)|May 17, 2023

Cited by 12

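One-sentence summary

This paper presents a systematic study of object hallucination in LVLMs and proposes POPE, a polling-based evaluation method that is more stable and scalable than existing methods; the results show that LVLMs often hallucinate frequently occurring and co-occurring objects.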

ABSTRACT

Inspired by the superior language abilities of large language models (LLM), large vision-language models (LVLM) have been recently explored by integrating powerful LLMs for improving the performance on complex multimodal tasks. Despite the promising progress on LVLMs, we find that LVLMs suffer from the hallucination problem, i.e. they tend to generate objects that are inconsistent with the target images in the descriptions. To investigate it, this work presents the first systematic study on object hallucination of LVLMs. We conduct the evaluation experiments on several representative LVLMs, and show that they mostly suffer from severe object hallucination issue. We further discuss that the visual instructions may influence the hallucination, and find that: objects that frequently occur in the visual instructions or co-occur with the image objects, are obviously prone to be hallucinated by LVLMs. Besides, we find that existing evaluation methods might be affected by the input instructions and generation styles of LVLMs. Thus, we further design an improved evaluation method for object hallucination by proposing a polling-based query method called POPE. Experiment results demonstrate that our POPE can evaluate the object hallucination in a more stable and flexible way. Our codes and data are publicly available at https://github.com/RUCAIBox/POPE.

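Motivation and objectives

Motivate the study of object hallucination in large vision-language models (LVLMs).
Quantitatively evaluate object hallucination in representative LVLMs on MSCOCO.
Analyze how visual instruction data affects hallucination behavior.
Propose and validate a polling-based evaluation method (POPE) for stable hallucination evaluation.
Demonstrate POPE's scalability and robustness across datasets and in segmentation-based settings.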

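Proposed method

Adopt the CHAIR metric to measure object hallucination in MSCOCO captions generated by LVLMs.
Prompt five LVLMs (mPLUG-Owl, LLaVA, MultiModal-GPT, MiniGPT-4, InstructBLIP) on an image captioning task.
Introduce POPE: polling-based probing that recasts hallucination evaluation as yes/no questions about whether an object is present.
Build probing sets with random, popular, and adversarial sampling to test robustness against object hallucination.
Compare POPE with CHAIR and assess stability across different prompts and caption lengths.
Optionally extend POPE to unannotated datasets through SEEM-based segmentation and compare the results.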
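
The three sampling strategies listed above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the toy annotations, the tie-breaking rule, and the question template are all assumptions made for the example. It assumes that "popular" ranks absent objects by how many images contain them, and that "adversarial" ranks them by how often they co-occur with the objects that are actually in the image.

```python
import random
from collections import Counter


def sample_negatives(present, annotations, strategy, k, rng):
    """Pick up to k objects that are absent from the image (hypothetical helper)."""
    present = set(present)
    vocab = sorted({obj for objs in annotations for obj in objs})
    absent = [obj for obj in vocab if obj not in present]
    if strategy == "random":
        return rng.sample(absent, min(k, len(absent)))
    freq = Counter()
    if strategy == "popular":
        # Count how many images contain each object.
        for objs in annotations:
            freq.update(set(objs))
    elif strategy == "adversarial":
        # Count co-occurrences between absent objects and present objects.
        for objs in annotations:
            objs = set(objs)
            shared = len(objs & present)
            for obj in objs - present:
                freq[obj] += shared
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    # Highest count first; ties broken alphabetically (an assumption).
    return sorted(absent, key=lambda obj: (-freq[obj], obj))[:k]


def build_probes(present, annotations, strategy="random", k=2, seed=0):
    """Return (question, answer) pairs with as many "no" as "yes" probes."""
    rng = random.Random(seed)
    candidates = sorted(set(present))
    positives = rng.sample(candidates, min(k, len(candidates)))
    negatives = sample_negatives(present, annotations, strategy, len(positives), rng)
    template = "Is there a {} in the image?"  # assumed wording
    probes = [(template.format(obj), "yes") for obj in positives]
    probes += [(template.format(obj), "no") for obj in negatives]
    return probes


# Toy object annotations, one list per image (made up for illustration).
toy_annotations = [
    ["person", "car"],
    ["person", "car"],
    ["person", "bus"],
    ["dog", "frisbee"],
    ["dog", "frisbee", "bench"],
    ["dog", "bench"],
]
toy_probes = build_probes(["dog", "frisbee"], toy_annotations, "adversarial")
```

On the toy data, "popular" would pick "person" first because it appears in the most images, while "adversarial" would pick "bench" first because it co-occurs with "dog" and "frisbee". That difference is what makes the adversarial probes the harder ones.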

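Experimental results

Research questions

RQ1: How much do existing LVLMs hallucinate objects in captions relative to the ground-truth objects in MSCOCO?
RQ2: With CHAIR, how do instruction design and caption length affect the hallucination measurement?
RQ3: Is the polling-based probing method (POPE) more stable and scalable for evaluating object hallucination in LVLMs?
RQ4: Are objects that frequently appear or co-occur in visual instruction data more likely to be hallucinated by LVLMs?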

Key findings

Dataset	Setting	Model	Accuracy	Precision	Recall	F1 Score	Yes (%)
MSCOCO	Random	mPLUG-Owl	53.30	51.71	99.53	68.06	96.23
MSCOCO	Random	LLaVA	54.43	52.32	99.80	68.65	95.37
MSCOCO	Random	MultiModal-GPT	50.03	50.02	100.00	66.68	99.97
MSCOCO	Random	MiniGPT-4	77.83	75.38	82.67	78.86	54.83
MSCOCO	Popular	mPLUG-Owl	50.63	50.32	99.27	66.79	98.63
MSCOCO	Popular	LLaVA	52.43	51.25	99.80	67.72	97.37
MSCOCO	Popular	MultiModal-GPT	50.00	50.00	100.00	66.67	100.00
MSCOCO	Popular	MiniGPT-4	68.30	64.27	82.40	72.21	64.10
MSCOCO	Popular	InstructBLIP	—	—	—	—	—
MSCOCO	Adversarial	mPLUG-Owl	50.67	50.34	99.33	66.82	98.67
MSCOCO	Adversarial	LLaVA	50.77	50.39	99.87	66.98	99.10
MSCOCO	Adversarial	MultiModal-GPT	50.00	50.00	100.00	66.67	100.00
MSCOCO	Adversarial	MiniGPT-4	66.60	62.45	83.27	71.37	66.67
MSCOCO	Adversarial	InstructBLIP	74.37	67.67	93.33	78.45	68.97
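
The columns in the table above can be computed from a model's yes/no answers roughly as in the sketch below. It assumes "yes" is the positive class and that the "Yes (%)" column is the share of questions answered "yes"; the function name and the returned keys are illustrative, and values are returned as fractions rather than percentages.

```python
def pope_metrics(answers, labels):
    """Score yes/no answers against ground-truth labels (both lists of "yes"/"no")."""
    pairs = list(zip(answers, labels))
    tp = sum(a == "yes" and l == "yes" for a, l in pairs)  # object present, model says yes
    fp = sum(a == "yes" and l == "no" for a, l in pairs)   # hallucinated object
    tn = sum(a == "no" and l == "no" for a, l in pairs)
    fn = sum(a == "no" and l == "yes" for a, l in pairs)
    total = len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "yes_ratio": (tp + fp) / total if total else 0.0,
    }


# A model that answers "yes" to every question on a balanced probing set.
labels = ["yes"] * 3 + ["no"] * 3
always_yes = pope_metrics(["yes"] * 6, labels)
```

On a probing set with equal numbers of "yes" and "no" questions, a model that always answers "yes" gets accuracy 0.5, precision 0.5, recall 1.0, and F1 of about 0.667. This matches the MultiModal-GPT rows for the Popular and Adversarial settings, so a high recall paired with a yes ratio near 100% indicates over-affirmation rather than good object recognition.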

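LVLMs show strong object hallucination, often more pronounced than in smaller VLPMs; CHAIR results show high hallucination at both the instance level and the sentence level.
Instruction prompt design and caption length significantly affect CHAIR scores, indicating that CHAIR is unstable as an evaluation metric.
POPE provides a more stable and flexible evaluation: yes/no probing reduces interpretation bias and aligns with caption content.
LVLMs tend to hallucinate objects that appear frequently in the visual instruction data or that frequently co-occur with objects actually present in the image.
Across the Random, Popular, and Adversarial settings on MSCOCO, InstructBLIP generally performs best, while LLaVA, MultiModal-GPT, and mPLUG-Owl show a stronger tendency to hallucinate.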
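
For reference, the instance-level and sentence-level CHAIR scores mentioned in the findings above are two simple ratios. The sketch below assumes the objects mentioned in each caption have already been extracted and mapped to the dataset's category names; that extraction step is omitted, and the function name and toy inputs are illustrative.

```python
def chair_scores(mentioned, ground_truth):
    """Return (CHAIR_I, CHAIR_S) for a batch of captions.

    mentioned: one list of mentioned object names per caption.
    ground_truth: one set of objects actually in the image per caption.
    """
    total_mentions = 0
    hallucinated_mentions = 0
    hallucinated_captions = 0
    for objs, truth in zip(mentioned, ground_truth):
        wrong = [obj for obj in objs if obj not in truth]
        total_mentions += len(objs)
        hallucinated_mentions += len(wrong)
        hallucinated_captions += bool(wrong)  # caption counts once if any object is wrong
    # Instance level: hallucinated mentions over all object mentions.
    chair_i = hallucinated_mentions / total_mentions if total_mentions else 0.0
    # Sentence level: captions with any hallucinated object over all captions.
    chair_s = hallucinated_captions / len(mentioned) if mentioned else 0.0
    return chair_i, chair_s


# Two toy captions: the second mentions a "bench" that is not in its image.
toy_scores = chair_scores(
    [["dog", "frisbee"], ["person", "car", "bench"]],
    [{"dog", "frisbee"}, {"person", "car"}],
)
```

Both ratios depend on how many objects a caption mentions, which is one plausible reason why prompt wording and caption length shift CHAIR scores.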


This review was generated by AI and reviewed by human editors.