QUICK REVIEW

[Paper Review] Evaluating Object Hallucination in Large Vision-Language Models

Yifan Li, Yifan Du | arXiv (Cornell University) | May 17, 2023

Ferroelectric and Negative Capacitance Devices | Citations: 12

One-Line Summary

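This paper systematically studies object hallucination in LVLMs and introduces POPE, a polling-based evaluation method that is more stable and scalable than earlier approaches. Using it, the authors show that LVLMs frequently hallucinate objects that are common or that co-occur with objects in the image.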

ABSTRACT

Inspired by the superior language abilities of large language models (LLM), large vision-language models (LVLM) have been recently explored by integrating powerful LLMs for improving the performance on complex multimodal tasks. Despite the promising progress on LVLMs, we find that LVLMs suffer from the hallucination problem, i.e. they tend to generate objects that are inconsistent with the target images in the descriptions. To investigate it, this work presents the first systematic study on object hallucination of LVLMs. We conduct the evaluation experiments on several representative LVLMs, and show that they mostly suffer from severe object hallucination issue. We further discuss that the visual instructions may influence the hallucination, and find that: objects that frequently occur in the visual instructions or co-occur with the image objects, are obviously prone to be hallucinated by LVLMs. Besides, we find that existing evaluation methods might be affected by the input instructions and generation styles of LVLMs. Thus, we further design an improved evaluation method for object hallucination by proposing a polling-based query method called POPE. Experiment results demonstrate that our POPE can evaluate the object hallucination in a more stable and flexible way. Our codes and data are publicly available at https://github.com/RUCAIBox/POPE.

Motivation and Objectives

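Motivates the study of object hallucination in large vision-language models (LVLMs).
Quantifies hallucination severity for representative LVLMs on MSCOCO.
Analyzes how visual instruction data affects hallucination behavior.
Proposes and validates a polling-based evaluation method (POPE) for stable hallucination evaluation.
Demonstrates POPE's scalability and reliability across datasets and segmentation-based settings.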

Proposed Method

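Repurposes the CHAIR metric to measure object hallucination in captions that LVLMs generate on MSCOCO.
Prompts five LVLMs (mPLUG-Owl, LLaVA, MultiModal-GPT, MiniGPT-4, InstructBLIP) with an image captioning task.
Introduces POPE, a polling-based probing method that recasts hallucination evaluation as Yes/No questions about whether an object is present in the image.
Builds probing sets with Random, Popular, and Adversarial sampling to test robustness against object hallucination.
Compares POPE with CHAIR and evaluates stability under different prompts and caption lengths.
Optionally extends POPE to unannotated datasets using SEEM-based segmentation and compares the results.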
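The three sampling strategies can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation (their code is in the repository linked in the abstract). It assumes Random draws absent objects uniformly, Popular picks the absent objects that are most frequent across the dataset, and Adversarial picks the absent objects that co-occur most often with the image's ground-truth objects. The function name `build_pope_questions`, the toy `annotations`, and the question template are hypothetical.

```python
import random
from collections import Counter
from itertools import permutations


def build_pope_questions(image_objects, all_annotations, strategy, k=3, seed=0):
    """Return (question, label) pairs: up to k present and k absent objects."""
    vocab = sorted({o for objs in all_annotations.values() for o in objs})
    absent = [o for o in vocab if o not in image_objects]

    if strategy == "random":
        # Uniformly sample objects that are not in the image.
        negatives = random.Random(seed).sample(absent, min(k, len(absent)))
    elif strategy == "popular":
        # Prefer absent objects that appear in the most images overall.
        freq = Counter(o for objs in all_annotations.values() for o in set(objs))
        negatives = sorted(absent, key=lambda o: (-freq[o], o))[:k]
    elif strategy == "adversarial":
        # Prefer absent objects that most often co-occur with present objects.
        cooc = Counter()
        for objs in all_annotations.values():
            for a, b in permutations(set(objs), 2):
                if a in image_objects:
                    cooc[b] += 1
        negatives = sorted(absent, key=lambda o: (-cooc[o], o))[:k]
    else:
        raise ValueError(f"unknown strategy: {strategy}")

    positives = sorted(image_objects)[:k]
    template = "Is there a {} in the image?"  # assumed phrasing; ignores a/an
    return ([(template.format(o), "yes") for o in positives]
            + [(template.format(o), "no") for o in negatives])


# Toy annotations (hypothetical): image id -> set of ground-truth objects.
annotations = {
    "img1": {"person", "dog", "frisbee"},
    "img2": {"person", "car"},
    "img3": {"person", "dog", "leash"},
    "img4": {"car", "traffic light"},
    "img5": {"dog", "leash"},
    "img6": {"car"},
}

for strategy in ("random", "popular", "adversarial"):
    print(strategy, build_pope_questions(annotations["img1"], annotations, strategy, k=1))
```

In this toy data, "car" is the most frequent absent object overall, while "leash" is the one that most often appears alongside "person" and "dog", so Popular and Adversarial select different negatives for the same image.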

Experimental Results

Research Questions

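RQ1: On MSCOCO, to what extent do existing LVLMs hallucinate objects in their captions relative to the ground-truth objects?
RQ2: How do instruction design and caption length affect hallucination measurement when CHAIR is used?
RQ3: Is the polling-based probing approach (POPE) more stable and scalable for evaluating object hallucination in LVLMs?
RQ4: Do objects that appear frequently in visual instruction data, or that co-occur with one another, induce hallucination in LVLMs?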

Key Results

Dataset	Setting	Model	Accuracy (%)	Precision (%)	Recall (%)	F1 Score (%)	Yes (%)
MSCOCO	Random	mPLUG-Owl	53.30	51.71	99.53	68.06	96.23
MSCOCO	Random	LLaVA	54.43	52.32	99.80	68.65	95.37
MSCOCO	Random	MultiModal-GPT	50.03	50.02	100.00	66.68	99.97
MSCOCO	Random	MiniGPT-4	77.83	75.38	82.67	78.86	54.83
MSCOCO	Popular	mPLUG-Owl	50.63	50.32	99.27	66.79	98.63
MSCOCO	Popular	LLaVA	52.43	51.25	99.80	67.72	97.37
MSCOCO	Popular	MultiModal-GPT	50.00	50.00	100.00	66.67	100.00
MSCOCO	Popular	MiniGPT-4	68.30	64.27	82.40	72.21	64.10
MSCOCO	Popular	InstructBLIP	—	—	—	—	—
MSCOCO	Adversarial	mPLUG-Owl	50.67	50.34	99.33	66.82	98.67
MSCOCO	Adversarial	LLaVA	50.77	50.39	99.87	66.98	99.10
MSCOCO	Adversarial	MultiModal-GPT	50.00	50.00	100.00	66.67	100.00
MSCOCO	Adversarial	MiniGPT-4	66.60	62.45	83.27	71.37	66.67
MSCOCO	Adversarial	InstructBLIP	74.37	67.67	93.33	78.45	68.97
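The columns above can be reproduced from raw Yes/No answers with a few lines of code. The sketch below assumes a balanced probing set (half of the questions have the answer "yes") and that model outputs have already been normalized to "yes" or "no"; the function name `pope_metrics` is hypothetical and not taken from the paper's code.

```python
def pope_metrics(labels, answers):
    """Compute POPE-style metrics from gold labels and model answers ("yes"/"no")."""
    pairs = list(zip(labels, answers))
    tp = sum(l == "yes" and a == "yes" for l, a in pairs)
    fp = sum(l == "no" and a == "yes" for l, a in pairs)
    tn = sum(l == "no" and a == "no" for l, a in pairs)
    fn = sum(l == "yes" and a == "no" for l, a in pairs)
    n = len(pairs)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / n,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "yes_ratio": (tp + fp) / n,  # share of questions answered "yes"
    }


# A model that answers "yes" to every question on a balanced set.
labels = ["yes", "no"] * 3
answers = ["yes"] * 6
print(pope_metrics(labels, answers))
```

A model that always answers "yes" scores 50% accuracy, 50% precision, 100% recall, and an F1 of about 66.67%, which is the pattern in the MultiModal-GPT rows. A "Yes" ratio near 100% therefore signals over-affirmation rather than genuine recognition, even when recall looks high.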

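LVLMs show a strong tendency toward object hallucination, often more so than smaller vision-language pre-trained models (VLPMs); the CHAIR results capture hallucination at both the instance level and the sentence level.
Instruction prompt design and caption length strongly affect CHAIR scores, which suggests that CHAIR is unstable as an evaluation metric.
POPE provides a more stable and flexible evaluation: Yes/No probing reduces parsing bias, and its answers are consistent with the content of the generated captions.
LVLMs tend to hallucinate objects that appear frequently in visual instruction data or that frequently co-occur with the ground-truth objects.
On MSCOCO, InstructBLIP generally performs best across the Random, Popular, and Adversarial settings, whereas LLaVA, MultiModal-GPT, and mPLUG-Owl show stronger hallucination tendencies.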
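For reference, the instance-level and sentence-level CHAIR scores mentioned above are commonly defined as the fraction of mentioned objects that are hallucinated (CHAIR_I) and the fraction of captions containing at least one hallucinated object (CHAIR_S). The sketch below is a simplified illustration: it extracts objects by plain word matching against a fixed vocabulary, whereas the original metric relies on tokenization and synonym lists. The names `mentioned_objects` and `chair` and the toy data are hypothetical.

```python
import re


def mentioned_objects(caption, vocabulary):
    """Naive extraction: vocabulary words (optionally plural) found in the caption."""
    text = caption.lower()
    return [o for o in vocabulary if re.search(rf"\b{re.escape(o)}s?\b", text)]


def chair(captions, ground_truth, vocabulary):
    """Return (CHAIR_I, CHAIR_S) for captions keyed by image id."""
    total_mentions = hallucinated_mentions = hallucinated_captions = 0
    for image_id, caption in captions.items():
        mentioned = mentioned_objects(caption, vocabulary)
        wrong = [o for o in mentioned if o not in ground_truth[image_id]]
        total_mentions += len(mentioned)
        hallucinated_mentions += len(wrong)
        hallucinated_captions += bool(wrong)
    chair_i = hallucinated_mentions / total_mentions if total_mentions else 0.0
    chair_s = hallucinated_captions / len(captions) if captions else 0.0
    return chair_i, chair_s


vocabulary = ["person", "dog", "frisbee", "car"]
captions = {
    "a": "A person throws a frisbee to a dog next to a car.",
    "b": "Two dogs run on the grass.",
}
ground_truth = {"a": {"person", "dog", "frisbee"}, "b": {"dog"}}
print(chair(captions, ground_truth, vocabulary))
```

Because both scores depend on which objects a caption happens to mention, longer captions or different prompts can change them without any change in the model's underlying perception, which is the instability that Yes/No probing is meant to avoid.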


This review was generated by AI and reviewed by a human editor.