QUICK REVIEW

[Paper Review] Analyzing and Mitigating Object Hallucination in Large Vision-Language Models

Yiyang Zhou, Chenhang Cui | arXiv (Cornell University) | Oct 1, 2023

Multimodal Machine Learning Applications | Citations: 29

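One-Line Summary

This paper introduces the LVLM Hallucination Revisor (LURE), a post-hoc method that reduces object hallucination in large vision-language models by learning to rewrite hallucinatory descriptions using factors such as co-occurrence, uncertainty, and object position. It achieves substantial improvements across six LVLMs.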

ABSTRACT

Large vision-language models (LVLMs) have shown remarkable abilities in understanding visual information with human languages. However, LVLMs still suffer from object hallucination, which is the problem of generating descriptions that include objects that do not actually exist in the images. This can negatively impact many vision-language tasks, such as visual summarization and reasoning. To address this issue, we propose a simple yet powerful algorithm, LVLM Hallucination Revisor (LURE), to post-hoc rectify object hallucination in LVLMs by reconstructing less hallucinatory descriptions. LURE is grounded in a rigorous statistical analysis of the key factors underlying object hallucination, including co-occurrence (the frequent appearance of certain objects alongside others in images), uncertainty (objects with higher uncertainty during LVLM decoding), and object position (hallucination often appears in the later part of the generated text). LURE can also be seamlessly integrated with any LVLMs. We evaluate LURE on six open-source LVLMs, achieving a 23% improvement in general object hallucination evaluation metrics over the previous best approach. In both GPT and human evaluations, LURE consistently ranks at the top. Our data and code are available at https://github.com/YiyangZhou/LURE.

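Motivation and Objectives

Motivate the need to address object hallucination in large vision-language models (LVLMs).
Identify three key factors underlying hallucination: co-occurrence, uncertainty, and object position.
Propose a lightweight post-hoc hallucination revisor (LURE) that requires no additional fine-tuning data.
Show that LURE integrates with a variety of LVLMs and yields a notable reduction in hallucination.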

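Proposed Method

Ground the approach in a statistical analysis of the causes of hallucination in LVLMs (co-occurrence, uncertainty, and position).
Construct a hallucination dataset by using GPT-3.5 to modify correct captions, inserting likely co-occurring objects and placeholders for uncertain objects and objects near the end of the description.
Train a hallucination revisor Rθ with an autoregressive loss to map hallucinatory descriptions to accurate ones.
At inference time, plug the trained revisor into any LVLM by inserting [IDK] placeholders that prompt re-evaluation.
Evaluate LURE on six open-source LVLMs using CHAIR metrics, GPT-based ranking, and human judgment.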
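The placeholder step can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `mask_uncertain_objects`, the two thresholds, the use of `-log p` as the uncertainty score, and the rule "mask an object if it is uncertain or appears late" are all assumptions made for illustration. The sketch only shows how object mentions in a generated description might be swapped for [IDK] before a revisor rewrites the text.

```python
import math

IDK = "[IDK]"  # placeholder token named in the review


def mask_uncertain_objects(tokens, probs, object_words,
                           uncertainty_threshold=0.9,
                           position_threshold=0.8):
    """Replace object tokens with [IDK] when they look unreliable.

    tokens: list of word strings from a generated description
    probs: decoding probability of each token (same length as tokens)
    object_words: set of lowercase words treated as object mentions
    Both thresholds are hypothetical values chosen for this example.
    """
    if len(tokens) != len(probs):
        raise ValueError("tokens and probs must have the same length")

    n = len(tokens)
    masked = []
    for i, (tok, p) in enumerate(zip(tokens, probs)):
        is_object = tok.lower() in object_words
        # Assumed uncertainty score: negative log-probability of the token.
        uncertainty = -math.log(max(p, 1e-12))
        # Assumed position rule: the token sits in the last part of the text.
        is_late = (i / n) >= position_threshold
        if is_object and (uncertainty >= uncertainty_threshold or is_late):
            masked.append(IDK)
        else:
            masked.append(tok)
    return masked


tokens = ["a", "dog", "sits", "near", "a", "bench", "and", "a", "frisbee"]
probs = [0.9, 0.95, 0.9, 0.9, 0.9, 0.3, 0.9, 0.9, 0.85]
print(" ".join(mask_uncertain_objects(tokens, probs, {"dog", "bench", "frisbee"})))
```

In this toy input, "bench" is masked because its probability is low and "frisbee" is masked because it appears near the end; the confident, early "dog" is kept. A trained revisor would then take the masked description together with the image and produce the rewritten description.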

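Experimental Results

Research Questions

RQ1: Can LURE reduce object hallucination in LVLMs compared with strong baselines?
RQ2: Do the identified hallucination factors (co-occurrence, uncertainty, position) contribute meaningfully to the improvement?
RQ3: Is LURE robust across different LVLM backbones and uncertainty thresholds?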

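Key Findings

LURE achieves a 23% improvement over the previous best approach on general object hallucination metrics.
LURE consistently ranks at the top in both GPT-based and human evaluations.
Ablations show that all three factors (co-occurrence, uncertainty, and position) contribute to the performance gains.
LURE outperforms a data-augmentation baseline that fine-tunes with additional data, indicating that post-hoc correction is effective.
LURE shows robustness across backbones such as MiniGPT-4, LLaMA-adapter, and mPLUG-Owl.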
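For readers unfamiliar with the CHAIR metrics used in the evaluation, the following Python sketch shows how the two standard variants are typically computed: the instance-level score is the fraction of mentioned objects that are not in the image, and the sentence-level score is the fraction of captions containing at least one such object. The function name `chair_scores` and the set-based inputs are assumptions for this example; a full implementation would also extract object mentions from text and map synonyms to a fixed category list, which is skipped here.

```python
def chair_scores(mentioned_objects, ground_truth_objects):
    """Compute simplified CHAIR scores.

    mentioned_objects: list of sets, objects named in each caption
    ground_truth_objects: list of sets, objects actually in each image
    Returns (chair_i, chair_s); lower is better for both.
    """
    if len(mentioned_objects) != len(ground_truth_objects):
        raise ValueError("inputs must have the same length")

    total_mentions = 0
    hallucinated_mentions = 0
    hallucinated_captions = 0
    for mentioned, truth in zip(mentioned_objects, ground_truth_objects):
        wrong = mentioned - truth  # objects named but absent from the image
        total_mentions += len(mentioned)
        hallucinated_mentions += len(wrong)
        if wrong:
            hallucinated_captions += 1

    # Guard against empty inputs to avoid division by zero.
    chair_i = hallucinated_mentions / total_mentions if total_mentions else 0.0
    chair_s = (hallucinated_captions / len(mentioned_objects)
               if mentioned_objects else 0.0)
    return chair_i, chair_s


captions = [{"dog", "bench", "frisbee"}, {"cat", "sofa"}]
images = [{"dog", "bench"}, {"cat", "sofa", "lamp"}]
print(chair_scores(captions, images))
```

In the toy data, one of five mentioned objects is hallucinated and one of two captions contains a hallucination, so the instance-level score is one fifth and the sentence-level score is one half.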

This review was generated by AI and checked by a human editor.