QUICK REVIEW

[Paper Review] Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning

Fuxiao Liu, Kevin Lin|arXiv (Cornell University)|Jun 26, 2023

Multimodal Machine Learning Applications | Citations: 24

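In Brief

This paper proposes LRV-Instruction, a large-scale visual instruction dataset containing both positive and negative samples, and GAVIE, a GPT-4-assisted evaluation method for measuring hallucination in large multi-modal models. It then mitigates hallucination by fine-tuning models on the robust instruction data.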

ABSTRACT

Despite the promising progress in multi-modal tasks, current large multi-modal models (LMMs) are prone to hallucinating inconsistent descriptions with respect to the associated image and human instructions. This paper addresses this issue by introducing the first large and diverse visual instruction tuning dataset, named Large-scale Robust Visual (LRV)-Instruction. Our dataset comprises 400k visual instructions generated by GPT4, covering 16 vision-and-language tasks with open-ended instructions and answers. Unlike existing studies that primarily focus on positive instruction samples, we design LRV-Instruction to include both positive and negative instructions for more robust visual instruction tuning. Our negative instructions are designed at three semantic levels: (i) Nonexistent Object Manipulation, (ii) Existent Object Manipulation and (iii) Knowledge Manipulation. To efficiently measure the hallucination generated by LMMs, we propose GPT4-Assisted Visual Instruction Evaluation (GAVIE), a stable approach to evaluate visual instruction tuning like human experts. GAVIE does not require human-annotated groundtruth answers and can adapt to diverse instruction formats. We conduct comprehensive experiments to investigate the hallucination of LMMs. Our results demonstrate existing LMMs exhibit significant hallucinations when presented with our negative instructions, particularly Existent Object and Knowledge Manipulation instructions. Moreover, we successfully mitigate hallucination by finetuning MiniGPT4 and mPLUG-Owl on LRV-Instruction while improving performance on several public datasets compared to state-of-the-art methods. Additionally, we observed that a balanced ratio of positive and negative instances in the training data leads to a more robust model. Code and data are available at https://github.com/FuxiaoLiu/LRV-Instruction.

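Motivation and Objectives

Characterize and address hallucination in large multi-modal models (LMMs) when they follow human instructions.
Build a large, diverse visual instruction dataset with positive and negative samples spanning 16 vision-and-language (VL) tasks.
Develop an evaluation framework (GAVIE) that assesses instruction-following accuracy and visual hallucination without ground-truth answers.
Demonstrate that fine-tuning LMMs on LRV-Instruction reduces hallucination and improves performance on public benchmarks.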

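Proposed Method

Build LRV-Instruction from 400k GPT-4-generated visual instructions spanning 16 VL tasks, including negative instructions at three semantic levels: Nonexistent Object Manipulation, Existent Object Manipulation, and Knowledge Manipulation.
Generate negative instructions in both declarative and interrogative formats to teach the model to avoid hallucination and say ", "Yes""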
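To make the dataset structure concrete, here is a minimal Python sketch of how one instruction record could be represented. The three negative levels come from the paper; the class names, field names, image ID, and the two example instructions are hypothetical illustrations, not the released LRV-Instruction schema.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class NegativeLevel(Enum):
    """The three semantic levels of negative instructions named in the paper."""
    NONEXISTENT_OBJECT = "Nonexistent Object Manipulation"
    EXISTENT_OBJECT = "Existent Object Manipulation"
    KNOWLEDGE = "Knowledge Manipulation"


@dataclass(frozen=True)
class VisualInstruction:
    """One record; field names are assumptions, not the released schema."""
    image_id: str
    instruction: str
    answer: str
    negative_level: Optional[NegativeLevel] = None  # None marks a positive sample

    @property
    def is_negative(self) -> bool:
        return self.negative_level is not None


# Hypothetical examples for illustration only.
samples = [
    VisualInstruction("img_001", "What color is the bus?", "The bus is red."),
    VisualInstruction(
        "img_001",
        "Describe the dog sitting next to the bus.",
        "There is no dog in the image.",
        NegativeLevel.NONEXISTENT_OBJECT,
    ),
]

# Count samples per category; positive samples are grouped under "positive".
counts = Counter(
    s.negative_level.value if s.is_negative else "positive" for s in samples
)
```

Keeping the level as an optional field, rather than using separate classes for positive and negative samples, makes it easy to filter or rebalance a mixed list later.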

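Experimental Results

Research Questions

RQ1: How do current LMMs hallucinate when given negative instructions at different semantic levels?
RQ2: Can fine-tuning LMMs on LRV-Instruction reduce visual hallucination while maintaining or improving task performance?
RQ3: Does a balanced mix of positive and negative training samples yield a more robust visual instruction-following model?
RQ4: How effective is GPT4-Assisted Visual Instruction Evaluation (GAVIE) at aligning with human judgments of model outputs when no ground-truth answers are provided?
RQ5: Do instruction-tuned models generalize to public VL benchmarks beyond the LRV-Instruction evaluation set?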

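Key Findings

Existing LMMs hallucinate markedly when faced with negative instructions, especially Existent Object Manipulation and Knowledge Manipulation instructions.
Fine-tuning MiniGPT4 and mPLUG-Owl on LRV-Instruction reduces hallucination and improves performance on several public datasets relative to state-of-the-art baselines.
A balanced ratio of positive to negative training data yields robust instruction-following behavior on both positive and negative instructions.
GAVIE provides a stable, ground-truth-free evaluation of the relevance and accuracy of model outputs that correlates with human judgments.
LRV-Instruction enables open-ended evaluation and robustness improvements beyond template-based instruction data.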
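The finding about a balanced positive/negative ratio can be illustrated with a small sampling sketch. The function name `mix_samples`, its parameters, and the default ratio of 0.5 are assumptions chosen for illustration; this review does not state the exact ratio or sampling procedure the authors used.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def mix_samples(
    positives: Sequence[T],
    negatives: Sequence[T],
    positive_ratio: float = 0.5,  # assumed default, not a value from the paper
    total: int = 1000,
    seed: int = 0,
) -> List[T]:
    """Draw `total` training samples with a fixed share of positives."""
    if not 0.0 <= positive_ratio <= 1.0:
        raise ValueError("positive_ratio must be in [0, 1]")
    n_pos = round(total * positive_ratio)
    n_neg = total - n_pos
    if n_pos > len(positives) or n_neg > len(negatives):
        raise ValueError("not enough samples to draw without replacement")
    rng = random.Random(seed)  # seeded so the mix is reproducible
    mixed = rng.sample(list(positives), n_pos) + rng.sample(list(negatives), n_neg)
    rng.shuffle(mixed)  # interleave positives and negatives
    return mixed
```

For example, `mix_samples(pos, neg, positive_ratio=0.5, total=10)` draws five items from each pool and shuffles them. Sweeping `positive_ratio` is one way to set up the kind of ratio comparison the paper describes.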

This review was generated by AI and checked by a human editor.