QUICK REVIEW

[Paper Review] Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning

Fuxiao Liu, Kevin Lin|arXiv (Cornell University)|Jun 26, 2023

Multimodal Machine Learning Applications | Cited by 24

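One-Sentence Summary

This paper presents LRV-Instruction, a large-scale visual instruction dataset containing both positive and negative samples, and GAVIE, a GPT-4-assisted evaluation method, in order to measure and mitigate hallucination in large multi-modal models by fine-tuning them on robust instruction data.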

ABSTRACT

Despite the promising progress in multi-modal tasks, current large multi-modal models (LMMs) are prone to hallucinating inconsistent descriptions with respect to the associated image and human instructions. This paper addresses this issue by introducing the first large and diverse visual instruction tuning dataset, named Large-scale Robust Visual (LRV)-Instruction. Our dataset comprises 400k visual instructions generated by GPT4, covering 16 vision-and-language tasks with open-ended instructions and answers. Unlike existing studies that primarily focus on positive instruction samples, we design LRV-Instruction to include both positive and negative instructions for more robust visual instruction tuning. Our negative instructions are designed at three semantic levels: (i) Nonexistent Object Manipulation, (ii) Existent Object Manipulation and (iii) Knowledge Manipulation. To efficiently measure the hallucination generated by LMMs, we propose GPT4-Assisted Visual Instruction Evaluation (GAVIE), a stable approach to evaluate visual instruction tuning like human experts. GAVIE does not require human-annotated groundtruth answers and can adapt to diverse instruction formats. We conduct comprehensive experiments to investigate the hallucination of LMMs. Our results demonstrate existing LMMs exhibit significant hallucinations when presented with our negative instructions, particularly Existent Object and Knowledge Manipulation instructions. Moreover, we successfully mitigate hallucination by finetuning MiniGPT4 and mPLUG-Owl on LRV-Instruction while improving performance on several public datasets compared to state-of-the-art methods. Additionally, we observed that a balanced ratio of positive and negative instances in the training data leads to a more robust model. Code and data are available at https://github.com/FuxiaoLiu/LRV-Instruction.

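Motivation and Objectives

Expose and address hallucination in large multi-modal models (LMMs) when they follow human instructions.
Create a large, diverse visual instruction dataset that covers 16 vision-and-language (VL) tasks and contains both positive and negative samples.
Develop an evaluation framework (GAVIE) that measures instruction-following accuracy and visual hallucination without ground-truth answers.
Show that fine-tuning LMMs on LRV-Instruction reduces hallucination and improves performance on public benchmarks.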
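The ground-truth-free evaluation idea behind GAVIE can be sketched as a judge prompt plus a score parser. This is only an illustrative sketch, not the authors' implementation: the function names `build_judge_prompt` and `parse_scores`, the prompt wording, the use of a text description to stand in for the image, and the 0 to 10 scale are all assumptions made here. No call to GPT-4 is shown; the judge reply is passed in as a string.

```python
import re


def build_judge_prompt(image_description, instruction, response):
    """Assemble a judge prompt (hypothetical wording, not the paper's prompt)."""
    return (
        "You are evaluating a vision-language model.\n"
        f"Image content (text description): {image_description}\n"
        f"Instruction: {instruction}\n"
        f"Model response: {response}\n"
        # The 0-10 scale and the reply format are assumptions for this sketch.
        "Rate the response from 0 to 10 on two criteria and reply exactly as:\n"
        "Relevancy: <score>\n"
        "Accuracy: <score>\n"
    )


def parse_scores(judge_reply):
    """Extract the two integer scores from the judge's reply text."""
    scores = {}
    for name in ("Relevancy", "Accuracy"):
        match = re.search(rf"{name}:\s*(\d+)", judge_reply)
        if match is None:
            raise ValueError(f"missing {name} score in judge reply")
        scores[name.lower()] = int(match.group(1))
    return scores


prompt = build_judge_prompt(
    image_description="A red bus parked on an empty street.",
    instruction="Describe the dog sitting next to the bus.",
    response="There is no dog in the image.",
)
scores = parse_scores("Relevancy: 9\nAccuracy: 10")
```

Because the judge sees only the image content, the instruction, and the response, no human-written reference answer is needed, which is the property the objective above asks for.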

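Proposed Method

Build LRV-Instruction from 400k GPT-4-generated visual instructions spanning 16 VL tasks, with negative instructions designed at three semantic levels: Nonexistent Object Manipulation, Existent Object Manipulation, and Knowledge Manipulation.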
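Generate negative instructions in both declarative and interrogative forms, teaching the model to avoid hallucination and to answer 'No' when an instruction is inconsistent with the image.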
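One way to picture the dataset is as a list of records, each tagged as positive or with one of the three negative levels. The sketch below is a hypothetical schema: the class names, field names, and example instructions are invented for illustration and do not reflect the released LRV-Instruction file format.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class NegativeLevel(Enum):
    """The three semantic levels of negative instructions named in the paper."""
    NONEXISTENT_OBJECT = "nonexistent_object_manipulation"
    EXISTENT_OBJECT = "existent_object_manipulation"
    KNOWLEDGE = "knowledge_manipulation"


@dataclass(frozen=True)
class InstructionSample:
    """Hypothetical record layout; not the dataset's actual schema."""
    image_id: str
    instruction: str
    answer: str
    negative_level: Optional[NegativeLevel] = None  # None marks a positive sample

    @property
    def is_negative(self) -> bool:
        return self.negative_level is not None


# Invented examples, one positive and one per negative level.
samples = [
    InstructionSample("img_001", "What color is the bus?", "The bus is red."),
    InstructionSample(
        "img_001",
        "Is there a dog sitting next to the bus?",  # interrogative form
        "No, there is no dog in the image.",
        NegativeLevel.NONEXISTENT_OBJECT,
    ),
    InstructionSample(
        "img_001",
        "Describe the blue bus parked on the street.",  # declarative form
        "The bus in the image is red, not blue.",
        NegativeLevel.EXISTENT_OBJECT,
    ),
    InstructionSample(
        "img_001",
        "Is the bus in the image a type of aircraft?",
        "No, a bus is a road vehicle, not an aircraft.",
        NegativeLevel.KNOWLEDGE,
    ),
]
```

Keeping the level on each record makes it straightforward to report hallucination separately per level, as the findings below do.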

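Experimental Results

Research Questions

RQ1: How do current LMMs hallucinate when given negative instructions?
RQ2: Does fine-tuning LMMs on LRV-Instruction reduce visual hallucination while maintaining or improving task performance?
RQ3: Does a balanced mix of positive and negative training samples yield a more robust visual instruction-following model?
RQ4: Without ground-truth answers, how well does GPT4-Assisted Visual Instruction Evaluation (GAVIE) align with human judgments of model outputs?
RQ5: Do the instruction-tuned models generalize to public VL benchmarks beyond the LRV-Instruction evaluation set?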

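Key Findings

Existing LMMs hallucinate significantly when given negative instructions, particularly Existent Object Manipulation and Knowledge Manipulation instructions.
Fine-tuning MiniGPT4 and mPLUG-Owl on LRV-Instruction reduces hallucination and improves performance on several public datasets compared with state-of-the-art methods.
A balanced ratio of positive and negative instances in the training data yields robust instruction-following behavior on both positive and negative instructions.
GAVIE provides a stable, ground-truth-free evaluation that correlates with human judgments of the relevancy and accuracy of model outputs.
LRV-Instruction supports open-ended evaluation and robustness gains beyond templated instruction data.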
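The balanced-ratio finding can be made concrete with a small sampling sketch. It is a minimal illustration under stated assumptions: `balanced_mix` is a hypothetical helper, and the default `negative_ratio=0.5` is simply one reading of "balanced", not a value taken from the paper.

```python
import random


def balanced_mix(positives, negatives, negative_ratio=0.5, seed=0):
    """Return a shuffled training list with the requested share of negatives.

    The result is as large as both pools allow at the given ratio;
    surplus samples from the larger pool are left out.
    """
    if not 0.0 < negative_ratio < 1.0:
        raise ValueError("negative_ratio must be strictly between 0 and 1")
    rng = random.Random(seed)
    # Largest total size that both pools can supply at this ratio.
    total = min(
        len(positives) / (1.0 - negative_ratio),
        len(negatives) / negative_ratio,
    )
    n_neg = int(total * negative_ratio)
    n_pos = int(total * (1.0 - negative_ratio))
    mixed = rng.sample(positives, n_pos) + rng.sample(negatives, n_neg)
    rng.shuffle(mixed)
    return mixed


positives = [("pos", i) for i in range(10)]
negatives = [("neg", i) for i in range(4)]
mixed = balanced_mix(positives, negatives)
```

With 10 positive and 4 negative samples, the smaller negative pool limits the mix to 8 samples, 4 of each kind. Downsampling the larger pool is only one option; oversampling the smaller pool would be an equally valid design choice.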


This review was generated by AI and reviewed by human editors.