QUICK REVIEW

[Paper Review] Fine-tuning Multimodal LLMs to Follow Zero-shot Demonstrative Instructions

Juncheng Li, Kaihang Pan|arXiv (Cornell University)|Aug 8, 2023

Multimodal Machine Learning Applications | Cited by 11

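One-Sentence Summary

Proposes VPG-C, a lightweight Visual Prompt Generator Complete module trained with a synthetic discriminative training strategy, which enables multimodal large language models to follow zero-shot demonstrative instructions, and introduces the DEMON benchmark for evaluation.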

ABSTRACT

Recent advancements in Multimodal Large Language Models (MLLMs) have been utilizing Visual Prompt Generators (VPGs) to convert visual features into tokens that LLMs can recognize. This is achieved by training the VPGs on millions of image-caption pairs, where the VPG-generated tokens of images are fed into a frozen LLM to generate the corresponding captions. However, this image-captioning based training objective inherently biases the VPG to concentrate solely on the primary visual contents sufficient for caption generation, often neglecting other visual details. This shortcoming results in MLLMs' underperformance in comprehending demonstrative instructions consisting of multiple, interleaved, and multimodal instructions that demonstrate the required context to complete a task. To address this issue, we introduce a generic and lightweight Visual Prompt Generator Complete module (VPG-C), which can infer and complete the missing details essential for comprehending demonstrative instructions. Further, we propose a synthetic discriminative training strategy to fine-tune VPG-C, eliminating the need for supervised demonstrative instructions. As for evaluation, we build DEMON, a comprehensive benchmark for demonstrative instruction understanding. Synthetically trained with the proposed strategy, VPG-C achieves significantly stronger zero-shot performance across all tasks of DEMON. Further evaluation on the MME and OwlEval benchmarks also demonstrates the superiority of VPG-C. Our benchmark, code, and pre-trained models are available at https://github.com/DCDmllm/Cheetah.

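Motivation and Objectives

Establish the need for models to understand interleaved multimodal demonstrations beyond an image's primary content.
Introduce VPG-C, a lightweight, generic module that infers and completes the missing visual details needed to comprehend demonstrative instructions.
Develop a synthetic discriminative training strategy that requires no supervised demonstrative instruction data.
Build and release DEMON, a comprehensive benchmark for evaluating demonstrative instruction understanding in multimodal language models.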

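Proposed Method

Uses a frozen LLM (Vicuna-7B) and a visual encoder (EVA-CLIP) with a Q-Former as the base VPG.
VPG-C derives instruction-specific guidance from intermediate LLM outputs and generates residual visual prompts.
The residual prompts are merged back through a skip connection to enrich the multimodal representation.
Only VPG-C's parameters (0.09% of the model) are trained, using synthetic discriminative training.
Synthetic training edits the image regions that the cross-attention maps show are ignored, creating synthetic image pairs, and trains the model to describe the differences between them.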
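
The data flow in the list above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the function names, the single projection matrix `w_guide`, the shared feature dimension, and the choice to add the residual to every visual prompt token are all assumptions made for illustration. The second helper assumes that an "ignored" region is simply the patch receiving the least total cross-attention.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def complete_visual_prompts(visual_prompts, llm_hidden, patch_features, w_guide):
    """Hypothetical VPG-C-style step: add an instruction-conditioned residual.

    visual_prompts: prompt tokens from the base VPG, each a list of floats
    llm_hidden:     one intermediate LLM hidden state (assumed already pooled)
    patch_features: visual encoder patch features, same dimension (assumed)
    w_guide:        assumed trainable projection producing the guidance vector
    """
    # 1. Derive instruction-specific guidance from the intermediate LLM state.
    guidance = matvec(w_guide, llm_hidden)
    # 2. Use the guidance as a query to attend over the patch features.
    scale = math.sqrt(len(guidance))
    scores = [
        sum(g * p for g, p in zip(guidance, patch)) / scale
        for patch in patch_features
    ]
    weights = softmax(scores)
    # 3. The residual visual prompt is the attention-weighted patch summary.
    dim = len(patch_features[0])
    residual = [
        sum(w * patch[i] for w, patch in zip(weights, patch_features))
        for i in range(dim)
    ]
    # 4. Skip connection: add the residual back onto each prompt token.
    return [[v + r for v, r in zip(token, residual)] for token in visual_prompts]


def least_attended_patch(cross_attention):
    """Return the index of the patch with the lowest total attention.

    cross_attention: one row per query token, one column per image patch.
    Treating the minimum as the "ignored" region is an assumption.
    """
    totals = [sum(column) for column in zip(*cross_attention)]
    return min(range(len(totals)), key=totals.__getitem__)
```

In this sketch only `w_guide` would be trainable, which mirrors the idea of tuning a small module while the LLM and the base VPG stay frozen. The patch returned by `least_attended_patch` is the kind of region a synthetic edit would target before the model is asked to describe how the edited image differs from the original.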

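Experimental Results

Research Questions

RQ1: Can VPG-C achieve zero-shot understanding of demonstrative, interleaved multimodal instructions without labeled demonstrative data?
RQ2: Does synthetic discriminative training of VPG-C improve the handling of missing visual details compared with a conventional VPG?
RQ3: How does VPG-C perform on existing multimodal benchmarks (MME, OwlEval) and on the newly introduced DEMON benchmark?
RQ4: Where in the LLM/VPG pipeline should guidance and residual details be injected for best performance?

Key Findings

VPG-C consistently outperforms existing multimodal language models across the DEMON task categories.
Training the VPG-C module on synthetic data yields significant gains over training on image-caption data alone.
VPG-C achieves these gains while fine-tuning only a lightweight 6.3M-parameter module.
Zero-shot evaluation on additional benchmarks such as MME and OwlEval also shows strong performance.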
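
As a quick consistency check on the two size figures quoted in this review (a 6.3M-parameter trainable module, and a trainable share of 0.09% of the model), the implied total model size can be computed directly. The helper name is hypothetical, and the review does not say exactly which components the 0.09% is measured against, so the result is only a rough estimate.

```python
def implied_total_parameters(trained_parameters, trained_fraction):
    """Total parameter count implied by a trainable count and its share."""
    return trained_parameters / trained_fraction


# 6.3M trainable parameters at a 0.09% share of the whole model.
total = implied_total_parameters(6.3e6, 0.0009)
print(f"{total / 1e9:.1f}B parameters")  # prints "7.0B parameters"
```

The result of roughly 7B parameters is in line with a pipeline built around a 7B-parameter LLM such as Vicuna-7B.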


This review was generated by AI and checked by human editors.