QUICK REVIEW

[Paper Review] Fine-tuning Multimodal LLMs to Follow Zero-shot Demonstrative Instructions

Juncheng Li, Kaihang Pan|arXiv (Cornell University)|Aug 8, 2023

Multimodal Machine Learning Applications | Cited by 11

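One-Sentence Summary

Proposes a lightweight Visual Prompt Generator Complete (VPG-C) module that enables multimodal LLMs to follow zero-shot demonstrative instructions, trains it with a synthetic discriminative training strategy, and introduces the DEMON benchmark for evaluation.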

ABSTRACT

Recent advancements in Multimodal Large Language Models (MLLMs) have been utilizing Visual Prompt Generators (VPGs) to convert visual features into tokens that LLMs can recognize. This is achieved by training the VPGs on millions of image-caption pairs, where the VPG-generated tokens of images are fed into a frozen LLM to generate the corresponding captions. However, this image-captioning based training objective inherently biases the VPG to concentrate solely on the primary visual contents sufficient for caption generation, often neglecting other visual details. This shortcoming results in MLLMs' underperformance in comprehending demonstrative instructions consisting of multiple, interleaved, and multimodal instructions that demonstrate the required context to complete a task. To address this issue, we introduce a generic and lightweight Visual Prompt Generator Complete module (VPG-C), which can infer and complete the missing details essential for comprehending demonstrative instructions. Further, we propose a synthetic discriminative training strategy to fine-tune VPG-C, eliminating the need for supervised demonstrative instructions. As for evaluation, we build DEMON, a comprehensive benchmark for demonstrative instruction understanding. Synthetically trained with the proposed strategy, VPG-C achieves significantly stronger zero-shot performance across all tasks of DEMON. Further evaluation on the MME and OwlEval benchmarks also demonstrate the superiority of VPG-C. Our benchmark, code, and pre-trained models are available at https://github.com/DCDmllm/Cheetah.

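Motivation and Objectives

Motivate the need for models to understand interleaved multimodal demonstrations beyond the primary visual content.
Introduce a lightweight, generic VPG-C module that infers and completes the visual details missing for understanding demonstrative instructions.
Develop a synthetic discriminative training strategy that requires no supervised demonstrative instruction data.
Build and release DEMON, a comprehensive benchmark for evaluating demonstrative instruction understanding in MLLMs.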

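Proposed Method

Uses a frozen LLM (Vicuna-7B) and a vision encoder (EVA-CLIP) as the backbone, with a Q-Former-based VPG.
VPG-C derives instruction-specific guidance from intermediate LLM outputs and generates residual visual prompts.
The residual prompts are recombined through a skip connection, enriching the multimodal representation.
Only the VPG-C parameters (0.09% of the model) are trained, using synthetic discriminative training.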
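Synthetic training edits the image regions that the cross-attention maps show are ignored, creating synthetic image pairs, and trains the model to describe the differences.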
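
The residual-prompt and skip-connection steps listed above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the attention-pooling step, the function names `residual_visual_prompt` and `complete_prompts`, the projection matrices `W_q` and `W_v`, and the toy dimensions are all hypothetical stand-ins chosen for the example.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [dot(row, v) for row in W]


def residual_visual_prompt(guidance, patch_features, W_q, W_v):
    """Attention-pool patch features, using the instruction guidance as the query.

    ASSUMPTION: single-query dot-product attention is a stand-in for however
    the real module attends to visual features.
    """
    query = matvec(W_q, guidance)  # project guidance into the patch feature space
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, p) / scale for p in patch_features])
    dim = len(patch_features[0])
    pooled = [sum(w * p[i] for w, p in zip(weights, patch_features)) for i in range(dim)]
    return matvec(W_v, pooled), weights  # project into the visual prompt space


def complete_prompts(visual_prompts, residual):
    """Skip connection: add the residual to each original visual prompt token."""
    return [[x + r for x, r in zip(token, residual)] for token in visual_prompts]


# Toy sizes (hypothetical): LLM hidden size, patch feature size, prompt size.
random.seed(0)
d_h, d_v, d_p = 4, 3, 2


def rand_vec(n):
    return [random.gauss(0.0, 1.0) for _ in range(n)]


guidance = rand_vec(d_h)                        # stands in for an intermediate LLM output
patch_features = [rand_vec(d_v) for _ in range(5)]
visual_prompts = [rand_vec(d_p) for _ in range(2)]  # stands in for the base VPG output
W_q = [rand_vec(d_h) for _ in range(d_v)]       # d_v x d_h
W_v = [rand_vec(d_v) for _ in range(d_p)]       # d_p x d_v

residual, attn_weights = residual_visual_prompt(guidance, patch_features, W_q, W_v)
completed = complete_prompts(visual_prompts, residual)
print(len(completed), len(completed[0]))
```

In this sketch, setting `W_v` to all zeros makes the residual vanish, so the completed prompts equal the originals. That is one way to see that the skip connection leaves the base VPG output intact and only adds to it.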

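Experimental Results

Research Questions

RQ1: Can VPG-C understand demonstrative, interleaved multimodal instructions zero-shot, without labeled demonstration data?
RQ2: Compared with a conventional VPG, does synthetic discriminative training of VPG-C improve the handling of missing visual details?
RQ3: How does VPG-C perform on existing multimodal benchmarks (MME, OwlEval) and on the newly introduced DEMON benchmark?
RQ4: At which position in the LLM/VPG pipeline should the guidance and residual details be injected for the best performance?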

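Key Findings

VPG-C consistently outperforms existing multimodal LLMs across the DEMON task categories.
Training the VPG-C module on synthetic data yields notable gains over training on image-caption data alone.
VPG-C achieves substantial improvements by fine-tuning only a lightweight 6.3M-parameter module.
Zero-shot evaluation shows strong performance on additional benchmarks such as MME and OwlEval.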
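
The two size figures quoted in this review (a 6.3M-parameter module, and 0.09% of the model) can be checked against each other with a short calculation. The total parameter count below is an assumption: it approximates the whole model by the nominal 7B of Vicuna-7B and ignores the vision encoder and Q-Former, whose sizes are not stated here.

```python
trainable_params = 6.3e6      # VPG-C size as quoted in the review
assumed_total_params = 7e9    # ASSUMPTION: nominal size of Vicuna-7B only

fraction_percent = 100 * trainable_params / assumed_total_params
print(f"{fraction_percent:.2f}% of the assumed total")
```

Under that assumption the ratio rounds to 0.09%, in line with the figure given in the method summary.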

This review was generated by AI and checked by a human editor.