QUICK REVIEW

[Paper Review] MIMIC-IT: Multi-Modal In-Context Instruction Tuning

Bo Li, Yuanhan Zhang|arXiv (Cornell University)|Jun 8, 2023

Multimodal Machine Learning Applications | Cited by 28

One-Line Summary

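TL;DR: MIMIC-IT introduces a multimodal instruction-tuning dataset of 2.8M pairs, each with multimodal in-context information, and uses it to train Otter, an OpenFlamingo-based VLM that shows strong perception, reasoning, and in-context learning across benchmarks.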

ABSTRACT

High-quality instructions and responses are essential for the zero-shot performance of large language models on interactive natural language tasks. For interactive vision-language tasks involving intricate visual scenes, a large quantity of diverse and creative instruction-response pairs should be imperative to tune vision-language models (VLMs). Nevertheless, the current availability of vision-language instruction-response pairs in terms of quantity, diversity, and creativity remains limited, posing challenges to the generalization of interactive VLMs. Here we present MultI-Modal In-Context Instruction Tuning (MIMIC-IT), a dataset comprising 2.8 million multimodal instruction-response pairs, with 2.2 million unique instructions derived from images and videos. Each pair is accompanied by multi-modal in-context information, forming conversational contexts aimed at empowering VLMs in perception, reasoning, and planning. The instruction-response collection process, dubbed as Syphus, is scaled using an automatic annotation pipeline that combines human expertise with GPT's capabilities. Using the MIMIC-IT dataset, we train a large VLM named Otter. Based on extensive evaluations conducted on vision-language benchmarks, it has been observed that Otter demonstrates remarkable proficiency in multi-modal perception, reasoning, and in-context learning. Human evaluation reveals it effectively aligns with the user's intentions. We release the MIMIC-IT dataset, instruction-response collection pipeline, benchmarks, and the Otter model.

Motivation and Objectives

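Create high-quality, diverse multimodal instruction-following data aimed at zero-shot generalization of vision-language models.
Build a dataset of multimodal in-context instruction-response pairs that spans images and videos and is available in multiple languages.
Develop Syphus, an automated pipeline that generates instruction-response pairs guided by visual context.
Train a multimodal model (Otter) on MIMIC-IT and evaluate its perception, reasoning, and in-context learning abilities.
Release the dataset, annotation pipeline, benchmarks, and the Otter model to the community.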

Proposed Method

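Define a data format for multi-image/video inputs with in-context information: d_q = (I_q, R_q, X_q, C_ψ(I_q, X_q)).
Build Syphus, an automated pipeline that prompts ChatGPT/GPT-4 with system messages, visual annotations, and in-context examples to generate instruction-response pairs.
Translate all instructions and responses into eight languages to enable multilingual use.
Curate seven diverse visual datasets (indoor, outdoor, egocentric, etc.) so that the dataset covers a wide range of scenes.
Train Otter, based on OpenFlamingo, on MIMIC-IT and evaluate it on MMAGIBench and Multi-Modality Arena.
Provide an evaluation framework that includes few-shot in-context learning tests on COCO Caption and human-aligned evaluation.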
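The tuple d_q = (I_q, R_q, X_q, C_ψ(I_q, X_q)) can be pictured as a small record type. The sketch below is only an illustration: it reads I_q, R_q, and X_q as the instruction, response, and visual inputs of a query, and C_ψ(I_q, X_q) as a list of in-context examples of the same shape, which is an interpretation rather than something this review spells out. The class names, the `<image>` placeholder, and the `User:`/`Assistant:` template are all hypothetical and are not taken from the MIMIC-IT or Otter code.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class InContextExample:
    """One in-context demonstration (hypothetical structure)."""
    instruction: str
    response: str
    visuals: List[str]  # paths or IDs of images / video frames


@dataclass
class MimicItQuery:
    """Illustrative container for d_q = (I_q, R_q, X_q, C_psi(I_q, X_q))."""
    instruction: str                      # I_q
    response: str                         # R_q (training target)
    visuals: List[str]                    # X_q
    in_context: List[InContextExample] = field(default_factory=list)  # C_psi(I_q, X_q)


def to_prompt(q: MimicItQuery) -> str:
    """Flatten in-context examples, then the query, into one text prompt.

    The template is an assumption made for this sketch; the response R_q
    is deliberately left out so the model would have to generate it.
    """
    parts = []
    for ex in q.in_context:
        tags = "<image>" * len(ex.visuals)
        parts.append(f"{tags}User: {ex.instruction} Assistant: {ex.response}")
    tags = "<image>" * len(q.visuals)
    parts.append(f"{tags}User: {q.instruction} Assistant:")
    return "\n".join(parts)


if __name__ == "__main__":
    demo = MimicItQuery(
        instruction="What is on the table?",
        response="A cup.",
        visuals=["query.jpg"],
        in_context=[
            InContextExample("What color is the car?", "Red.", ["a.jpg", "b.jpg"])
        ],
    )
    print(to_prompt(demo))
```

Keeping the in-context examples as the same kind of record as the query makes it straightforward to vary the number of demonstrations per query, which is what a few-shot evaluation needs.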

Experimental Results

Research Questions

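RQ1: How does a large-scale multimodal in-context instruction-tuning dataset improve the zero-shot generalization of vision-language models?
RQ2: How does multimodal in-context information (multiple images/videos) affect instruction-following performance?
RQ3: Can an end-to-end trainable multimodal model (Otter) achieve strong perception, reasoning, and in-context learning across diverse tasks?
RQ4: What multilingual benefits arise from translating the instruction-response pairs into eight languages?
RQ5: How does Otter compare with contemporary VLMs on standard benchmarks and in human evaluation?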

Key Findings

Model	Lang. Decoder	Avg.	Coarse	Finegrained	Attribute	Relation	Future Pred.
InstructBLIP	Vicuna-7B	50.4	67.8	52.2	43.8	38.2	50.0
MiniGPT-4	Vicuna-7B	51.0	63.3	47.8	50.6	26.5	66.7
OpenFlamingo	LLaMA-7B	51.1	34.4	40.0	61.3	52.9	66.7
LLaVA	Vicuna-7B	62.7	44.4	54.2	71.9	76.5	66.7
Otter	LLaMA-7B	65.5	68.9	47.3	66.3	61.8	83.3
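The Avg. column appears to be the unweighted mean of the five sub-scores (Coarse, Finegrained, Attribute, Relation, Future Pred.). That is an inference from the numbers, not something the review states, so the short check below recomputes it from the table values and also reports which model leads each column. The function names and the 0.05 rounding tolerance are choices made for this sketch.

```python
# Scores copied from the table: (reported Avg., [Coarse, Finegrained,
# Attribute, Relation, Future Pred.]).
COLUMNS = ["Coarse", "Finegrained", "Attribute", "Relation", "Future Pred."]
ROWS = {
    "InstructBLIP": (50.4, [67.8, 52.2, 43.8, 38.2, 50.0]),
    "MiniGPT-4": (51.0, [63.3, 47.8, 50.6, 26.5, 66.7]),
    "OpenFlamingo": (51.1, [34.4, 40.0, 61.3, 52.9, 66.7]),
    "LLaVA": (62.7, [44.4, 54.2, 71.9, 76.5, 66.7]),
    "Otter": (65.5, [68.9, 47.3, 66.3, 61.8, 83.3]),
}


def avg_is_consistent(reported, subscores, tol=0.05):
    """True if the reported average matches the plain mean up to rounding."""
    mean = sum(subscores) / len(subscores)
    return abs(mean - reported) <= tol


def leaders(rows):
    """Return the model(s) with the highest score in each sub-score column."""
    result = {}
    for i, col in enumerate(COLUMNS):
        best = max(scores[i] for _, scores in rows.values())
        result[col] = [m for m, (_, scores) in rows.items() if scores[i] == best]
    return result


if __name__ == "__main__":
    for model, (avg, subs) in ROWS.items():
        print(model, "consistent" if avg_is_consistent(avg, subs) else "MISMATCH")
    for col, models in leaders(ROWS).items():
        print(col, "->", ", ".join(models))
```

A check like this is useful when reading the table because the highest average does not imply the highest score in every column.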

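Otter achieved the highest average score (65.5) among the evaluated VLMs on the MMAGIBench perception and reasoning benchmark, leading on coarse perception (68.9) and future prediction (83.3), while LLaVA scored higher on fine-grained perception, attribute, and relation.
In human evaluation on Multi-Modality Arena, Otter obtained the highest Elo rating among recent VLMs, indicating strong helpfulness and alignment.
Otter outperformed OpenFlamingo in few-shot in-context learning on COCO Caption (CIDEr).
The dataset contains 2.8M instruction-response pairs, including 2.2M unique instructions, and covers eight languages.
Sythus enables high-quality multilingual instruction-response generation by combining system prompts, visual annotations, and in-context exemplars.
Otter supports multi-round conversation, scene understanding, and egocentric visual assistant capabilities (Otter-E for AR headsets).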
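For readers unfamiliar with Elo ratings, the sketch below shows the standard Elo update applied to pairwise model comparisons. It is a generic illustration only: the review does not describe how Multi-Modality Arena computes its ratings, so the K-factor of 32, the starting rating of 1000, and the model names are assumptions made for this example.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one comparison.

    score_a is 1.0 if A's answer was preferred, 0.0 if B's was, 0.5 for a tie.
    k=32 is an assumed K-factor, not a value taken from the paper.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


if __name__ == "__main__":
    # Hypothetical models starting from an assumed rating of 1000.
    ratings = {"model_a": 1000.0, "model_b": 1000.0}
    # Each tuple: (first model, second model, score for the first model).
    votes = [("model_a", "model_b", 1.0), ("model_a", "model_b", 0.5)]
    for a, b, score in votes:
        ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], score)
    print(ratings)
```

Because each update moves the two ratings by equal and opposite amounts, the total rating across models stays constant, so a model's final rating reflects how often human voters preferred it relative to the others.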


This review was generated by AI and checked by a human editor.