QUICK REVIEW

[Paper Review] Otter: A Multi-Modal Model with In-Context Instruction Tuning

Bo Li, Yuanhan Zhang | arXiv (Cornell University) | May 5, 2023

Topic Modeling | Citations: 87

One-Sentence Summary

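Otter fine-tunes OpenFlamingo on a new multi-modal in-context instruction dataset, improving instruction following and in-context learning while reducing training requirements and adding Hugging Face integration.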

ABSTRACT

Recent advances in Large Multimodal Models (LMMs) have unveiled great potential as visual assistants. However, most existing works focus on responding to individual instructions or using previous dialogues for contextual understanding. There is little discussion on employing both images and text as in-context examples to enhance the instruction following capability. To bridge this gap, we introduce the Otter model to leverage both textual and visual in-context examples for instruction tuning. Specifically, Otter builds upon Flamingo with Perceiver architecture, and has been instruction tuned for general purpose multi-modal assistant. Otter seamlessly processes multi-modal inputs, supporting modalities including text, multiple images, and dynamic video content. To support the training of Otter, we present the MIMIC-IT (MultI-Modal In-Context Instruction Tuning) dataset, which encompasses over 3 million multi-modal instruction-response pairs, including approximately 2.2 million unique instructions across a broad spectrum of images and videos. MIMIC-IT has been carefully curated to feature a diverse array of in-context examples for each entry. Comprehensive evaluations suggest that instruction tuning with these in-context examples substantially enhances model convergence and generalization capabilities. Notably, the extensive scenario coverage provided by the MIMIC-IT dataset empowers the Otter model to excel in tasks involving complex video and multi-image understanding.

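Motivation and Objectives

Motivate instruction tuning for multi-modal models in order to improve instruction following and in-context learning.
Leverage interleaved multi-modal pretraining data to enable natural cross-modal alignment.
Provide a practical, resource-efficient fine-tuning workflow for researchers.
Integrate Otter with Hugging Face and reduce hardware requirements to democratize access.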

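Proposed Method

Introduce the MIMIC-IT dataset, which consists of image-instruction-answer triplets paired with contextually related in-context examples.
Fine-tune the OpenFlamingo base by training only the cross-attention and Perceiver resampler modules while keeping the vision encoder and language decoder frozen, which yields approximately 1.3B trainable parameters.
Use a chatbot-style training format with special tokens to train both instruction following and in-context learning.
Train for 6 epochs on 4 GPUs with AdamW, using cosine learning-rate decay and gradient clipping.
Integrate Otter into Hugging Face Transformers and provide a conversion script for OpenFlamingo checkpoints.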
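
The optimization recipe above names cosine learning-rate decay and gradient clipping without giving values. The following is a minimal standard-library sketch of those two pieces only, not the authors' training code. The function names and every numeric value in the usage lines (peak learning rate, step count, clipping threshold) are illustrative assumptions that do not come from the review.

```python
import math


def cosine_lr(step, total_steps, peak_lr, min_lr=0.0):
    """Cosine decay from peak_lr at step 0 down to min_lr at total_steps."""
    # Clamp progress to [0, 1] so steps past the end stay at min_lr.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def clip_by_global_norm(grads, max_norm):
    """Rescale gradients so their joint L2 norm does not exceed max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads), total_norm
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm


# Hypothetical values, chosen only to exercise the functions.
lr_at_start = cosine_lr(step=0, total_steps=100, peak_lr=1e-5)
lr_at_middle = cosine_lr(step=50, total_steps=100, peak_lr=1e-5)
clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
```

In a real run these would typically be handled by the training framework's scheduler and clipping utilities; the sketch only makes the arithmetic explicit.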

Figure 1: Otter Overview. Otter is a multi-modal model finetuned on our proposed MIMIC-IT dataset, based on OpenFlamingo. The Otter model exhibits an improved ability to execute tasks by following given instructions and leveraging in-context examples.

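Experimental Results

Research Questions

RQ1: Can multi-modal instruction tuning via MIMIC-IT improve explicit instruction compliance in multi-modal models?
RQ2: Does in-context learning enable Otter to execute new instructions from a small number of examples?
RQ3: What are the practical training resource requirements for achieving strong multi-modal instruction following?
RQ4: How does Otter compare with OpenFlamingo in terms of instruction following and scene understanding?
RQ5: How can the OpenFlamingo architecture be made more accessible to researchers?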

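Key Findings

After fine-tuning on MIMIC-IT, Otter shows improved instruction-following ability over OpenFlamingo.
Otter can learn to execute new instructions from the in-context examples it is given.
Optimizations reduce the training requirement to 4× RTX 3090 GPUs and enable integration into Hugging Face Transformers.
Qualitative analysis shows deeper scene understanding and commonsense reasoning than the baseline.
Otter is released with accessible tooling, including model hub hosting and a conversion script.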
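
The in-context behaviour in these findings depends on how solved examples and the new query are laid out in one chat-style sequence with special tokens. The sketch below shows one plausible layout under stated assumptions: the token strings, the "User:"/"GPT:" role labels, and the helper names are hypothetical placeholders, since the review only says that special tokens are used. Images are assumed to be supplied separately, one per image placeholder.

```python
from dataclasses import dataclass

# Assumed placeholder strings; the review does not spell out the actual tokens.
IMAGE_TOKEN = "<image>"
ANSWER_TOKEN = "<answer>"
END_OF_CHUNK = "<|endofchunk|>"


@dataclass
class Example:
    """One in-context example: an instruction about an image and its answer."""
    instruction: str
    answer: str


def format_turn(instruction, answer=None):
    """Format one image-instruction(-answer) turn in a chat-like layout."""
    turn = f"{IMAGE_TOKEN}User: {instruction} GPT:{ANSWER_TOKEN}"
    if answer is None:
        # Open turn: the model is expected to continue after the answer marker.
        return turn
    return f"{turn} {answer}{END_OF_CHUNK}"


def build_prompt(context, query_instruction):
    """Concatenate solved in-context turns, then the open query turn."""
    parts = [format_turn(ex.instruction, ex.answer) for ex in context]
    parts.append(format_turn(query_instruction))
    return "".join(parts)


context = [Example("What is the animal doing?", "It is swimming.")]
prompt = build_prompt(context, "What is the animal doing?")
```

For a training sample, the final turn would also carry its answer and end-of-chunk marker, so the same `format_turn` helper covers both cases.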

Figure 2: Illustration of example data formats in MMC4 and MIMIC-IT. (a) The data format of the MMC4 dataset used by OpenFlamingo. (b) Three heuristics to build the multi-modal in-context instruction tuning (MIMIC-IT) dataset.

This review was generated by AI and checked by a human editor.