QUICK REVIEW

[Paper Review] LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

Feng Li, Renrui Zhang|arXiv (Cornell University)|Jul 10, 2024

Simulation and Modeling Applications | Cited by 21

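One-line summary

LLaVA-NeXT-Interleave unifies multi-image, video, 3D, and single-image tasks in an interleaved data format, is trained on M4-Instruct, and is evaluated on the new LLaVA-Interleave Bench, achieving leading results across the M4 scenarios while preserving single-image performance.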

ABSTRACT

Visual instruction tuning has made considerable strides in enhancing the capabilities of Large Multimodal Models (LMMs). However, existing open LMMs largely focus on single-image tasks, their applications to multi-image scenarios remains less explored. Additionally, prior LMM research separately tackles different scenarios, leaving it impossible to generalize cross scenarios with new emerging capabilities. To this end, we introduce LLaVA-NeXT-Interleave, which simultaneously tackles Multi-image, Multi-frame (video), Multi-view (3D), and Multi-patch (single-image) scenarios in LMMs. To enable these capabilities, we regard the interleaved data format as a general template and compile the M4-Instruct dataset with 1,177.6k samples, spanning 4 primary domains with 14 tasks and 41 datasets. We also curate the LLaVA-Interleave Bench to comprehensively evaluate the multi-image performance of LMMs. Through extensive experiments, LLaVA-NeXT-Interleave achieves leading results in multi-image, video, and 3D benchmarks, while maintaining the performance of single-image tasks. Besides, our model also exhibits several emerging capabilities, e.g., transferring tasks across different settings and modalities. Code is available at https://github.com/LLaVA-VL/LLaVA-NeXT

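Motivation and objectives

Motivate and build a single LMM that handles multi-image, video, 3D, and single-image tasks (M4).
Propose an interleaved data template that unifies diverse tasks in one framework.
Create and curate the M4-Instruct dataset for training and the LLaVA-Interleave Bench for evaluation.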

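Proposed method

Adopt the LLaVA-NeXT-Image architecture, consisting of a vision encoder, an intermediate projector, and an LLM core.
Introduce three training techniques: (1) continued training from a strong single-image model, (2) mixed interleaved data formats (in-front vs. interleaved), and (3) joint training on four data scenarios (multi-image, multi-frame, multi-view, multi-patch).
Build M4-Instruct with 1,177.6K samples spanning 14 tasks and 41 datasets across the M4 domains; the new tasks are annotated with GPT-4V.
Develop the LLaVA-Interleave Bench with 13 tasks and 17K instances, split into in-domain and out-domain evaluation.
Evaluate across multi-image, video, and 3D benchmarks while also maintaining single-image performance.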
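
The difference between the "in-front" and "interleaved" formats can be illustrated with a minimal sketch. This is not the authors' data pipeline: the `<image>` placeholder string, the segment representation, the function names, and the 50/50 mixing ratio are all assumptions made for illustration.

```python
import random

IMAGE_TOKEN = "<image>"  # assumed placeholder string for one image


def interleaved_prompt(segments):
    """Keep each image placeholder at the position where the image occurs."""
    parts = [IMAGE_TOKEN if kind == "image" else value for kind, value in segments]
    return " ".join(parts)


def in_front_prompt(segments):
    """Move every image placeholder to the front, followed by all text."""
    n_images = sum(1 for kind, _ in segments if kind == "image")
    texts = [value for kind, value in segments if kind == "text"]
    return " ".join([IMAGE_TOKEN] * n_images + texts)


def mixed_format_prompt(segments, rng):
    """Pick one of the two formats at random (the 50/50 ratio is an assumption)."""
    if rng.random() < 0.5:
        return in_front_prompt(segments)
    return interleaved_prompt(segments)


# Hypothetical two-image sample; the file names are never loaded.
example = [
    ("text", "Image A:"),
    ("image", "a.jpg"),
    ("text", "Image B:"),
    ("image", "b.jpg"),
    ("text", "What differs between them?"),
]

print(interleaved_prompt(example))
# Image A: <image> Image B: <image> What differs between them?
print(in_front_prompt(example))
# <image> <image> Image A: Image B: What differs between them?
print(mixed_format_prompt(example, random.Random(0)))
```

Sampling the format per example, as `mixed_format_prompt` does, is one simple way to expose a model to both placements during training; the review does not say how the authors mix the two formats.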

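Experimental results

Research questions

RQ1: Can a single LMM trained on interleaved multi-image data perform competitively on multi-image tasks and generalize to video and 3D scenarios?
RQ2: Does the interleaved data format enable task transfer and new capabilities across modalities?
RQ3: How does initializing from a strong single-image checkpoint affect multi-image fine-tuning performance?
RQ4: How much do input token placement (in-front vs. interleaved) and mixed-format training affect robustness and task performance?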

Key findings

Model	In-domain avg	SD	IE	VST	TRVQA	MIVQA	Puzzle	QB	NLVR2	Out-domain avg	Math	Sci	Mantis	BLINK	MMMU-mv
GPT-4V	39.2	12.5	11.0	10.9	54.5	52.0	17.1	76.5	88.8	57.8	60.3	66.9	62.7	51.1	47.9
LLaVA-NeXT-Image (7B)	32.4	12.9	13.2	10.1	59.6	39.4	9.0	51.0	68.0	29.4	13.5	12.2	46.1	41.8	33.5
VPG-C (7B)	35.8	27.8	15.2	21.5	38.9	46.8	2.4	57.6	73.2	34.5	24.3	23.1	52.4	43.1	29.4
Mantis (7B)	39.6	17.6	11.2	12.5	45.2	52.5	25.7	69.9	87.4	39.3	27.2	29.3	59.5	46.4	34.1
LLaVA-NeXT-Interleave (0.5B)	43.9	34.3	21.6	29.7	63.9	54.8	35.4	52.0	67.8	33.1	13.3	12.2	45.6	39.2	28.6
LLaVA-NeXT-Interleave (7B)	58.6	37.1	24.3	33.1	76.1	87.5	48.7	74.2	88.8	42.8	32.8	31.6	62.7	52.6	34.5
LLaVA-NeXT-Interleave (14B)	62.3	40.5	24.5	33.3	78.6	95.0	59.9	76.7	91.1	44.3	33.4	32.7	66.4	52.1	37.1
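
To make the size of the gains concrete, the short sketch below subtracts baseline averages from those of LLaVA-NeXT-Interleave (7B). The numbers are copied from the table above: the in-domain average is the first numeric column and the out-domain average is the average column immediately before Math. The names `averages` and `gain` are illustrative assumptions, not part of any released code.

```python
# (in-domain average, out-domain average), copied from the table above.
averages = {
    "LLaVA-NeXT-Image (7B)": (32.4, 29.4),
    "Mantis (7B)": (39.6, 39.3),
    "LLaVA-NeXT-Interleave (7B)": (58.6, 42.8),
}


def gain(model, baseline):
    """Return (in-domain, out-domain) point gains of model over baseline."""
    m_in, m_out = averages[model]
    b_in, b_out = averages[baseline]
    # Round to one decimal place to match the precision of the table.
    return round(m_in - b_in, 1), round(m_out - b_out, 1)


print(gain("LLaVA-NeXT-Interleave (7B)", "LLaVA-NeXT-Image (7B)"))  # (26.2, 13.4)
print(gain("LLaVA-NeXT-Interleave (7B)", "Mantis (7B)"))            # (19.0, 3.5)
```

Under these figures, the gain over the single-image starting point is larger in-domain than out-domain, and the margin over Mantis (7B) is much narrower on the out-domain average.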

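Across model sizes (0.5B, 7B, 14B), LLaVA-NeXT-Interleave achieves leading results on multi-image benchmarks while maintaining single-image performance; for example, the 7B model reaches an in-domain average of 58.6, compared with 39.6 for Mantis (7B).
The interleaved data template and joint training on M4-Instruct enable cross-task transfer, such as from single-image to multi-image reasoning and from image tasks to video.
Adding video and multi-image data in the mixed format improves overall metrics and robustness across tasks.
The model shows emerging capabilities such as task transfer across settings and modalities (for example, applying spot-the-difference to video, or generating a Twitter post from a video).
LLaVA-Interleave Bench provides evaluation covering both in-domain and out-domain tasks, emphasizing generalization to unseen multi-image scenarios.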


This review was generated by AI and checked by a human editor.