QUICK REVIEW

[Paper Review] LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark

Zhenfei Yin, Jiong Wang|arXiv (Cornell University)|Jun 11, 2023

Multimodal Machine Learning Applications | Cited by 40

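One-line summary

LAMM introduces an open-source multimodal instruction-tuning dataset, framework, and benchmark for image and point-cloud tasks, enabling MLLM training and evaluation through a unified interface and open code.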

ABSTRACT

Large language models have emerged as a promising approach towards achieving general-purpose AI agents. The thriving open-source LLM community has greatly accelerated the development of agents that support human-machine dialogue interaction through natural language processing. However, human interaction with the world extends beyond only text as a modality, and other modalities such as vision are also crucial. Recent works on multi-modal large language models, such as GPT-4V and Bard, have demonstrated their effectiveness in handling visual modalities. However, the transparency of these works is limited and insufficient to support academic research. To the best of our knowledge, we present one of the very first open-source endeavors in the field, LAMM, encompassing a Language-Assisted Multi-Modal instruction tuning dataset, framework, and benchmark. Our aim is to establish LAMM as a growing ecosystem for training and evaluating MLLMs, with a specific focus on facilitating AI agents capable of bridging the gap between ideas and execution, thereby enabling seamless human-AI interaction. Our main contribution is three-fold: 1) We present a comprehensive dataset and benchmark, which cover a wide range of vision tasks for 2D and 3D vision. Extensive experiments validate the effectiveness of our dataset and benchmark. 2) We outline the detailed methodology of constructing multi-modal instruction tuning datasets and benchmarks for MLLMs, enabling rapid scaling and extension of MLLM research to diverse domains, tasks, and modalities. 3) We provide a primary but potential MLLM training framework optimized for modality extension. We also provide baseline models, comprehensive experimental observations, and analysis to accelerate future research. Our baseline model is trained within 24 A100 GPU hours, framework supports training with V100 and RTX3090 is available thanks to the open-source society.

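Motivation and objectives

Highlight the need for open, transparent multimodal LLMs that handle both vision and language.
Provide a large-scale instruction-tuning dataset of images and 3D point clouds with rich, fine-grained annotations.
Establish an open benchmark and evaluation protocol that quantifies zero-shot and fine-tuned performance on 2D and 3D vision tasks.
Propose a modular training framework that can be extended to additional modalities and tasks.
Present baseline models trained with this framework and dataset to promote open MLLM research.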

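Proposed method

Construct a dataset of 186,098 image-language pairs and 10,262 point cloud-language pairs.
Use the GPT API to generate instruction-response templates that include system messages, in-context learning pairs, and queries designed to yield annotations close to the ground truth.
Convert vision-task annotations into instruction-response pairs to improve the MLLM's instruction understanding.
Propose a training framework in which each modality has its own encoder, projector, and modality-specific LoRA adapter, while all modalities share a common LLM.
Train a Vicuna-13B baseline MLLM end-to-end with modality-specific projection and LoRA, using four A100 GPUs.
Evaluate 2D and 3D tasks on a new benchmark that includes traditional metrics, the Binary Locating Metric, and GPT-based evaluation metrics.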
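
The annotation-to-instruction conversion listed above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the `DetectionAnnotation` type, the `to_instruction_pair` function, and the template strings are hypothetical, and the paper obtains its templates through the GPT API rather than hard-coding them.

```python
import random
from dataclasses import dataclass


@dataclass
class DetectionAnnotation:
    """Hypothetical container for one labelled object in an image."""
    label: str
    bbox: tuple  # (x_min, y_min, x_max, y_max) in pixels


# Hypothetical instruction templates (assumption: hand-written stand-ins
# for the GPT-generated templates described in the review).
INSTRUCTION_TEMPLATES = [
    "Detect all objects in the image and give their bounding boxes.",
    "List every object you can see together with its location.",
]


def to_instruction_pair(annotations, rng=None):
    """Turn a list of detection annotations into one instruction-response pair."""
    if rng is None:
        rng = random.Random(0)  # fixed seed keeps the sketch reproducible
    instruction = rng.choice(INSTRUCTION_TEMPLATES)
    if not annotations:
        response = "There are no annotated objects in the image."
    else:
        parts = [f"a {a.label} at {list(a.bbox)}" for a in annotations]
        response = "The image contains " + ", ".join(parts) + "."
    return {"instruction": instruction, "response": response}
```

Sampling the instruction from several templates is one simple way to vary the phrasing the model sees for the same underlying task.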

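Experimental results

Research questions

RQ1: To what extent can an open-source MLLM achieve zero-shot performance across diverse 2D and 3D vision tasks using a unified instruction-tuning approach?
RQ2: Can the modular multimodal instruction-tuning framework be extended to additional modalities beyond images and point clouds?
RQ3: Do GPT-based evaluation metrics correlate with task ground-truth performance on multimodal generation and reasoning tasks?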

Key findings

Task	Dataset	Zero-shot (LAMM)	Finetune (LAMM)
2D Classification	CIFAR10	37.9	91.2
2D Object Detection	VOC2012	7.20	13.48
2D VQA	SQAimage	49.88	74.27
3D Object Detection	ScanNet	9.3	11.89
3D Visual Grounding	ScanRefer	Failed	3.38
3D VQA	ScanQA	26.54	99.89
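
The fine-tuning gains implied by the table can be checked with a few lines of arithmetic. The sketch below only subtracts the two score columns. The names `RESULTS` and `finetune_gain` are illustrative, and because the table does not state the metric used in each row, the gains are best compared within a row rather than across rows.

```python
# (task, dataset) -> (zero-shot score, fine-tuned score), copied from the table.
# The "Failed" zero-shot entry is stored as None.
RESULTS = {
    ("2D Classification", "CIFAR10"): (37.9, 91.2),
    ("2D Object Detection", "VOC2012"): (7.20, 13.48),
    ("2D VQA", "SQAimage"): (49.88, 74.27),
    ("3D Object Detection", "ScanNet"): (9.3, 11.89),
    ("3D Visual Grounding", "ScanRefer"): (None, 3.38),
    ("3D VQA", "ScanQA"): (26.54, 99.89),
}


def finetune_gain(zero_shot, finetuned):
    """Absolute gain from fine-tuning, or None if zero-shot failed."""
    if zero_shot is None:
        return None
    return round(finetuned - zero_shot, 2)


GAINS = {key: finetune_gain(*scores) for key, scores in RESULTS.items()}

for (task, dataset), gain in GAINS.items():
    shown = "n/a (zero-shot failed)" if gain is None else f"+{gain:.2f}"
    print(f"{task} ({dataset}): {shown}")
```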

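LAMM achieves strong zero-shot baselines on several 2D vision tasks, but shows notable gaps in precise localization and counting, which fine-tuning narrows.
Fine-tuning on image tasks improves performance substantially (e.g., CIFAR10 accuracy rises from 37.9 to 91.2).
3D vision tasks also improve with fine-tuning, most markedly on VQA (e.g., ScanNet object detection from 9.3 to 11.89 and ScanQA VQA from 26.54 to 99.89).
In some cases, GPT-based evaluation metrics assign higher relevance and accuracy scores to captions than BLEU-based metrics do.
The Binary Locating Metric shows that the LAMM baseline improves localization ability relative to other baselines, although it remains less accurate than some conventional methods.
The dataset and framework are scalable: performance improves as the amount of data increases.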
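
The review does not define the Binary Locating Metric. One plausible reading, assumed here, is that a prediction counts as correct when the predicted point falls inside the ground-truth bounding box, and the score is the fraction of correct predictions. The function names below are hypothetical and the sketch should not be taken as the benchmark's actual implementation.

```python
def point_in_box(point, box):
    """Return True if a 2D point lies inside (or on the edge of) a box."""
    x, y = point
    x_min, y_min, x_max, y_max = box
    return x_min <= x <= x_max and y_min <= y <= y_max


def binary_locating_accuracy(predicted_points, gt_boxes):
    """Fraction of predicted points that land inside their ground-truth box.

    Assumption: predictions and boxes are paired one-to-one by index.
    """
    if len(predicted_points) != len(gt_boxes):
        raise ValueError("predictions and boxes must have the same length")
    if not gt_boxes:
        return 0.0
    hits = sum(point_in_box(p, b) for p, b in zip(predicted_points, gt_boxes))
    return hits / len(gt_boxes)
```

Unlike IoU-based detection metrics, a binary check of this kind rewards roughly correct localization even when the predicted box extent is poor, which may be why it is reported alongside traditional metrics.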


This review was generated by AI and checked by a human editor.