QUICK REVIEW

[Paper Review] mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models

Jiabo Ye, Haiyang Xu | arXiv (Cornell University) | Aug 9, 2024

Natural Language Processing Techniques | Cited by 6

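One-Line Summary

This paper proposes mPLUG-Owl3, a lightweight multimodal LLM equipped with Hyper Attention blocks that efficiently integrates vision and language over long image sequences. It achieves state-of-the-art results among similarly sized models on single-image, multi-image, and video benchmarks, and introduces Distractor Resistance, an evaluation for long visual contexts.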

ABSTRACT

Multi-modal Large Language Models (MLLMs) have demonstrated remarkable capabilities in executing instructions for a variety of single-image tasks. Despite this progress, significant challenges remain in modeling long image sequences. In this work, we introduce the versatile multi-modal large language model, mPLUG-Owl3, which enhances the capability for long image-sequence understanding in scenarios that incorporate retrieved image-text knowledge, interleaved image-text, and lengthy videos. Specifically, we propose novel hyper attention blocks to efficiently integrate vision and language into a common language-guided semantic space, thereby facilitating the processing of extended multi-image scenarios. Extensive experimental results suggest that mPLUG-Owl3 achieves state-of-the-art performance among models with a similar size on single-image, multi-image, and video benchmarks. Moreover, we propose a challenging long visual sequence evaluation named Distractor Resistance to assess the ability of models to maintain focus amidst distractions. Finally, with the proposed architecture, mPLUG-Owl3 demonstrates outstanding performance on ultra-long visual sequence inputs. We hope that mPLUG-Owl3 can contribute to the development of more efficient and powerful multimodal large language models.

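Research Motivation and Objectives

Address the challenge of understanding long image sequences in multimodal LLMs.
Develop a lightweight cross-attention mechanism that extends temporal scope while preserving visual information.
Enable efficient processing of interleaved vision-language inputs and ultra-long image sequences.
Evaluate performance on single-image, multi-image, and video benchmarks, and introduce a long-context distractor test.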

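Proposed Method

Introduces the Hyper Attention Transformer Block (HATB), which augments self-attention with cross-attention running in parallel.
Uses Multimodal-Interleaved Rotary Position Embedding (MI-Rope) to encode the positions of visual inputs within interleaved sequences.
Adopts an adaptive gating mechanism that fuses visual and textual features based on the textual context.
Shares visual key-value (KV) projections, initialized from the language model, to preserve visual information.
Trains in three stages: image-text pre-training, multi-image pre-training (including interleaved, text-rich, and video data), and supervised fine-tuning on instruction data.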
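
To make the parallel self-attention and cross-attention, position sharing, and gating ideas above more concrete, here is a minimal pure-Python sketch. It is not the authors' implementation. The `<image>` placeholder token, the rule that every patch of an image reuses its placeholder's sequence index, the sigmoid gate, and the convex blend of the two branches are all assumptions made for illustration. Learned projections, multiple heads, the rotary embedding itself, and layer normalization are omitted.

```python
import math

IMAGE = "<image>"  # hypothetical placeholder token marking where an image sits


def visual_position_ids(tokens, patches_per_image):
    """Assumed MI-Rope-style rule: every patch of an image gets the
    sequence index of that image's placeholder token."""
    ids = []
    for index, token in enumerate(tokens):
        if token == IMAGE:
            ids.append([index] * patches_per_image)
    return ids


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    width = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(width)]


def hyper_attention_step(text_states, t, visual_feats, gate_weights):
    """Fuse the two parallel branches for the text token at position t.

    gate_weights is a hypothetical learned vector; the gate depends only
    on the text state, so the text context decides how much visual
    information is mixed in.
    """
    query = text_states[t]
    # Branch 1: causal self-attention over the text seen so far.
    context = text_states[: t + 1]
    self_out = attend(query, context, context)
    # Branch 2: the same query cross-attends to the visual features.
    cross_out = attend(query, visual_feats, visual_feats)
    # Assumed adaptive gate: sigmoid of a linear function of the text state.
    gate = 1.0 / (1.0 + math.exp(-dot(gate_weights, query)))
    return [gate * c + (1.0 - gate) * s for c, s in zip(cross_out, self_out)]
```

For example, `visual_position_ids(["Compare", IMAGE, "and", IMAGE, "."], 3)` returns `[[1, 1, 1], [3, 3, 3]]`, so each image's patches are tied to the point where that image appears in the text. With a zero gate vector the gate is 0.5, and `hyper_attention_step([[1.0, 0.0]], 0, [[0.0, 2.0]], [0.0, 0.0])` returns `[0.5, 1.0]`, an even blend of the text and visual branches.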

Figure 1: (a) mPLUG-Owl3 demonstrates leading performance on video and multi-image understanding. (b,c,d) Examples of mPLUG-Owl3 on handling different scale of multi-image scenarios.

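Experimental Results

Research Questions

RQ1: How can long interleaved image-text sequences be fused efficiently in a multimodal LLM without excessive memory use or latency?
RQ2: Does Hyper Attention deliver robust performance on single-image, multi-image, and video tasks compared with conventional cross-attention approaches?
RQ3: How do MI-Rope and adaptive gating affect multi-image and video understanding performance?
RQ4: Can an 8B-parameter model achieve state-of-the-art results on diverse multimodal benchmarks while remaining efficient?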

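Key Findings

mPLUG-Owl3 achieves state-of-the-art performance among similarly sized models on 14 of 20 benchmarks covering single-image, multi-image, and video tasks.
The Hyper Attention design yields strong single-image results and improves multi-image and video performance over concatenation-based and other cross-attention schemes.
MI-Rope and the adaptive gate contribute markedly to multi-image and video understanding, and integrating four HATB layers balances performance and efficiency.
On ultra-long visual sequences and distractor-heavy contexts, mPLUG-Owl3 shows robust distractor resistance and outperforms several baselines in long-context evaluation.
Thanks to its lightweight Hyper Attention blocks, the 8B-parameter model offers better inference speed and memory efficiency than larger counterparts.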

This review was generated by AI and checked by a human editor.