QUICK REVIEW

[Paper Review] Recognizing Everything from All Modalities at Once: Grounded Multimodal Universal Information Extraction

Meishan Zhang, Hao Fei|arXiv (Cornell University)|Jun 6, 2024

Natural Language Processing Techniques|Cited by 5

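One-Line Summary

Introduces Grounded Multimodal Universal Information Extraction (MUIE) and proposes Reamo, a multimodal LLM that recognizes and grounds information across text, audio, image, and video inputs, together with a new benchmark dataset for evaluation.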

ABSTRACT

In the field of information extraction (IE), tasks across a wide range of modalities and their combinations have been traditionally studied in isolation, leaving a gap in deeply recognizing and analyzing cross-modal information. To address this, this work for the first time introduces the concept of grounded Multimodal Universal Information Extraction (MUIE), providing a unified task framework to analyze any IE tasks over various modalities, along with their fine-grained groundings. To tackle MUIE, we tailor a multimodal large language model (MLLM), Reamo, capable of extracting and grounding information from all modalities, i.e., recognizing everything from all modalities at once. Reamo is updated via varied tuning strategies, equipping it with powerful capabilities for information recognition and fine-grained multimodal grounding. To address the absence of a suitable benchmark for grounded MUIE, we curate a high-quality, diverse, and challenging test set, which encompasses IE tasks across 9 common modality combinations with the corresponding multimodal groundings. The extensive comparison of Reamo with existing MLLMs integrated into pipeline approaches demonstrates its advantages across all evaluation dimensions, establishing a strong benchmark for the follow-up research. Our resources are publicly released at https://haofei.vip/MUIE.

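Motivation and Objectives

Unify information extraction tasks (NER, RE, EE) that span multiple modalities (text, audio, image, video) in a single grounded framework.
Develop a multimodal LLM (Reamo) that can extract and ground information from all modalities.
Build and release a high-quality grounded MUIE benchmark dataset that covers 9 modality combinations with their groundings.
Enable fine-grained cross-modal grounding and evaluation, going beyond text-centric IE methods.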

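Proposed Method

Propose the grounded MUIE task and formalize its output as UIE labels plus fine-grained groundings across modalities.
Design Reamo with ImageBind as the multimodal encoder, Vicuna as the LLM backbone, and modular grounding decoders: SEEM for vision and SHAS for audio.
Fine-tune Reamo with three tuning strategies: UIE instruction tuning on text data, multimodal alignment on X-caption data, and fine-grained grounding tuning with phrase grounding.
Adopt a pipeline-style flow in which Reamo first performs UIE, after which the downstream grounding modules produce groundings as objects or segments in images, tracks in videos, and segments in audio.
Build a benchmark of 3,000 test instances across 9 modality combinations that evaluates IE performance and grounding accuracy, covering both modality-shared and modality-specific groundings.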
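
To make the task formulation above more concrete, here is a minimal Python sketch of what a grounded MUIE output record might look like: a UIE label paired with fine-grained groundings per modality, filled in by a second stage after extraction. All names (`Grounding`, `Extraction`, `extract_then_ground`) and the field layout are hypothetical and chosen for illustration; they are not taken from the Reamo code base, and the stub functions merely stand in for the real grounding decoders.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Grounding:
    """A fine-grained grounding of one extracted item (hypothetical schema)."""
    modality: str                # "image", "video", or "audio"
    span: Tuple[float, ...]      # e.g. a box (x1, y1, x2, y2) or a time range (start, end)


@dataclass
class Extraction:
    """One UIE result: a NER, RE, or EE item plus its groundings."""
    task: str                    # "NER", "RE", or "EE"
    label: str                   # e.g. an entity type or relation name
    text: str                    # surface form produced by the LLM
    groundings: List[Grounding] = field(default_factory=list)


def extract_then_ground(
    extractions: List[Extraction],
    grounders: Dict[str, Callable[[str], Tuple[float, ...]]],
) -> List[Extraction]:
    """Second stage: attach one grounding per available decoder.

    `extractions` is assumed to be the first-stage (UIE) output; `grounders`
    maps a modality name to a stand-in for a grounding decoder.
    """
    for item in extractions:
        for modality, ground in grounders.items():
            item.groundings.append(Grounding(modality, ground(item.text)))
    return extractions


# Toy usage with stub decoders that return fixed spans.
stage1 = [Extraction(task="NER", label="PERSON", text="the speaker")]
stubs = {
    "image": lambda phrase: (10.0, 20.0, 110.0, 220.0),  # fake bounding box
    "audio": lambda phrase: (1.5, 3.0),                  # fake time range in seconds
}
results = extract_then_ground(stage1, stubs)
```

Keeping the groundings as a list on each extraction lets the same record type serve text-only inputs (empty list) and mixed-modality inputs (several entries), which mirrors the "modality-shared and modality-specific" distinction only loosely.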

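Experimental Results

Research Questions

RQ1: How can IE tasks (NER, RE, EE) be unified across text, image, audio, and video under the grounded MUIE framework?
RQ2: Can a dedicated multimodal LLM (Reamo) jointly perform information extraction and fine-grained multimodal grounding across all modalities?
RQ3: How do grounding availability and modality alignment affect IE performance across different modality combinations?
RQ4: Which benchmark and evaluation protocol best measure grounded MUIE capability and set a standard for future research?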

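Key Findings

Reamo shows strong zero-shot performance compared with existing MLLMs on text+image, text+audio, text+video, and single-modality inputs.
Reamo outperforms pipeline baselines on NER, RE, and EE, and delivers strong multimodal grounding, including image segmentation, audio segmentation, and video tracking.
It remains robust when modalities are misaligned, outperforming the baselines in both modality-shared and modality-specific settings.
Zero-shot results also improve consistently in complex mixed-modality scenarios (e.g., text+image+audio, text+video+audio).
Grounding ability and IE accuracy generally decline gradually as the number of entities or objects grows, but Reamo keeps its advantage over the baselines.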
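
The IE and grounding scores discussed above can be illustrated with two widely used measures: F1 over extracted tuples and intersection-over-union (IoU) between a predicted and a gold box. This is only a sketch under stated assumptions: the review does not specify the exact metrics or matching rules used in the benchmark, so exact-match F1 and box IoU are assumptions here, and the function names `tuple_f1` and `box_iou` are hypothetical.

```python
from typing import Set, Tuple


def tuple_f1(pred: Set[Tuple[str, ...]], gold: Set[Tuple[str, ...]]) -> float:
    """Exact-match F1 between predicted and gold tuples (assumed matching rule)."""
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)


def box_iou(a: Tuple[float, float, float, float],
            b: Tuple[float, float, float, float]) -> float:
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Toy example: one of two predicted entities is correct; boxes overlap partially.
pred = {("PERSON", "Alice"), ("ORG", "Acme")}
gold = {("PERSON", "Alice"), ("LOC", "Paris")}
f1 = tuple_f1(pred, gold)
iou = box_iou((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0))
```

Audio segments or video tracks would need a temporal or per-frame variant of the overlap measure; the box case is shown only because it is the simplest to verify by hand.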


This review was generated by AI and checked by a human editor.