QUICK REVIEW

[Paper Review] UniM: A Unified Any-to-Any Interleaved Multimodal Benchmark

Yanlin Li, Minghui Guo|arXiv (Cornell University)|Mar 5, 2026

Multimodal Machine Learning Applications | Citations: 0

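ONE-LINE SUMMARY

UniM introduces the first unified benchmark for any-to-any interleaved multimodal learning. It comprises a dataset of 31K instances spanning 7 modalities and 30 domains, an evaluation suite, and a strong agentic baseline, UniMA. The results show that current MLLMs struggle with unified interleaved tasks, while UniMA provides a robust baseline and insights for future research.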

ABSTRACT

In real-world multimodal applications, systems usually need to comprehend arbitrarily combined and interleaved multimodal inputs from users, while also generating outputs in any interleaved multimedia form. This capability defines the goal of any-to-any interleaved multimodal learning under a unified paradigm of understanding and generation, posing new challenges and opportunities for advancing Multimodal Large Language Models (MLLMs). To foster and benchmark this capability, this paper introduces the UniM benchmark, the first Unified Any-to-Any Interleaved Multimodal dataset. UniM contains 31K high-quality instances across 30 domains and 7 representative modalities: text, image, audio, video, document, code, and 3D, each requiring multiple intertwined reasoning and generation capabilities. We further introduce the UniM Evaluation Suite, which assesses models along three dimensions: Semantic Correctness & Generation Quality, Response Structure Integrity, and Interleaved Coherence. In addition, we propose UniMA, an agentic baseline model equipped with traceable reasoning for structured interleaved generation. Comprehensive experiments demonstrate the difficulty of UniM and highlight key challenges and directions for advancing unified any-to-any multimodal intelligence. The project page is https://any2any-mllm.github.io/unim.

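Research Motivation and Objectives

Motivate and operationalize the any-to-any interleaved multimodal learning paradigm so that it reflects real-world interactions.
Provide a large-scale, high-quality dataset spanning multiple modalities and domains.
Develop a principled evaluation suite that captures semantic correctness, structure, and interleaved coherence.
Provide a robust baseline model with traceable reasoning against which future MLLMs can be benchmarked.
Highlight challenges and directions for advancing unified interleaved multimodal intelligence.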

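Proposed Method

Curate 31,026 high-quality interleaved instances across 30 domains and 7 modalities (text, image, audio, video, document, code, and 3D).
Design an open-form QA format with modality placeholders to simulate any-to-any interleaved inputs and outputs.
Introduce the UniM Evaluation Suite, which consists of three dimensions: Semantic Correctness & Generation Quality, Response Structure Integrity, and Interleaved Coherence.
Propose UniMA, an agentic baseline equipped with a Traceable Evidence Reasoning (TER) module and a task-conditioned evidence approach.
Apply a two-stage quality-control process, comprising manual review and independent checking, to ensure data quality.
Evaluate models with automatic metrics whose agreement with human judgment is checked via Pearson correlation, complemented by ablation studies.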
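
To make the idea of modality placeholders more concrete, here is a minimal Python sketch. This review does not reproduce the benchmark's actual placeholder syntax, so the `<image_1>` style tags, the `find_placeholders` and `missing_assets` helpers, and the asset dictionary are all illustrative assumptions and not UniM's real format. The sketch only shows how interleaved text might refer to non-text assets, and how a simple structural check could flag a placeholder that has no asset behind it.

```python
import re

# Hypothetical placeholder syntax (<image_1>, <audio_2>, ...); NOT taken from the paper.
# Text itself needs no placeholder, so only the six non-text modalities are listed.
MODALITIES = ("image", "audio", "video", "document", "code", "3d")
PLACEHOLDER = re.compile(r"<(" + "|".join(MODALITIES) + r")_(\d+)>")


def find_placeholders(text):
    """Return (modality, index) pairs in order of appearance."""
    return [(m.group(1), int(m.group(2))) for m in PLACEHOLDER.finditer(text)]


def missing_assets(text, assets):
    """Return the placeholder keys in `text` that have no entry in `assets`."""
    keys = [f"{mod}_{idx}" for mod, idx in find_placeholders(text)]
    return [key for key in keys if key not in assets]


# Made-up example instance, for illustration only.
question = "Compare the chart in <image_1> with the narration in <audio_1>."
assets = {"image_1": "chart.png"}

print(find_placeholders(question))
print(missing_assets(question, assets))
```

A check of this kind is in the spirit of the Response Structure Integrity dimension, although the suite's actual scoring procedure is not described in this review.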

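Experimental Results

Research Questions

RQ1: How well can current MLLMs handle unified any-to-any interleaved tasks across diverse modalities and domains?
RQ2: What are the strengths and limitations of existing MLLMs when evaluated under the unified interleaved paradigm?
RQ3: Can an agentic baseline with traceable reasoning improve performance and reliability on UniM tasks?
RQ4: How should metrics be designed to fairly evaluate semantic correctness, structural integrity, and interleaved coherence in interleaved multimodal generation?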

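Key Findings

UniMA substantially outperforms the baseline models on multiple metrics across UniM, achieving higher scores for semantic correctness, generation quality, and interleaved coherence.
Baseline models have low absolute SQCS and ICS scores, and their structure and coherence degrade sharply as task complexity increases.
Across multiple domains, UniMA achieves 2–6× the StS/LeS and roughly 15–40× the ICS of the baselines, indicating better modality coverage and coordination.
The SQCS and ICS metrics correlate strongly with human judgment (Pearson r ≈ 0.974 and 0.960, respectively).
UniM data spans 30 domains and 7 modalities, emphasizes multi-task, multimodal reasoning, and is organized into Easy/Medium/Hard difficulty levels.
Ablation studies show that TER is important for structural conformance and that the verification submodule is essential for reliable interleaved output.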
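
The metric-versus-human agreement reported above is a Pearson correlation. As a rough sketch of how such a check can be computed, the Python snippet below implements the standard Pearson formula from scratch. The `pearson_r` helper and the sample numbers are illustrative assumptions: they are not the paper's data or evaluation code, and are included only to show the calculation.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up numbers for illustration only; NOT results from the paper.
metric_scores = [0.10, 0.35, 0.50, 0.70, 0.90]  # hypothetical automatic metric
human_ratings = [1.0, 2.0, 3.0, 4.0, 5.0]       # hypothetical human ratings

print(f"Pearson r = {pearson_r(metric_scores, human_ratings):.3f}")
```

A value close to 1 means the automatic metric ranks and spaces responses much as the human raters do, which is the property the reported r values are meant to support.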


This review was generated by AI and checked by a human editor.