QUICK REVIEW

[Paper Review] UniG2U-Bench: Do Unified Models Advance Multimodal Understanding?

Zimo Wen, Boxiu Li | arXiv (Cornell University) | Mar 3, 2026

Multimodal Machine Learning Applications | Citations: 0

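One-Sentence Summary

UniG2U-Bench systematically evaluates whether generation contributes to understanding in unified multimodal models. It finds degradation overall, but task-specific gains on spatial, visual-illusion, and multi-round reasoning tasks.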

ABSTRACT

Unified multimodal models have recently demonstrated strong generative capabilities, yet whether and when generation improves understanding remains unclear. Existing benchmarks lack a systematic exploration of the specific tasks where generation facilitates understanding. To this end, we introduce UniG2U-Bench, a comprehensive benchmark categorizing generation-to-understanding (G2U) evaluation into 7 regimes and 30 subtasks, requiring varying degrees of implicit or explicit visual transformations. Extensive evaluation of over 30 models reveals three core findings: 1) Unified models generally underperform their base Vision-Language Models (VLMs), and Generate-then-Answer (GtA) inference typically degrades performance relative to direct inference. 2) Consistent enhancements emerge in spatial intelligence, visual illusions, or multi-round reasoning subtasks, where enhanced spatial and shape perception, as well as multi-step intermediate image states, prove beneficial. 3) Tasks with similar reasoning structures and models sharing architectures exhibit correlated behaviors, suggesting that generation-understanding coupling induces class-consistent inductive biases over tasks, pretraining data, and model architectures. These findings highlight the necessity for more diverse training data and novel paradigms to fully unlock the potential of unified multimodal modeling.

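Motivation and Objectives

Evaluate whether generation in unified multimodal models (UMMs) actually improves understanding relative to their base VLMs.
Characterize how generation-to-understanding (G2U) gains vary across tasks with different cognitive demands.
Isolate the causal effect of generation through strict base-model pairing and budget-matched comparisons.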

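Proposed Method

Define Generation Helps Understanding (G2U) and pair each unified model with a budget-matched discriminative base VLM.
Build UniG2U-Bench from diverse datasets: 3,000 samples spanning 7 reasoning regimes and 30 subtasks.
Evaluate unified models under both Direct inference and Generate-then-Answer (GtA) inference to isolate the G2U effect.
Introduce two diagnostic metrics for the intermediate visuals: Reasoning-Alignment (RA) and Answer-Alignment (AL).
Decompose the G2U gain into Direct and GtA components and analyze it across task families and model archetypes.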
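The split of the G2U gain into Direct and GtA components can be illustrated with a short Python sketch. The review does not state the paper's exact formula, so the decomposition here is an assumption made for illustration: the Direct component is the unified model's direct-inference accuracy minus the base VLM's accuracy, and the GtA component is the further change from switching the unified model to Generate-then-Answer. The names `PairedAccuracy` and `decompose_gain` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PairedAccuracy:
    """Accuracies in [0, 1] for one unified model and its paired base VLM on one subtask."""
    base_direct: float     # base VLM, direct inference
    unified_direct: float  # unified model, direct inference
    unified_gta: float     # unified model, Generate-then-Answer inference


def decompose_gain(acc: PairedAccuracy) -> dict[str, float]:
    """Split the total gain over the base VLM into two additive parts.

    Assumed decomposition (not taken from the paper):
      direct = unified_direct - base_direct    # effect of unified training
      gta    = unified_gta    - unified_direct # effect of generating first
      total  = direct + gta                    # unified_gta - base_direct
    """
    direct = acc.unified_direct - acc.base_direct
    gta = acc.unified_gta - acc.unified_direct
    return {"direct": direct, "gta": gta, "total": direct + gta}
```

Under this assumed split, a negative Direct component alongside a positive GtA component would describe a subtask where the unified model starts behind its base VLM but recovers ground by generating an intermediate image first.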

Figure 1 : Model Performance Radar Chart

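Experimental Results

Research Questions

RQ1: Does generation improve or degrade understanding in unified multimodal models?
RQ2: In which task regimes and under which cognitive demands do consistent G2U benefits or harms appear?
RQ3: Do generation and model architecture induce class-consistent inductive biases across tasks?
RQ4: How do intermediate visual artifacts correlate with final answers across tasks and models?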

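Key Findings

Unified models generally underperform their base VLMs on standard understanding tasks.
Generate-then-Answer (GtA) inference tends to degrade performance relative to direct inference.
Spatial-intelligence and visual-illusion subtasks, as well as tasks involving multi-round reasoning, improve consistently when visual transformations are externalized.
Tasks with similar reasoning structures, and models that share an architecture, exhibit correlated behaviors.
Coupling generation with understanding exposes inductive biases shaped by pretraining data and architecture.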
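One way to make "correlated behaviors" concrete is to compare per-model gain vectors for two subtasks. The sketch below is only an illustration: the review does not say which statistic the paper uses, so Pearson correlation is an assumption, and the gain values are made-up placeholders rather than results from the benchmark.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical gains (GtA minus Direct accuracy, in percentage points) for
# the same four models on two subtasks. Placeholder numbers, not real data.
gains_task_a = [1.0, 2.0, 3.0, 4.0]
gains_task_b = [2.0, 4.0, 6.0, 8.0]

task_correlation = pearson(gains_task_a, gains_task_b)
```

A value near 1 would indicate that models which gain on one subtask also tend to gain on the other. Transposing the data (one vector per model, indexed by subtask) gives the analogous comparison between models that share an architecture.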

Figure 2 : Taxonomy of unified multimodal models (UMMs). All models annotated in the figure are benchmarked in this work.

This review was generated by AI and checked by a human editor.