QUICK REVIEW

[Paper Review] UniG2U-Bench: Do Unified Models Advance Multimodal Understanding?

Zimo Wen, Boxiu Li | arXiv (Cornell University) | 2026-03-03

Multimodal Machine Learning Applications | Citations: 0

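One-line summary

UniG2U-Bench systematically evaluates whether generation improves understanding in unified multimodal models. It finds degradation overall, but task-specific gains on spatial, illusion-sensitive, and multi-step reasoning tasks.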

ABSTRACT

Unified multimodal models have recently demonstrated strong generative capabilities, yet whether and when generation improves understanding remains unclear. Existing benchmarks lack a systematic exploration of the specific tasks where generation facilitates understanding. To this end, we introduce UniG2U-Bench, a comprehensive benchmark categorizing generation-to-understanding (G2U) evaluation into 7 regimes and 30 subtasks, requiring varying degrees of implicit or explicit visual transformations. Extensive evaluation of over 30 models reveals three core findings: 1) Unified models generally underperform their base Vision-Language Models (VLMs), and Generate-then-Answer (GtA) inference typically degrades performance relative to direct inference. 2) Consistent enhancements emerge in spatial intelligence, visual illusions, or multi-round reasoning subtasks, where enhanced spatial and shape perception, as well as multi-step intermediate image states, prove beneficial. 3) Tasks with similar reasoning structures and models sharing architectures exhibit correlated behaviors, suggesting that generation-understanding coupling induces class-consistent inductive biases over tasks, pretraining data, and model architectures. These findings highlight the necessity for more diverse training data and novel paradigms to fully unlock the potential of unified multimodal modeling.

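Motivation and objectives

Evaluate whether generation in unified multimodal models (UMMs) actually improves understanding relative to their base VLMs.
Characterize how generation-to-understanding (G2U) gains vary across tasks with different cognitive demands.
Isolate the causal effect of generation through strict base-model pairing and budget-matched comparisons.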

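Proposed method

Define G2U as the question of whether generative capability helps understanding, and pair each UMM with its corresponding base VLM under a matched budget.
Build UniG2U from 3,000 samples drawn from diverse datasets, organized into 7 reasoning regimes and 30 subtasks.
Evaluate UMMs under both direct inference and Generate-then-Answer (GtA) inference to isolate the G2U effect.
Introduce two diagnostic metrics for the intermediate visual information: Reasoning-Alignment (RA) and Answer-Alignment (AL).
Decompose G2U gains into Direct and GtA components, and analyze how they differ across task families and model archetypes.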
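
The decomposition into Direct and GtA components can be illustrated with a minimal sketch. It assumes an additive split, which this review does not spell out: the Direct component is the gap between the UMM and its paired base VLM under direct inference, and the GtA component is the change when the same UMM switches to Generate-then-Answer. The class `PairedScores`, the function `decompose_g2u`, and all field names are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class PairedScores:
    """Accuracy in [0, 1] for one subtask. All field names are hypothetical."""
    base_vlm_direct: float  # paired base VLM, direct inference
    umm_direct: float       # unified model, direct inference
    umm_gta: float          # unified model, Generate-then-Answer inference


def decompose_g2u(s: PairedScores) -> dict:
    """Split the UMM-vs-base gap into two additive parts (an assumed form)."""
    # Direct component: effect of unified training alone, no image generated.
    direct = s.umm_direct - s.base_vlm_direct
    # GtA component: effect of generating an intermediate image before answering.
    gta = s.umm_gta - s.umm_direct
    return {"direct": direct, "gta": gta, "total": direct + gta}
```

Under this assumed form, a negative `direct` value paired with a positive `gta` value would describe a subtask where the unified model starts behind its base VLM but recovers ground by externalizing the visual transformation.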

Figure 1: Model Performance Radar Chart

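Experimental results

Research questions

RQ1: When does generation improve or degrade understanding in unified multimodal models?
RQ2: Which task regimes or cognitive demands show consistent G2U benefits or harms?
RQ3: Do generation and model architecture induce class-consistent inductive biases across tasks?
RQ4: How do intermediate visual artifacts correlate with final answers across tasks and models?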

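Key findings

Unified models generally underperform their base VLMs on standard understanding tasks.
Generate-then-Answer (GtA) inference tends to degrade performance relative to direct inference.
Spatial, illusion-sensitive, and multi-round reasoning subtasks show consistent improvements when the visual transformation is externalized.
Tasks with similar reasoning structures, and models that share an architecture, show correlated behavior.
Generation-understanding coupling reveals inductive biases shaped by pretraining data and architecture.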
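
One plausible way to quantify the correlated behavior described above is to compare per-subtask GtA deltas (GtA accuracy minus direct accuracy) between two models using a Pearson coefficient. The sketch below is an illustration only. The review does not state which correlation statistic the authors use, and the function name `pearson` and its inputs are assumptions.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of per-subtask deltas."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Co-variation of the two models' deltas around their own means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

Each list would hold one delta per subtask, in the same subtask order for both models. A coefficient near 1 for two models sharing an architecture would be consistent with the class-consistent bias the review reports, while a value near 0 would not.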

Figure 2: Taxonomy of unified multimodal models (UMMs). All models annotated in the figure are benchmarked in this work.


This review was generated by AI and reviewed by a human editor.