QUICK REVIEW

[Paper Review] MICON-Bench: Benchmarking and Enhancing Multi-Image Context Image Generation in Unified Multimodal Models

Mingrui Wu, Hang Liu|arXiv (Cornell University)|2026. 02. 23.

Generative Adversarial Networks and Image Synthesis | Citations: 0

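One-Line Summary

MICON-Bench introduces a six-task benchmark for multi-image context generation and presents Dynamic Attention Rebalancing (DAR), a plug-and-play method that improves cross-image consistency. Its Evaluation-by-Checkpoint framework uses an MLLM verifier for automatic evaluation.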

ABSTRACT

Recent advancements in Unified Multimodal Models (UMMs) have enabled remarkable image understanding and generation capabilities. However, while models like Gemini-2.5-Flash-Image show emerging abilities to reason over multiple related images, existing benchmarks rarely address the challenges of multi-image context generation, focusing mainly on text-to-image or single-image editing tasks. In this work, we introduce MICON-Bench, a comprehensive benchmark covering six tasks that evaluate cross-image composition, contextual reasoning, and identity preservation. We further propose an MLLM-driven Evaluation-by-Checkpoint framework for automatic verification of semantic and visual consistency, where multimodal large language model (MLLM) serves as a verifier. Additionally, we present Dynamic Attention Rebalancing (DAR), a training-free, plug-and-play mechanism that dynamically adjusts attention during inference to enhance coherence and reduce hallucinations. Extensive experiments on various state-of-the-art open-source models demonstrate both the rigor of MICON-Bench in exposing multi-image reasoning challenges and the efficacy of DAR in improving generation quality and cross-image coherence. Github: https://github.com/Angusliuuu/MICON-Bench.

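Motivation and Objectives

Evaluate the ability of Unified Multimodal Models (UMMs) to process and generate images conditioned on multiple related reference images.
Provide a comprehensive, extensible benchmark (MICON-Bench) of six tasks that probe cross-image composition, contextual reasoning, and identity preservation.
Introduce an automatic Evaluation-by-Checkpoint framework that uses an MLLM verifier to score tasks objectively.
Propose a training-free mechanism, Dynamic Attention Rebalancing (DAR), that improves attention allocation at inference time and reduces cross-image hallucinations.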

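Proposed Method

Define six multi-image context tasks: Object Composition, Spatial Composition, Attribute Disentanglement, Component Transfer, FG/BG Composition, Story Generation.
Use an Evaluation-by-Checkpoint pipeline in which an MLLM verifies predefined visual and semantic checkpoints, issuing a binary Pass/Fail verdict for each; the verdicts are averaged into a score.
Sample a subset of query tokens, compute their attention maps over the reference tokens, and estimate each reference token's importance across heads, which DAR then uses to adjust attention.
Apply min-max normalization to the reference-token attention scores to identify highly relevant and irrelevant regions.
Use the rebalancing hyperparameters (tau_high, tau_low, gamma) to dynamically reweight reference tokens in the attention computation, strengthening attention on relevant regions and suppressing it on irrelevant ones.
Demonstrate DAR as a plug-in, training-free module that improves cross-image consistency with minimal overhead.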
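
The DAR steps listed above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the review does not give the exact reweighting formula, so the function names, the default values of `n_sample`, `tau_high`, `tau_low`, and `gamma`, and the choice to scale a token's attention by `1 + gamma` or `1 - gamma` are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dar_reference_weights(q, k_ref, n_sample=16, tau_high=0.7, tau_low=0.3,
                          gamma=0.5, rng=None):
    """Hypothetical helper: one multiplicative weight per reference token.

    q:     (heads, n_query, d) query tokens
    k_ref: (heads, n_ref, d)   reference key tokens
    Assumes 0 <= gamma < 1 so that every weight stays positive.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    _, n_query, d = q.shape
    # Step 1: sample a subset of query tokens.
    idx = rng.choice(n_query, size=min(n_sample, n_query), replace=False)
    # Step 2: attention of the sampled queries over the reference keys.
    logits = q[:, idx, :] @ k_ref.transpose(0, 2, 1) / np.sqrt(d)
    attn = softmax(logits, axis=-1)            # (heads, n_sampled, n_ref)
    # Step 3: importance per reference token, averaged over heads and queries.
    importance = attn.mean(axis=(0, 1))        # (n_ref,)
    # Step 4: min-max normalization to [0, 1].
    span = importance.max() - importance.min()
    norm = (importance - importance.min()) / (span + 1e-8)
    # Step 5: boost highly relevant tokens, suppress irrelevant ones (assumed rule).
    weights = np.ones_like(norm)
    weights[norm >= tau_high] = 1.0 + gamma
    weights[norm <= tau_low] = 1.0 - gamma
    return weights

def rebalanced_attention(q, k_ref, v_ref, weights):
    """Attention over reference tokens with per-token reweighting."""
    d = q.shape[-1]
    logits = q @ k_ref.transpose(0, 2, 1) / np.sqrt(d)
    # Adding log(w) to a logit multiplies that token's unnormalized
    # attention by w; the softmax then renormalizes each row.
    attn = softmax(logits + np.log(weights), axis=-1)
    return attn @ v_ref
```

Because the weights are computed from the model's own attention at inference time, a scheme of this shape needs no training, which matches the plug-and-play framing above.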

Figure 1 : Overview of MICON-Bench and Evaluation Pipeline. MICON-Bench is a comprehensive benchmark designed to evaluate multi-image context generation across six diverse tasks: Object Composition, Spatial Composition, Attribute Disentanglement, Component Transfer, FG/BG Composition, and Story Gene
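
The Evaluation-by-Checkpoint scoring can be sketched in a few lines of Python. The sketch assumes a sample's score is the percentage of checkpoints that pass; the 0 to 100 scale, the function name `checkpoint_score`, and the `verifier` callable (a stand-in for an MLLM call that judges one checkpoint against the generated image) are hypothetical, not the benchmark's actual API.

```python
from typing import Callable, Sequence

def checkpoint_score(checkpoints: Sequence[str],
                     verifier: Callable[[str], bool]) -> float:
    """Return the percentage of checkpoints the verifier marks as passed."""
    if not checkpoints:
        raise ValueError("at least one checkpoint is required")
    # Each checkpoint gets a binary Pass/Fail verdict.
    verdicts = [bool(verifier(c)) for c in checkpoints]
    # The verdicts are averaged into a single score (assumed 0-100 scale).
    return 100.0 * sum(verdicts) / len(verdicts)

# Illustrative checkpoints and a stub verifier; a real pipeline would
# send each checkpoint and the generated image to an MLLM instead.
example_checkpoints = [
    "The dog from reference image 1 appears in the output",
    "The background matches reference image 2",
    "The dog is positioned left of the bench",
    "No extra animals are present",
]
stub_passed = set(example_checkpoints[:3])
example_score = checkpoint_score(example_checkpoints, lambda c: c in stub_passed)
```

With the stub above, three of the four checkpoints pass, so `example_score` is 75.0.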

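Experimental Results

Research Questions

RQ1: Can current UMMs generate images coherently from multiple reference images while preserving identities and relationships across the references?
RQ2: How well does MICON-Bench expose cross-image reasoning and consistency problems in state-of-the-art models?
RQ3: Does the proposed Dynamic Attention Rebalancing (DAR) improve object recognition, spatial reasoning, and attribute consistency across multiple references?
RQ4: How does the number of reference images affect generation performance and the robustness of model fusion?
RQ5: Does MLLM-based verification correlate with standard perceptual and semantic metrics across tasks?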

Key Results

Model	Object	Spatial	Attribute	Component	FG/BG	Story	Average Score
Nano-Banana	95.60	93.79	92.13	84.23	83.13	82.84	89.25
GPT-Image	96.45	94.41	93.39	87.69	90.15	85.99	91.51
UNO	58.40	66.68	65.28	28.84	20.96	39.08	44.76
DreamOmni2	88.24	84.76	85.28	59.64	76.16	59.58	75.56
Qwen-Image-Edit-2507	96.52	88.80	78.04	42.68	72.08	63.81	72.96
BAGEL	87.64	89.96	89.84	52.40	64.64	65.09	73.55
BAGEL + DAR	88.04	91.88	90.76	56.06	71.24	66.34	76.31
OmniGen2	89.52	80.32	81.64	44.76	57.96	60.96	67.83
OmniGen2 + DAR	89.84	81.00	82.12	48.72	59.28	60.73	69.21

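DAR improves OmniGen2 and BAGEL on most tasks, with the largest gains on Component Transfer and FG/BG Composition. Story Generation improves for BAGEL (65.09 to 66.34) but dips slightly for OmniGen2 (60.96 to 60.73).
State-of-the-art UMMs struggle with cross-image consistency and often distribute attention uniformly across references.
Increasing the number of reference images degrades the performance of BAGEL and OmniGen2, suggesting fusion problems in multi-reference settings.
DAR's gains in cross-image consistency are also reflected in improved CLIP, DINO v2, and LPIPS metrics across several benchmarks.
DAR's improvements extend beyond MICON-Bench to OmniContext and XVerseBench, indicating robustness across multi-image benchmarks.
Table 1 reports model scores with and without DAR; with DAR, the average score rises from 73.55 to 76.31 for BAGEL and from 67.83 to 69.21 for OmniGen2.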
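
The per-task effect of DAR can be read directly off the table above. The short Python sketch below recomputes the differences from the reported scores; the variable and function names are illustrative only, and the "Average" column is taken as reported rather than recomputed.

```python
TASKS = ["Object", "Spatial", "Attribute", "Component", "FG/BG", "Story", "Average"]

# Scores copied from the results table above.
SCORES = {
    "BAGEL":          [87.64, 89.96, 89.84, 52.40, 64.64, 65.09, 73.55],
    "BAGEL + DAR":    [88.04, 91.88, 90.76, 56.06, 71.24, 66.34, 76.31],
    "OmniGen2":       [89.52, 80.32, 81.64, 44.76, 57.96, 60.96, 67.83],
    "OmniGen2 + DAR": [89.84, 81.00, 82.12, 48.72, 59.28, 60.73, 69.21],
}

def dar_gain(base_model: str) -> dict:
    """Score with DAR minus score without DAR, per column, in points."""
    base = SCORES[base_model]
    with_dar = SCORES[base_model + " + DAR"]
    return {t: round(w - b, 2) for t, b, w in zip(TASKS, base, with_dar)}

for model in ("BAGEL", "OmniGen2"):
    print(model, dar_gain(model))
```

The largest single gain is BAGEL on FG/BG (+6.60 points), and the only negative difference is OmniGen2 on Story (-0.23 points).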

Figure 2 : Overview of the proposed Dynamic Attention Rebalancing (DAR) mechanism. Given multiple reference images, DAR first samples query tokens and computes attention maps between sampled queries and reference key tokens. It then applies a dynamic weighting factor to rebalance attention responses


This review was generated by AI and reviewed by a human editor.