QUICK REVIEW

[Paper Review] Seeing to Ground: Visual Attention for Hallucination-Resilient MDLLMs

Vishal Narnaware, Animesh Gupta|arXiv (Cornell University)|2026. 03. 26.

Multimodal Machine Learning Applications|Citations: 0

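One-Line Summary

VISAGE is a training-free decoding framework for MDLLMs. At inference time it penalizes tokens whose cross-attention has high spatial entropy, which calibrates the decoding objective, reduces reliance on language shortcuts, and improves visual grounding.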

ABSTRACT

Multimodal Diffusion Large Language Models (MDLLMs) achieve high-concurrency generation through parallel masked decoding, yet the architectures remain prone to multimodal hallucinations. This structural vulnerability stems from an algorithmic flaw: the decoder ranks candidate tokens based on textual likelihood without verifying localized visual support. We establish that this language-only ranking induces an objective mismatch, where language probability mass acts as a misspecified proxy for the intended multimodal task. Consequently, we reinterpret hallucination as a localized optimization error, a phenomenon where the decoder exploits language shortcuts to maximize a proxy score at the expense of visual grounding. To address this objective mismatch, we introduce VISAGE, a training-free decoding framework that calibrates the objective at inference time. VISAGE estimates the proxy discrepancy by quantifying the spatial entropy of cross-attention distributions. By enforcing a localization consensus across attention heads, the method penalizes spatially uniform distributions and re-ranks token commitments to favor visually grounded outcomes. We provide an analytical stability guarantee establishing that VISAGE maintains a bounded objective loss under estimation error. Evaluations across hallucination-sensitive and general-purpose benchmarks demonstrate the robustness of the framework, yielding relative gains of 8.59% on MMMU-val and 7.75% on HallusionBench.

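Motivation and Objectives

Reframe hallucination in MDLLMs as a localized optimization error caused by a mismatch in the decoding objective.
Propose VISAGE, a training-free inference framework that calibrates decoding without retraining.
Quantify visual grounding through the spatial entropy of cross-attention and enforce a localization consensus across attention heads.
Provide a stability bound for the proposed reweighting and demonstrate robustness across benchmarks.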

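Proposed Method

Model standard decoding as optimizing a proxy objective that ignores visual grounding and therefore invites language shortcuts.
Introduce VISAGE, which computes a robust grounding entropy over image tokens from the last layer's cross-attention.
Aggregate the per-head entropies with a beta-quantile to enforce a localization consensus.
Penalize visually unsupported tokens with the weight g = 1/(1+H) raised to the power alpha, and re-rank using u_i = c_i * g^alpha.
Provide a training-free, monotone reweighting mechanism that yields a closed-form TopK rule for token selection.
Prove an analytical stability bound showing that the objective loss stays bounded under estimation error.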
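The re-ranking steps listed above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes H is the Shannon entropy (in nats) of each head's attention over image patches, that the beta-quantile is taken over those per-head entropies with linear interpolation, and that c_i is the decoder's confidence for the candidate at position i. The function names (`spatial_entropy`, `grounding_weight`, `visage_rerank`) are hypothetical, and the defaults alpha=0.3 and beta=0.25 are the values reported in the ablations.

```python
import math


def spatial_entropy(attn):
    """Shannon entropy (nats) of one head's attention over image patches."""
    total = sum(attn)
    probs = [a / total for a in attn]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def quantile(values, beta):
    """Linear-interpolation quantile of a non-empty list, beta in [0, 1]."""
    xs = sorted(values)
    pos = beta * (len(xs) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def grounding_weight(head_attns, beta=0.25):
    """g = 1 / (1 + H), where H is the beta-quantile of per-head entropies.

    Assumption: the quantile is taken over entropies, one per attention head.
    """
    h = quantile([spatial_entropy(a) for a in head_attns], beta)
    return 1.0 / (1.0 + h)


def visage_rerank(confidences, head_attns_per_token, k, alpha=0.3, beta=0.25):
    """Score each position with u_i = c_i * g_i**alpha and keep the top k."""
    scores = [
        c * grounding_weight(heads, beta) ** alpha
        for c, heads in zip(confidences, head_attns_per_token)
    ]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k], scores


# Toy example: 2 candidate positions, 4 heads each, 16 image patches.
uniform = [1.0 / 16] * 16            # diffuse attention, high entropy
peaked = [0.97] + [0.002] * 15       # localized attention, low entropy
top, scores = visage_rerank(
    confidences=[0.9, 0.8],
    head_attns_per_token=[[uniform] * 4, [peaked] * 4],
    k=1,
)
print(top, scores)
```

In this toy case the position with the higher language confidence (0.9) attends uniformly across patches, so its weight shrinks more than that of the localized position (0.8), and the localized position is committed first. Because g lies in (0, 1], the reweighting can only lower a score, which is what keeps the selection a plain TopK over u_i.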

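Experimental Results

Research Questions

RQ1: Can parallel masked decoding in MDLLMs become misaligned with the visual grounding objective and thereby induce hallucinations?
RQ2: Can a training-free re-ranking framework use cross-attention geometry to detect and penalize language shortcuts?
RQ3: Does entropy-based consensus grounding improve the quality of visually grounded generation on multimodal benchmarks?
RQ4: How stable is the proposed VISAGE reweighting under estimation error?
RQ5: How does VISAGE perform on hallucination-sensitive benchmarks and on general-purpose multimodal benchmarks?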

Key Results

Method	MMMU-val (Acc %)	HallusionBench (Acc %)	POPE (F1 %)	MME (Score)
MMaDA (Base)	27.11	34.18	75.97	1383.29
MMaDA + VCD	28.44	34.80	75.85	1342.21
MMaDA + VISAGE (Ours)	29.44	36.83	76.17	1372.05

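VISAGE improves on the base model, with relative gains of 8.59% on MMMU-val (27.11 to 29.44) and 7.75% on HallusionBench (34.18 to 36.83).
On POPE it gains 0.26% relative (75.97 to 76.17 F1), and on MME it stays close to the baseline (1372.05 vs. 1383.29), suggesting that general generation quality is preserved.
On MMMU-val, HallusionBench, and POPE, VISAGE achieves the top score, consistently exceeding both the MMaDA base model and the VCD baseline.
Ablations on MME tasks indicate that alpha=0.3 is the best setting, balancing localization against the language prior.
β-quantile head consensus (β=0.25) outperforms mean and min pooling for computing a robust grounding entropy.
VISAGE comes with a stability bound: under estimation error, the objective loss is bounded by 2k_t ε_t.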
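The reported percentages are relative gains over the MMaDA base model, not absolute percentage-point differences. The short check below recomputes them from the table; the scores are copied from the table rows, and the helper name `relative_gain` is only illustrative.

```python
def relative_gain(base, new):
    """Relative change of `new` over `base`, in percent."""
    return (new - base) / base * 100.0


# Scores copied from the results table: (MMaDA base, MMaDA + VISAGE).
results = {
    "MMMU-val": (27.11, 29.44),
    "HallusionBench": (34.18, 36.83),
    "POPE": (75.97, 76.17),
    "MME": (1383.29, 1372.05),
}

for name, (base, visage) in results.items():
    print(
        f"{name}: {visage - base:+.2f} absolute, "
        f"{relative_gain(base, visage):+.2f}% relative"
    )
```

For MMMU-val this gives +2.33 points absolute and +8.59% relative; for HallusionBench, +2.65 points and +7.75%. MME moves slightly in the other direction (about -0.81% relative), which is the sense in which it stays near the baseline.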


This review was generated by AI and reviewed by a human editor.