QUICK REVIEW

[Paper Review] Seeing to Ground: Visual Attention for Hallucination-Resilient MDLLMs

Vishal Narnaware, Animesh Gupta|arXiv (Cornell University)|Mar 26, 2026

Multimodal Machine Learning Applications | Citations: 0

One-line summary

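VISAGE is a training-free re-ranking method that recalibrates MDLLM decoding at inference time. It penalizes tokens whose cross-attention has high spatial entropy, which suppresses language shortcuts and improves visual grounding.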

ABSTRACT

Multimodal Diffusion Large Language Models (MDLLMs) achieve high-concurrency generation through parallel masked decoding, yet the architectures remain prone to multimodal hallucinations. This structural vulnerability stems from an algorithmic flaw: the decoder ranks candidate tokens based on textual likelihood without verifying localized visual support. We establish that this language-only ranking induces an objective mismatch, where language probability mass acts as a misspecified proxy for the intended multimodal task. Consequently, we reinterpret hallucination as a localized optimization error, a phenomenon where the decoder exploits language shortcuts to maximize a proxy score at the expense of visual grounding. To address this objective mismatch, we introduce VISAGE, a training-free decoding framework that calibrates the objective at inference time. VISAGE estimates the proxy discrepancy by quantifying the spatial entropy of cross-attention distributions. By enforcing a localization consensus across attention heads, the method penalizes spatially uniform distributions and re-ranks token commitments to favor visually grounded outcomes. We provide an analytical stability guarantee establishing that VISAGE maintains a bounded objective loss under estimation error. Evaluations across hallucination-sensitive and general-purpose benchmarks demonstrate the robustness of the framework, yielding relative gains of 8.59% on MMMU-val and 7.75% on HallusionBench.

Motivation and objectives

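Reframe hallucination in MDLLMs as a localized optimization error caused by a mismatch in the decoding objective.
Propose VISAGE, a training-free inference framework that calibrates decoding without retraining.
Quantify visual grounding through the spatial entropy of cross-attention and enforce a localization consensus across heads.
Provide a stability bound for the proposed re-weighting and demonstrate robustness across benchmarks.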

Proposed method

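Models decoding as optimization of a proxy objective that ignores visual grounding, which is what gives rise to language shortcuts.
Introduces VISAGE, which computes a robust grounding entropy from final-layer cross-attention over image tokens.
Aggregates per-head entropies with a β-quantile to enforce a localization consensus.
Down-weights visually unsupported tokens with the factor g = 1/(1+H) raised to the power α, re-ranking by u_i = c_i * g^α.
Provides a training-free, monotone re-weighting mechanism that yields a closed-form TopK rule for token commitment.
Presents an analytical stability bound showing that the objective loss stays bounded under estimation error.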
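The re-ranking steps listed above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' implementation. The function names, the input layout (one list of per-head attention weights over image tokens for each candidate position), the use of unnormalized Shannon entropy in nats, the linear-interpolation quantile, and the reading of c_i as a per-token confidence are all assumptions made here. The defaults α = 0.3 and β = 0.25 are the values reported in the ablations.

```python
import math
from typing import List, Sequence


def spatial_entropy(attn: Sequence[float]) -> float:
    """Shannon entropy (nats) of one head's attention over image tokens.

    The weights are renormalized to sum to 1 first (an assumption).
    """
    total = sum(attn)
    probs = [a / total for a in attn]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def quantile(values: Sequence[float], beta: float) -> float:
    """beta-quantile with linear interpolation (assumed convention)."""
    xs = sorted(values)
    pos = beta * (len(xs) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def visage_scores(
    confidences: Sequence[float],
    head_attn: Sequence[Sequence[Sequence[float]]],
    alpha: float = 0.3,
    beta: float = 0.25,
) -> List[float]:
    """Compute u_i = c_i * g_i**alpha with g_i = 1 / (1 + H_i).

    H_i is the beta-quantile of the per-head spatial entropies
    for candidate position i.
    """
    scores = []
    for c, heads in zip(confidences, head_attn):
        h = quantile([spatial_entropy(a) for a in heads], beta)
        g = 1.0 / (1.0 + h)
        scores.append(c * g ** alpha)
    return scores


def top_k(scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k highest re-weighted scores, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


# Hypothetical toy input: position 0 is more confident but attends
# uniformly over 4 image tokens; position 1 attends to a single token.
confidences = [0.9, 0.8]
head_attn = [
    [[0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]],
    [[1.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]],
]
scores = visage_scores(confidences, head_attn)
committed = top_k(scores, 1)
```

In this toy input the spatially uniform candidate is penalized enough that the sharply localized one is committed first, even though its raw confidence is lower. Because g lies in (0, 1] and α is positive, the factor g^α can only lower a score, never raise it.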

Experimental results

Research questions

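RQ1: Does parallel masked decoding in MDLLMs drift from the visual-grounding objective and thereby produce hallucinations?
RQ2: Can a training-free re-ranking framework use cross-attention geometry to detect and suppress language shortcuts?
RQ3: Does entropy-based consensus grounding improve visually grounded generation on multimodal benchmarks?
RQ4: Does the proposed VISAGE re-weighting behave stably under estimation error?
RQ5: How does VISAGE perform on both hallucination-sensitive and general-purpose multimodal benchmarks?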

Key findings

Method	MMMU-val (Acc %)	HallusionBench (Acc %)	POPE (F1 %)	MME (Score)
MMaDA (Base)	27.11	34.18	75.97	1383.29
MMaDA + VCD	28.44	34.80	75.85	1342.21
MMaDA + VISAGE (Ours)	29.44	36.83	76.17	1372.05

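VISAGE improves on the base model with relative gains of 8.59% on MMMU-val (27.11 to 29.44) and 7.75% on HallusionBench (34.18 to 36.83).
On POPE, VISAGE gains 0.26% relative (75.97 to 76.17 F1), and MME stays close to the base score (1372.05 vs. 1383.29), so general generation quality is largely preserved.
VISAGE posts the best result on MMMU-val, HallusionBench, and POPE, ahead of both the MMaDA base model and the VCD baseline.
In ablations on MME tasks, α = 0.3 is the best setting, balancing grounding against lexical priors.
β-quantile head consensus (β = 0.25) yields a more robust grounding entropy than mean or min pooling.
VISAGE comes with a stability bound: under estimation error, the objective loss is bounded by 2k_t ε_t.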
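The percentage figures quoted in these findings are relative changes, not absolute point differences. As a quick check, they can be recomputed from the table rows. The dictionary layout and the helper name `relative_gain` are introduced here for illustration only; the numbers are copied from the table.

```python
# Scores copied from the results table (base MMaDA vs. MMaDA + VISAGE).
base = {"MMMU-val": 27.11, "HallusionBench": 34.18, "POPE": 75.97, "MME": 1383.29}
visage = {"MMMU-val": 29.44, "HallusionBench": 36.83, "POPE": 76.17, "MME": 1372.05}


def relative_gain(before: float, after: float) -> float:
    """Relative change in percent: 100 * (after - before) / before."""
    return 100.0 * (after - before) / before


gains = {name: relative_gain(base[name], visage[name]) for name in base}

for name, gain in gains.items():
    print(f"{name}: {gain:+.2f}%")
```

Read this way, the POPE figure of +0.26% corresponds to an absolute change of 0.20 F1 points, and the MME score ends up slightly below the base model rather than above it.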

This review was generated by AI and checked by a human editor.