QUICK REVIEW

[Paper Review] Seeing to Ground: Visual Attention for Hallucination-Resilient MDLLMs

Vishal Narnaware, Animesh Gupta|arXiv (Cornell University)|Mar 26, 2026

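Multimodal Machine Learning Applications | Cited by 0

One-Sentence Summary

VISAGE is a training-free decoding framework for MDLLMs. At inference time it calibrates the decoding objective by penalizing the spatial entropy of cross-attention, which reduces language shortcuts and improves visual grounding.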

ABSTRACT

Multimodal Diffusion Large Language Models (MDLLMs) achieve high-concurrency generation through parallel masked decoding, yet the architectures remain prone to multimodal hallucinations. This structural vulnerability stems from an algorithmic flaw: the decoder ranks candidate tokens based on textual likelihood without verifying localized visual support. We establish that this language-only ranking induces an objective mismatch, where language probability mass acts as a misspecified proxy for the intended multimodal task. Consequently, we reinterpret hallucination as a localized optimization error, a phenomenon where the decoder exploits language shortcuts to maximize a proxy score at the expense of visual grounding. To address this objective mismatch, we introduce VISAGE, a training-free decoding framework that calibrates the objective at inference time. VISAGE estimates the proxy discrepancy by quantifying the spatial entropy of cross-attention distributions. By enforcing a localization consensus across attention heads, the method penalizes spatially uniform distributions and re-ranks token commitments to favor visually grounded outcomes. We provide an analytical stability guarantee establishing that VISAGE maintains a bounded objective loss under estimation error. Evaluations across hallucination-sensitive and general-purpose benchmarks demonstrate the robustness of the framework, yielding relative gains of 8.59% on MMMU-val and 7.75% on HallusionBench.

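Motivation and Goals

Reframe hallucination in MDLLMs as a localized optimization error caused by a mismatch in the decoding objective.
Propose a training-free inference-time framework (VISAGE) that calibrates decoding.
Quantify visual grounding through the spatial entropy of cross-attention, and enforce a localization consensus across attention heads.
Provide a stability bound for the proposed re-weighting and demonstrate robustness on benchmarks.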

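Proposed Method

Model decoding as optimizing a proxy objective that ignores visual grounding, which gives rise to language shortcuts.
Introduce VISAGE, which computes a robust grounding entropy from the last-layer cross-attention over image tokens.
Aggregate the per-head entropies with a beta-quantile to enforce a localization consensus.
Down-weight visually unsupported tokens with the penalty g = 1/(1+H), where H is the aggregated grounding entropy, raised to the power alpha, and re-rank using u_i = c_i * g^alpha.
Provide a monotone, training-free re-weighting mechanism that yields a closed-form TopK selection for token commitments.
Prove an analytical stability bound showing that the objective loss remains bounded under estimation error.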
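
The re-ranking steps listed above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the array shapes, the renormalization of attention over image patches, the use of natural-log Shannon entropy, and the reading of c_i as a per-position language confidence are all assumptions made for this example.

```python
import numpy as np

def spatial_entropy(attn, eps=1e-12):
    """Shannon entropy over image patches, one value per (position, head).

    attn: array of shape (T, H, P) with non-negative attention weights.
    Assumption: weights are renormalized over the P image patches.
    """
    p = attn / (attn.sum(axis=-1, keepdims=True) + eps)
    return -(p * np.log(np.clip(p, eps, None))).sum(axis=-1)

def visage_rerank(confidence, attn, k, alpha=0.3, beta=0.25):
    """Hypothetical sketch of entropy-penalized TopK token commitment.

    confidence: (T,) language confidence c_i per masked position (assumed).
    attn:       (T, H, P) cross-attention over P image patches for H heads.
    Returns the indices of the k positions to commit and the scores u.
    """
    head_entropy = spatial_entropy(attn)              # (T, H)
    h = np.quantile(head_entropy, beta, axis=1)       # beta-quantile across heads
    g = 1.0 / (1.0 + h)                               # penalty g = 1 / (1 + H)
    u = confidence * g ** alpha                       # u_i = c_i * g^alpha
    top = np.argsort(-u)[:k]                          # TopK by re-weighted score
    return top, u

# Toy example: 3 positions, 4 heads, 16 image patches.
num_heads, num_patches = 4, 16
uniform = np.full((num_heads, num_patches), 1.0 / num_patches)  # no localization
peaked = np.zeros((num_heads, num_patches))
peaked[:, 0] = 1.0                                              # sharply localized
attn = np.stack([uniform, peaked, peaked])
confidence = np.array([0.9, 0.8, 0.3])

top, u = visage_rerank(confidence, attn, k=1)
print("language-only choice:", int(np.argmax(confidence)))
print("re-ranked choice:", int(top[0]))
```

In this toy setup the most confident position attends uniformly over the image, so its score is discounted and the second, sharply localized position is committed instead. Because g^alpha is monotone in the entropy, a larger alpha strengthens the discount and alpha = 0 recovers language-only ranking.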

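Experimental Results

Research Questions

RQ1: Can parallel masked decoding in MDLLMs be misaligned with the visual grounding objective and thereby produce hallucinations?
RQ2: Is there a training-free re-ranking framework that can use cross-attention geometry to detect and penalize language shortcuts?
RQ3: Does entropy-based, consensus-enforcing grounding improve visually grounded generation quality on multimodal benchmarks?
RQ4: How stable is the proposed VISAGE re-weighting under estimation error?
RQ5: How does VISAGE perform on hallucination-prone and general-purpose multimodal benchmarks?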

Key Findings

Method	MMMU-val (Acc %)	HallusionBench (Acc %)	POPE (F1 %)	MME (Score)
MMaDA (Base)	27.11	34.18	75.97	1383.29
MMaDA + VCD	28.44	34.80	75.85	1342.21
MMaDA + VISAGE (Ours)	29.44	36.83	76.17	1372.05
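
The relative gains quoted in the abstract can be recomputed from the rows of this table. The short script below is only a sanity check; the dictionary layout and the helper name `relative_gain` are choices made for this example.

```python
# Scores copied from the MMaDA (Base) and MMaDA + VISAGE rows of the table.
results = {
    "MMMU-val": {"base": 27.11, "visage": 29.44},
    "HallusionBench": {"base": 34.18, "visage": 36.83},
    "POPE": {"base": 75.97, "visage": 76.17},
    "MME": {"base": 1383.29, "visage": 1372.05},
}

def relative_gain(base, new):
    """Relative change of `new` over `base`, in percent."""
    return 100.0 * (new - base) / base

gains = {
    name: round(relative_gain(r["base"], r["visage"]), 2)
    for name, r in results.items()
}

for name, gain in gains.items():
    print(f"{name}: {gain:+.2f}%")
```

The script reproduces the reported 8.59% (MMMU-val) and 7.75% (HallusionBench) as relative gains, not absolute accuracy differences. By the same measure POPE moves by +0.26% and MME by -0.81%.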

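VISAGE improves on hallucination-sensitive benchmarks: relative to the base model, MMMU-val rises by 8.59% (27.11 to 29.44) and HallusionBench by 7.75% (34.18 to 36.83).
VISAGE improves POPE by a relative 0.26% (75.97 to 76.17 F1) and stays close to the baseline on MME (1372.05 vs. 1383.29), indicating that general generation quality is maintained.
VISAGE attains the top result on MMMU-val, HallusionBench, and POPE, improving consistently over both the MMaDA and VCD baselines.
Ablations indicate that alpha = 0.3 is optimal on MME tasks, balancing grounding against the language prior.
Beta-quantile head consensus (β = 0.25) outperforms mean or min pooling for computing a robust grounding entropy.
VISAGE comes with a stability bound: under estimation error, the objective loss is bounded by 2 k_t ε_t.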
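
A bound of the form 2 k_t ε_t is what a standard TopK perturbation argument gives. The derivation below is a sketch under assumed notation and is not taken from the paper, whose definitions of k_t and ε_t may differ. Here k_t is read as the number of tokens committed at step t and ε_t as a uniform bound on the per-candidate score estimation error.

```latex
% Assumed notation (not from the paper):
%   s_i        ideal, visually grounded score of candidate i
%   \hat{s}_i  estimated score, with |\hat{s}_i - s_i| \le \varepsilon_t for all i
%   S^\star    the k_t candidates with the largest s_i
%   \hat{S}    the k_t candidates with the largest \hat{s}_i
\begin{align*}
\sum_{i \in S^\star} s_i - \sum_{i \in \hat{S}} s_i
  &\le \sum_{i \in S^\star} (\hat{s}_i + \varepsilon_t)
     - \sum_{i \in \hat{S}} (\hat{s}_i - \varepsilon_t) \\
  &= \Big( \sum_{i \in S^\star} \hat{s}_i - \sum_{i \in \hat{S}} \hat{s}_i \Big)
     + 2 k_t \varepsilon_t \\
  &\le 2 k_t \varepsilon_t .
\end{align*}
```

The last step holds because \hat{S} maximizes the sum of estimated scores over all sets of size k_t, so the bracketed difference is at most zero.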

This review was generated by AI and reviewed by human editors.