QUICK REVIEW

[Paper Digest] Beyond Static Cropping: Layer-Adaptive Visual Localization and Decoding Enhancement

Zipeng Zhu, Zhanghao Hu|arXiv (Cornell University)|Feb 4, 2026

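Multimodal Machine Learning Applications | Cited by 0

One-Sentence Summary

The paper proposes LASER, a training-free, layer-adaptive framework for LVLMs that uses Visual Activation by Query (VAQ) and Visual Activation of Tokens (VAT) for query-oriented visual localization and contrastive decoding, improving grounding and VQA accuracy on multiple benchmarks.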

ABSTRACT

Large Vision-Language Models (LVLMs) have advanced rapidly by aligning visual patches with the text embedding space, but a fixed visual-token budget forces images to be resized to a uniform pretraining resolution, often erasing fine-grained details and causing hallucinations via over-reliance on language priors. Recent attention-guided enhancement (e.g., cropping or region-focused attention allocation) alleviates this, yet it commonly hinges on a static "magic layer" empirically chosen on simple recognition benchmarks and thus may not transfer to complex reasoning tasks. In contrast to this static assumption, we propose a dynamic perspective on visual grounding. Through a layer-wise sensitivity analysis, we demonstrate that visual grounding is a dynamic process: while simple object recognition tasks rely on middle layers, complex visual search and reasoning tasks require visual information to be reactivated at deeper layers. Based on this observation, we introduce Visual Activation by Query (VAQ), a metric that identifies the layer whose attention map is most relevant to query-specific visual grounding by measuring attention sensitivity to the input query. Building on VAQ, we further propose LASER (Layer-adaptive Attention-guided Selective visual and decoding Enhancement for Reasoning), a training-free inference procedure that adaptively selects task-appropriate layers for visual localization and question answering. Experiments across diverse VQA benchmarks show that LASER significantly improves VQA accuracy across tasks with varying levels of complexity.

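Motivation and Objectives

Argue for moving beyond fixed-layer visual grounding in LVLMs, given the constraints of a fixed visual-token budget and the influence of language priors.
Show that visual grounding is a layer-dependent, query-sensitive dynamic process rather than a static one.
Develop VAQ to identify the most informative layer for a given query.
Propose LASER, a training-free procedure that uses VAT-based verification to achieve layer-adaptive localization and decoding.
Demonstrate empirical gains on diverse VQA benchmarks across models with different input resolutions.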

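Proposed Method

Contrastive attention: subtracts query-free attention from query-conditioned attention to obtain query-driven visual grounding.
VAQ (Visual Activation by Query): quantifies, layer by layer, how strongly the query modulates attention, and selects the layer that is most activated for localization.
Constrained Visual Cropping (Con-ViCrop): crops the image using the contrastive attention map of the VAQ-selected layer, focusing on the region that contains the evidence.
Visual Activation of Tokens (VAT): compares the logits for the cropped (positive) input with those for a counterfactual input in which the evidence is masked, so that decoding promotes tokens supported by visual evidence.
Layer-adaptive decoding: adds VAT to the logits (with a scaling factor) so that the output is biased toward visually grounded answer tokens.
LASER inference procedure: training-free, query-aware visual localization and decoding, enhanced by VAQ and VAT and including counterfactual verification.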
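
The localization steps above (contrastive attention, VAQ layer selection, and cropping) can be illustrated with a minimal NumPy sketch. This digest does not give the paper's exact formulas, so everything below is an assumption made for illustration: the per-layer score (total positive attention mass added by the query), the `keep` threshold, and the function names `contrastive_attention`, `vaq_scores`, `select_layer`, and `crop_box` are all hypothetical. Attention is assumed to be already averaged over heads and reshaped to a `(layers, rows, cols)` patch grid.

```python
import numpy as np

def contrastive_attention(attn_query, attn_no_query):
    """Query-conditioned attention minus query-free attention, clipped at zero.

    Both inputs have shape (layers, rows, cols). Clipping is an assumption;
    it keeps only the attention that the query adds.
    """
    return np.clip(attn_query - attn_no_query, 0.0, None)

def vaq_scores(attn_query, attn_no_query):
    """Assumed per-layer score: total attention mass added by the query."""
    diff = contrastive_attention(attn_query, attn_no_query)
    return diff.reshape(diff.shape[0], -1).sum(axis=1)

def select_layer(attn_query, attn_no_query):
    """Index of the layer whose attention responds most strongly to the query."""
    return int(np.argmax(vaq_scores(attn_query, attn_no_query)))

def crop_box(attn_map, keep=0.5):
    """Half-open box (row0, row1, col0, col1), in patch units, around all
    patches whose value is at least keep * max of the 2-D map."""
    mask = attn_map >= keep * attn_map.max()
    rows = np.where(mask.any(axis=1))[0]
    cols = np.where(mask.any(axis=0))[0]
    return int(rows[0]), int(rows[-1]) + 1, int(cols[0]), int(cols[-1]) + 1

# Toy data: 3 layers over a 4x4 patch grid (values are illustrative only).
no_query = np.full((3, 4, 4), 1 / 16)
with_query = no_query.copy()
with_query[1, 0, 0] += 0.05   # layer 1 reacts weakly to the query
with_query[2, 1, 2] += 0.40   # layer 2 reacts strongly, on two patches
with_query[2, 2, 2] += 0.30

layer = select_layer(with_query, no_query)
box = crop_box(contrastive_attention(with_query, no_query)[layer])
print(layer, box)
```

In this toy example only the deepest layer reacts strongly to the query, so it is the one chosen, and the crop box covers just the two patches that gained attention. In a real pipeline the patch-unit box would still need to be scaled to pixel coordinates before cropping the image.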

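Experimental Results

Research Questions

RQ1: Is visual grounding in LVLMs a static property of a single layer, or a dynamic process that depends on query complexity?
RQ2: Can a query-conditioned, layer-aware method improve visual localization and decoding without additional training?
RQ3: Can VAQ and VAT yield more faithful visual grounding and reduce reliance on language priors on VQA benchmarks?
RQ4: How does dynamic layer selection vary with task difficulty and LVLM architecture?
RQ5: What is the time-cost trade-off of LASER's additional attention passes and counterfactual decoding?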

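Key Findings

LASER consistently improves VQA accuracy on the POPE, TextVQA, and A-OKVQA benchmarks, outperforming static-layer attention methods and other training-free baselines.
VAQ shows that the best layer for grounding varies with query complexity: simple tasks favor middle layers, while complex reasoning favors deeper layers.
On RefCOCO+ and RefCOCOg, dynamic layer selection via VAQ concentrates more attention on the target region than raw or relative attention does.
VAT-guided contrastive decoding helps suppress language priors by promoting tokens grounded in visual evidence.
Ablations show that removing VAQ or VAT reduces the gains, and that cropping with dynamic layer selection outperforms fixed-layer cropping.
LASER incurs a moderate time overhead from the additional attention computation and counterfactual decoding, but it remains parallelizable and practical on high-end GPUs.
Experiments on LLaVA-1.5 and Qwen-VL show that LASER benefits both fixed-resolution and any-resolution LVLM architectures, with larger gains in high-resolution cropping scenarios.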
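
The VAT-guided decoding referred to in the findings can be sketched in a few lines. This is an illustrative assumption rather than the paper's formula: the digest only says that logits for the cropped input are compared with logits for an evidence-masked counterfactual input and that the result is added to the logits with a scaling factor. The function name `vat_adjusted_logits`, the parameter `alpha`, and the toy numbers are hypothetical.

```python
import numpy as np

def vat_adjusted_logits(logits_crop, logits_masked, alpha=1.0):
    """Boost tokens whose logit is higher when the visual evidence is visible.

    logits_crop:   next-token logits for the cropped (positive) input
    logits_masked: next-token logits for the evidence-masked counterfactual
    alpha:         scaling factor (assumed); alpha = 0 leaves logits unchanged
    """
    vat = logits_crop - logits_masked   # assumed form of the VAT signal
    return logits_crop + alpha * vat

# Toy vocabulary of three candidate answer tokens.
tokens = ["red", "blue", "green"]
logits_crop = np.array([2.0, 2.2, 0.1])
logits_masked = np.array([0.5, 2.1, 0.0])   # "blue" stays likely without evidence

plain = tokens[int(np.argmax(logits_crop))]
adjusted = tokens[int(np.argmax(vat_adjusted_logits(logits_crop, logits_masked)))]
print(plain, adjusted)
```

Here "blue" scores almost as high with the evidence masked as with it visible, which suggests it is driven by a language prior, whereas "red" depends on the evidence. The adjustment therefore moves the top choice from "blue" to "red".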


This digest was generated by AI and reviewed by human editors.