QUICK REVIEW

[Paper Review] Beyond Static Cropping: Layer-Adaptive Visual Localization and Decoding Enhancement

Zipeng Zhu, Zhanghao Hu | arXiv (Cornell University) | Feb 4, 2026

Multimodal Machine Learning Applications | Citations: 0

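One-Sentence Summary

This paper proposes LASER, a training-free, layer-adaptive framework for LVLMs that uses Visual Activation by Query (VAQ) and Visual Activation of Tokens (VAT) to perform query-aware visual localization and contrastive decoding, improving grounding and VQA accuracy on multiple benchmarks.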

ABSTRACT

Large Vision-Language Models (LVLMs) have advanced rapidly by aligning visual patches with the text embedding space, but a fixed visual-token budget forces images to be resized to a uniform pretraining resolution, often erasing fine-grained details and causing hallucinations via over-reliance on language priors. Recent attention-guided enhancement (e.g., cropping or region-focused attention allocation) alleviates this, yet it commonly hinges on a static "magic layer" empirically chosen on simple recognition benchmarks and thus may not transfer to complex reasoning tasks. In contrast to this static assumption, we propose a dynamic perspective on visual grounding. Through a layer-wise sensitivity analysis, we demonstrate that visual grounding is a dynamic process: while simple object recognition tasks rely on middle layers, complex visual search and reasoning tasks require visual information to be reactivated at deeper layers. Based on this observation, we introduce Visual Activation by Query (VAQ), a metric that identifies the layer whose attention map is most relevant to query-specific visual grounding by measuring attention sensitivity to the input query. Building on VAQ, we further propose LASER (Layer-adaptive Attention-guided Selective visual and decoding Enhancement for Reasoning), a training-free inference procedure that adaptively selects task-appropriate layers for visual localization and question answering. Experiments across diverse VQA benchmarks show that LASER significantly improves VQA accuracy across tasks with varying levels of complexity.

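Motivation and Objectives

Motivate the problem: fixed-layer visual grounding in LVLMs is limited by the visual-token bottleneck and by over-reliance on language priors.
Show that visual grounding is not static but a dynamic process that is layer-dependent and query-sensitive.
Develop VAQ to identify the most informative layer for a given query.
Propose LASER, a training-free method that performs layer-adaptive localization and decoding with VAT-based verification.
Demonstrate empirical gains on VQA benchmarks for models with different input resolutions.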

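Proposed Method

Contrastive Attention: isolates query-driven visual grounding by taking the difference between query-conditioned attention and query-free attention.
VAQ (Visual Activation by Query): quantifies, layer by layer, how strongly attention is modulated by the query, and selects the most strongly activated layer for localization.
Constrained Visual Cropping (Con-ViCrop): uses the contrastive attention map of the VAQ-selected layer to focus on the region that contains the evidence.
Visual Activation of Tokens (VAT): compares logits from the cropped input (positive) with logits from a counterfactual input (evidence occluded), and favors tokens grounded in visual evidence during decoding.
Layer-adaptive decoding: integrates VAT into the logits (with a scaling factor) to bias decoding toward visually grounded answer tokens.
LASER inference procedure: training-free, query-aware visual localization and decoding, enhanced with VAQ and VAT and including counterfactual verification.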
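
The layer-selection step can be illustrated with a minimal sketch. This review does not give the paper's exact VAQ formula, so the score below is an assumption: it sums, per layer, the attention mass that image patches gain when the query is present. The function names (`contrastive_attention`, `vaq_score`, `select_layer`) and the toy attention values are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Tuple

# One layer's attention over image patches, flattened to a 1-D sequence.
AttnMap = Sequence[float]


def contrastive_attention(with_query: AttnMap, without_query: AttnMap) -> List[float]:
    """Per-patch gain in attention caused by the query.

    Assumption: negative differences are clipped to zero, so only patches
    that receive MORE attention because of the query are kept.
    """
    if len(with_query) != len(without_query):
        raise ValueError("attention maps must cover the same patches")
    return [max(q - b, 0.0) for q, b in zip(with_query, without_query)]


def vaq_score(with_query: AttnMap, without_query: AttnMap) -> float:
    """Hypothetical VAQ proxy: total query-induced attention mass in a layer."""
    return sum(contrastive_attention(with_query, without_query))


def select_layer(
    with_query_layers: Sequence[AttnMap],
    without_query_layers: Sequence[AttnMap],
) -> Tuple[int, List[float]]:
    """Return the index of the layer with the highest score, plus all scores."""
    scores = [
        vaq_score(q, b) for q, b in zip(with_query_layers, without_query_layers)
    ]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Toy example: 3 layers, 4 image patches each (made-up numbers).
with_q = [
    [0.25, 0.25, 0.25, 0.25],  # layer 0: query changes nothing
    [0.10, 0.60, 0.20, 0.10],  # layer 1: query shifts attention to patch 1
    [0.05, 0.05, 0.85, 0.05],  # layer 2: query strongly activates patch 2
]
without_q = [[0.25, 0.25, 0.25, 0.25] for _ in range(3)]

best_layer, layer_scores = select_layer(with_q, without_q)
print(best_layer, layer_scores)
```

In this toy input the deepest layer is selected because the query moves the most attention mass there. A cropping step in the spirit of Con-ViCrop would then take the region from that layer's contrastive map rather than from a layer fixed in advance.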

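Experimental Results

Research Questions

RQ1: Is visual grounding in LVLMs a static property of a single layer, or a dynamic process that depends on query complexity?
RQ2: Can a query-conditioned, layer-aware approach improve visual localization and decoding without additional training?
RQ3: Do VAQ and VAT enable more faithful visual grounding and reduce reliance on language priors across VQA benchmarks?
RQ4: How does dynamic layer selection vary with task difficulty and across LVLM architectures?
RQ5: What is the time-cost trade-off of LASER's additional attention pass and counterfactual decoding?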

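Key Findings

LASER consistently improves VQA accuracy on all of the POPE, TextVQA, and A-OKVQA benchmarks compared with fixed-layer attention methods and other training-free baselines.
VAQ shows that the optimal grounding layer shifts with query complexity: simple tasks favor middle layers, while complex reasoning favors deeper layers.
Dynamic layer selection with VAQ concentrates localization attention more than Raw or Relative Attention on RefCOCO+ and RefCOCOg.
VAT-driven contrastive decoding helps suppress language priors by promoting tokens grounded in visual evidence.
In ablations, removing either VAQ or VAT reduces the gains, and cropping with dynamic layer selection outperforms fixed-layer cropping.
LASER adds a small time overhead from the additional attention pass and counterfactual decoding, but the extra work is parallelizable and stays within a practical range on high-performance GPUs.
Experiments on LLaVA-1.5 and Qwen-VL show that LASER benefits both fixed-resolution and any-resolution LVLM architectures, with larger gains in high-resolution cropping scenarios.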
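
The effect of VAT-driven contrastive decoding on language priors can be sketched with toy numbers. This review only states that VAT compares logits from the cropped (positive) input with logits from a counterfactual (evidence-occluded) input and is added to the logits with a scaling factor. The additive form below, the parameter name `alpha`, the function names, and the example logits are all assumptions for illustration, not the paper's definition.

```python
from typing import Dict

Logits = Dict[str, float]  # token -> logit, over a shared toy vocabulary


def vat(positive: Logits, counterfactual: Logits) -> Logits:
    """Hypothetical VAT: how much each token's logit depends on the evidence.

    Assumption: a plain difference between the logits from the cropped
    (positive) input and those from the evidence-occluded input.
    """
    return {tok: positive[tok] - counterfactual[tok] for tok in positive}


def vat_adjusted_logits(
    positive: Logits, counterfactual: Logits, alpha: float = 1.0
) -> Logits:
    """Bias decoding toward tokens supported by the visual evidence.

    `alpha` is an assumed scaling factor; alpha = 0 recovers plain decoding.
    """
    activation = vat(positive, counterfactual)
    return {tok: positive[tok] + alpha * activation[tok] for tok in positive}


def greedy_token(logits: Logits) -> str:
    """Pick the highest-scoring token."""
    return max(logits, key=logits.get)


# Toy case: "blue" is favored by the language prior, since it scores high
# even when the evidence is occluded; "red" needs the image region.
pos = {"red": 2.0, "blue": 2.2}
cf = {"red": 0.5, "blue": 2.1}

print(greedy_token(pos))                                   # plain decoding
print(greedy_token(vat_adjusted_logits(pos, cf, alpha=1.0)))  # VAT-adjusted
```

In this toy case plain greedy decoding picks "blue", while the adjusted logits pick "red", the token whose score actually drops when the evidence is removed. That is the sense in which the contrast suppresses answers driven mainly by the language prior.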

This review was generated by AI and checked by a human editor.