QUICK REVIEW

[Paper Review] Decoding the Pulse of Reasoning VLMs in Multi-Image Understanding Tasks

Chenjun Li|arXiv (Cornell University)|Mar 4, 2026

Multimodal Machine Learning Applications | Citations: 0

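One-line summary

PulseFocus is a training-free, inference-time method that interleaves plan and focus blocks with soft attention gating, sharpening T2I attention in multi-image reasoning and yielding consistent gains on benchmarks such as BLINK and MuirBench.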

ABSTRACT

Multi-image reasoning remains a significant challenge for vision-language models (VLMs). We investigate a previously overlooked phenomenon: during chain-of-thought (CoT) generation, the text-to-image (T2I) attention of reasoning VLMs exhibits diffuse "pulses": sporadic and unfocused attention patterns that fail to concentrate on task-relevant images. We further reveal a systematic positional bias in attention allocation across images. Motivated by these observations, we propose PulseFocus, a training-free, inference-time method that structures CoT reasoning into interleaved plan/focus blocks with soft attention gating. By forcing the model to explicitly plan which image to examine and then gating decode-time attention to the referenced image, PulseFocus sharpens attention focus and yields consistent improvements on multi-image benchmarks like BLINK benchmark (+3.7%) and MuirBench (+1.07%).

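Motivation and objectives

Investigate why reasoning VLMs struggle on multi-image tasks and characterize their internal attention dynamics during chain-of-thought generation.
Propose a training-free intervention that improves image-focused reasoning in multi-image VLMs.
Evaluate PulseFocus on standard multi-image benchmarks and quantify its gains over baselines.
Analyze attention focus qualitatively and mitigate failure modes.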

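Proposed method

Analyze text-to-image attention during chain-of-thought generation and identify diffuse pulses and positional bias.
Introduce PulseFocus: interleaved <plan> / <focus:I> prompting combined with soft attention gating.
Implement soft gating by adding a negative offset to tokens outside the referenced image during <focus:I> blocks.
Impose budgets: caps on plan and focus tokens and a maximum number of plan-focus cycles.
Compare Standard CoT, Cross Non-Causal, and Plan-Focus (without gating) across multiple models and datasets.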
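The soft gating step listed above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the penalty value of 4.0, and the choice to leave text tokens unpenalized are all assumptions made for illustration, and the review does not say which layers or heads the offset is applied to. The sketch only shows the general idea of subtracting a constant from pre-softmax attention logits at positions that belong to images other than the one named in the current <focus:I> block.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def gate_logits(logits, image_ids, focus_image, penalty=4.0):
    """Subtract `penalty` from logits at positions of non-focused images.

    image_ids[i] is the image index that key position i belongs to,
    or None for a text token. Text tokens are left untouched here,
    which is an assumption of this sketch.
    """
    gated = []
    for logit, img in zip(logits, image_ids):
        if img is not None and img != focus_image:
            gated.append(logit - penalty)  # soft gate: down-weight, do not mask
        else:
            gated.append(logit)
    return gated

# Toy sequence: two text tokens, then two tokens each for images 1 and 2.
image_ids = [None, None, 1, 1, 2, 2]
logits = [0.0] * 6  # uniform attention before gating

before = softmax(logits)
after = softmax(gate_logits(logits, image_ids, focus_image=2))

mass_before = before[4] + before[5]  # attention mass on image 2 before gating
mass_after = after[4] + after[5]     # attention mass on image 2 after gating
```

Because the offset is finite, the non-referenced images keep a small share of attention rather than being masked out entirely, which is what makes the gate "soft". Setting `penalty=0.0` recovers the ungated distribution.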

Figure 1: Example case (from MuirBench). Baseline CoT fails to focus on the key evidence image (I5): token-level T2I colouring remains diffuse, and the model cannot recognize the second car. With PulseFocus, the <focus:I5> block becomes consistently image-aligned and the final answer is corrected

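Experimental results

Research questions

RQ1: What do the internal T2I attention dynamics look like during multi-image CoT?
RQ2: Can inference-time prompting strategies reduce attention diffusion and improve image-specific reasoning?
RQ3: How does PulseFocus affect performance across model families on BLINK, MuirBench, and Visual Haystacks?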

Key findings

Model	Params	Benchmark	Baseline	Ours	Delta Acc
InternVL3.5	8B	MuirBench	56.81	57.88	+1.07
Qwen3-VL	4B	MuirBench	55.56	56.38	+0.82
InternVL3.5	8B	BLINK	50.45	54.18	+3.73
Qwen3-VL	2B	BLINK	55.55	56.40	+0.85

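PulseFocus improves multi-image reasoning performance on BLINK (InternVL3.5-8B: +3.73%) and remains competitive on MuirBench.
PulseFocus yields gains on several BLINK subtasks, most notably multi-view reasoning (+15.79) and spatial relations (+4.90).
With baseline CoT, diffuse T2I attention pulses and a bias toward early images are observed across 2,600 MuirBench samples.
Soft attention gating concentrates decode-time attention on the referenced image and reduces cross-image confusion.
Structured, interleaved plan-focus prompting helps elicit systematic per-image reasoning without training.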
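The per-image T2I attention mass behind these findings can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the paper's measurement code: it assumes one attention row per decoded token that has already been averaged over heads and layers, and `focus_share` is a hypothetical helper name introduced here to express how concentrated the image attention is.

```python
def image_attention_mass(attn_row, image_ids):
    """Sum the attention one decoded token places on each image's tokens.

    attn_row[i] is the attention weight on key position i;
    image_ids[i] is that position's image index, or None for text.
    """
    mass = {}
    for weight, img in zip(attn_row, image_ids):
        if img is not None:
            mass[img] = mass.get(img, 0.0) + weight
    return mass

def focus_share(mass, target):
    """Fraction of all image attention that lands on the target image."""
    total = sum(mass.values())
    return mass.get(target, 0.0) / total if total > 0 else 0.0

# Toy layout: one text token, then two tokens each for images 1, 2 and 3.
image_ids = [None, 1, 1, 2, 2, 3, 3]

# A diffuse row (attention spread evenly) versus a focused row (image 2).
diffuse = [0.10, 0.15, 0.15, 0.15, 0.15, 0.15, 0.15]
focused = [0.10, 0.02, 0.03, 0.40, 0.40, 0.02, 0.03]

diffuse_share = focus_share(image_attention_mass(diffuse, image_ids), target=2)
focused_share = focus_share(image_attention_mass(focused, image_ids), target=2)
```

In this toy example the diffuse row gives image 2 about one third of the image attention, while the focused row gives it most of it. Tracking such a share over decode steps is one way to draw per-image curves of the kind the Figure 2 caption describes.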

Figure 2: Attention pulse visualization. T2I attention mass per image over CoT decode steps for a counting task (the same example as in Figure 1, with six input images). Top (baseline): attention is spread across images even when discussing a specific image. Bottom (with PulseFocus): attention concent


This review was generated by AI and checked by a human editor.