QUICK REVIEW

[Paper Review] Decoding the Pulse of Reasoning VLMs in Multi-Image Understanding Tasks

Chenjun Li|arXiv (Cornell University)|2026. 03. 04.

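Multimodal Machine Learning Applications | Citations: 0

One-Line Summary

PulseFocus is a training-free, inference-time method that structures reasoning into interleaved plan and focus blocks constrained by soft attention gating, sharpening T2I attention in multi-image reasoning and yielding consistent gains on benchmarks such as BLINK and MuirBench.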

ABSTRACT

Multi-image reasoning remains a significant challenge for vision-language models (VLMs). We investigate a previously overlooked phenomenon: during chain-of-thought (CoT) generation, the text-to-image (T2I) attention of reasoning VLMs exhibits diffuse "pulses": sporadic and unfocused attention patterns that fail to concentrate on task-relevant images. We further reveal a systematic positional bias in attention allocation across images. Motivated by these observations, we propose PulseFocus, a training-free, inference-time method that structures CoT reasoning into interleaved plan/focus blocks with soft attention gating. By forcing the model to explicitly plan which image to examine and then gating decode-time attention to the referenced image, PulseFocus sharpens attention focus and yields consistent improvements on multi-image benchmarks like BLINK benchmark (+3.7%) and MuirBench (+1.07%).

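Motivation and Objectives

Examine the internal attention dynamics of reasoning VLMs to determine why they struggle on multi-image tasks.
Propose a training-free intervention that improves image-grounded reasoning in multi-image VLMs.
Evaluate PulseFocus on standard benchmarks to quantify its gains over baselines.
Provide a qualitative analysis of attention focusing and of how it mitigates failure modes.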

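Proposed Method

Analyze text-to-image (T2I) attention during chain-of-thought generation to identify diffuse pulses and positional bias.
Introduce PulseFocus: interleaved <plan> / <focus:I> prompting combined with soft attention gating.
Implement soft gating by adding a negative bias to non-focus image tokens inside <focus:I> blocks.
Enforce budgets: token limits for plan and focus blocks, and a maximum number of plan-focus cycles.
Compare against Standard CoT, Cross Non-Causal, and Plan-Focus (no gating) baselines across multiple models and datasets.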
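The soft gating step can be illustrated with a toy example. The sketch below is not the authors' implementation: it assumes the negative bias is applied to pre-softmax attention logits, and the function names, the `penalty` value, and the toy logits are all hypothetical. Only the `<focus:I5>`-style tag format comes from the review itself.

```python
import math
import re


def parse_focus_image(block_text):
    """Return the image index named by a <focus:I<n>> tag, or None if absent."""
    match = re.search(r"<focus:I(\d+)>", block_text)
    return int(match.group(1)) if match else None


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gate_logits(logits, image_ids, focus_image, penalty=4.0):
    """Subtract `penalty` from logits of tokens that belong to non-focus images.

    image_ids[i] is the image index of token i, or None for a text token.
    Text tokens and focus-image tokens are left unchanged (assumption).
    """
    return [
        x - penalty if (img is not None and img != focus_image) else x
        for x, img in zip(logits, image_ids)
    ]


def image_attention_mass(weights, image_ids):
    """Sum attention weights per image, ignoring text tokens."""
    mass = {}
    for w, img in zip(weights, image_ids):
        if img is not None:
            mass[img] = mass.get(img, 0.0) + w
    return mass


# Toy context: two text tokens, then two tokens for each of images 1 to 3.
image_ids = [None, None, 1, 1, 2, 2, 3, 3]
logits = [0.5, 0.2, 1.0, 0.9, 1.0, 1.1, 0.8, 1.0]

focus = parse_focus_image("<focus:I2> The second image shows ...")
before = image_attention_mass(softmax(logits), image_ids)
gated = gate_logits(logits, image_ids, focus)
after = image_attention_mass(softmax(gated), image_ids)

print("focus image:", focus)
print("mass before gating:", before)
print("mass after gating: ", after)
```

Because the penalty is finite, the non-focus images keep a small share of attention rather than being masked out, which is what makes the gate "soft" in this sketch.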

Figure 1: Example case (from MuirBench). Baseline CoT fails to focus on the key evidence image (I5): token-level T2I colouring remains diffuse, and the model cannot recognize the second car. With PulseFocus, the <focus:I5> block becomes consistently image-aligned and the final answer is corrected.

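Experimental Results

Research Questions

RQ1: What are the internal T2I attention dynamics of reasoning VLMs during multi-image CoT?
RQ2: Can an inference-time prompting strategy reduce attention diffusion and improve image-specific reasoning?
RQ3: How does the proposed PulseFocus affect performance on BLINK, MuirBench, and Visual Haystacks across model families?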

Key Results

Model	Parameters	Benchmark	Baseline	Ours	Δ Accuracy
InternVL3.5	8B	MuirBench	56.81	57.88	+1.07
Qwen3-VL	4B	MuirBench	55.56	56.38	+0.82
InternVL3.5	8B	BLINK	50.45	54.18	+3.73
Qwen3-VL	2B	BLINK	55.55	56.40	+0.85
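The Δ column above is the PulseFocus score minus the baseline score, in percentage points. The short check below recomputes it from the reported numbers. The relative-gain figure is an extra derived quantity not reported in the review, and the helper names are illustrative.

```python
# Rows copied from the results table: (model, size, benchmark, baseline, ours).
rows = [
    ("InternVL3.5", "8B", "MuirBench", 56.81, 57.88),
    ("Qwen3-VL", "4B", "MuirBench", 55.56, 56.38),
    ("InternVL3.5", "8B", "BLINK", 50.45, 54.18),
    ("Qwen3-VL", "2B", "BLINK", 55.55, 56.40),
]


def delta_points(baseline, ours):
    """Absolute accuracy difference in percentage points."""
    return round(ours - baseline, 2)


def relative_gain_pct(baseline, ours):
    """Improvement relative to the baseline score, in percent (derived)."""
    return round(100.0 * (ours - baseline) / baseline, 2)


for model, size, bench, baseline, ours in rows:
    print(model, size, bench,
          "delta:", delta_points(baseline, ours),
          "relative gain %:", relative_gain_pct(baseline, ours))
```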

PulseFocus improves multi-image reasoning on BLINK (InternVL3.5-8B: +3.73%) and is competitive on MuirBench.
PulseFocus yields gains on several BLINK subtasks, most notably multi-view reasoning (+15.79) and spatial relations (+4.90).
Across 2,600 MuirBench samples, baseline CoT shows uneven T2I attention pulses and a bias toward early images.
Soft attention gating concentrates decode-time attention on the referenced image and reduces cross-image confusion.
Structured, interleaved plan-focus prompting enables systematic image-by-image reasoning without any training.
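One way to make "diffuse" attention and "bias toward early images" measurable is sketched below. The review does not say which statistics the authors used, so both measures are assumptions: normalized entropy of the per-image attention mass at each decode step (1.0 is fully diffuse, 0.0 is fully concentrated), and mean mass per image position across steps. All values are invented toy data.

```python
import math


def normalized_entropy(masses):
    """Entropy of the per-image mass distribution, scaled to [0, 1]."""
    total = sum(masses)
    if total == 0 or len(masses) < 2:
        return 0.0
    probs = [m / total for m in masses]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(len(masses))


def mean_mass_by_position(steps):
    """Average attention mass for each image position over all decode steps."""
    n_images = len(steps[0])
    return [sum(step[i] for step in steps) / len(steps) for i in range(n_images)]


# Toy per-step attention mass over four images (each row is one decode step).
diffuse_steps = [
    [0.30, 0.25, 0.25, 0.20],
    [0.35, 0.25, 0.20, 0.20],
]
focused_steps = [
    [0.05, 0.85, 0.05, 0.05],
    [0.04, 0.90, 0.03, 0.03],
]

print("diffuse entropy per step:", [normalized_entropy(s) for s in diffuse_steps])
print("focused entropy per step:", [normalized_entropy(s) for s in focused_steps])
print("mean mass by position (diffuse):", mean_mass_by_position(diffuse_steps))
```

In the toy diffuse case the first image position receives the largest average mass, which is the kind of pattern a positional-bias check would surface.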

Figure 2 : Attention pulse visualization. T2I attention mass per image over CoT decode steps for a counting task (the same example as in Figure 1, with six input images). Top: baseline—attention is spread across images even when discussing a specific image. Bottom: with PulseFocus —attention concent


This review was generated by AI and checked by a human editor.