QUICK REVIEW

[Paper Review] Visual Attention Drifts, but Anchors Hold: Mitigating Hallucination in Multimodal Large Language Models via Cross-Layer Visual Anchors

Chengxu Yang, Jingling Yuan | arXiv (Cornell University) | 2026. 03. 26.

Multimodal Machine Learning Applications | Citations: 0

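One-Line Summary

CLVA is a training-free method that counteracts deep-layer attention drift in multimodal LLMs by extracting visual anchors from intermediate layers and noise anchors from early layers, improving factual grounding with minimal overhead.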

ABSTRACT

Multimodal Large Language Models often suffer from object hallucination. While existing research utilizes attention enhancement and visual retracing, we find these works lack sufficient interpretability regarding attention drift in final model stages. In this paper, we investigate the layer wise evolution of visual features and discover that hallucination stems from deep layer attention regressing toward initial visual noise from early layers. We observe that output reliability depends on acquiring visual anchors at intermediate layers rather than final layers. Based on these insights, we propose CLVA, which stands for Cross-Layer Visual Anchors, a training free method that reinforces critical mid layer features while suppressing regressive noise. This approach effectively pulls deep layer attention back to correct visual regions by utilizing essential anchors captured from attention dynamics. We evaluate our method across diverse architectures and benchmarks, demonstrating outstanding performance without significant increase in computational time and GPU memory.

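Research Motivation and Goals

Investigate how visual features evolve across layers in multimodal LLMs and identify the cause of hallucination.
Characterize the attention drift from intermediate layers to deep layers and clarify the role of early-layer noise in degrading factual content.
Develop a training-free mitigation (CLVA).
Demonstrate effectiveness and low overhead across diverse architectures on hallucination benchmarks.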

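Proposed Method

Analyze cross-modal attention using the visual grounding intensity $\Phi^{(l)}_h$ to separate visually sensitive heads from visually insensitive heads.
Define positive anchors from visually sensitive heads in intermediate layers and negative anchors from visually insensitive heads in early layers.
Compute the visual anchor masks $Z_{\mathrm{pos}}$ and $Z_{\mathrm{neg}}$ through Z-score-based outlier detection.
Re-anchor attention as $\tilde{A}(i,j) = A(i,j)\,\bigl(1 + \alpha Z_{\mathrm{pos}}(j) - \beta Z_{\mathrm{neg}}(j)\bigr)$, then renormalize to obtain $\hat{A}(i,j)$.
Provide a theoretical view that decomposes $O = AV$ into a visual component and a language-prior influence, and show how CLVA shifts the balance toward high-fidelity visual evidence.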
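
The masking and re-anchoring steps above can be illustrated with a small sketch. This is not the authors' implementation: the function names, the binary form of the masks, the Z-score threshold, the values of alpha and beta, the clamp at zero, and the sum-to-one renormalization are all assumptions made for illustration. It works on a single attention row over five hypothetical visual tokens, using only the Python standard library.

```python
from statistics import mean, pstdev


def zscore_mask(scores, threshold=1.0):
    """Return a binary mask marking high outliers (z-score > threshold).

    The binary mask and the threshold value are assumptions of this sketch.
    """
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        # No variation means no outliers.
        return [0.0] * len(scores)
    return [1.0 if (s - mu) / sigma > threshold else 0.0 for s in scores]


def reanchor(attn_row, z_pos, z_neg, alpha=0.5, beta=0.5):
    """Scale A(i,j) by (1 + alpha*Z_pos(j) - beta*Z_neg(j)), then renormalize.

    Clamping at zero and renormalizing so the row sums to 1 are assumptions;
    the review only states that the result is renormalized.
    """
    scaled = [
        max(a * (1.0 + alpha * p - beta * n), 0.0)
        for a, p, n in zip(attn_row, z_pos, z_neg)
    ]
    total = sum(scaled)
    if total == 0:
        return list(attn_row)
    return [s / total for s in scaled]


# Hypothetical attention mass per visual token (made-up numbers).
mid_layer_attn = [0.05, 0.60, 0.05, 0.05, 0.05]    # sensitive heads, mid layers
early_layer_attn = [0.05, 0.05, 0.05, 0.70, 0.05]  # insensitive heads, early layers
deep_layer_attn = [0.10, 0.20, 0.10, 0.50, 0.10]   # drifted deep-layer row

z_pos = zscore_mask(mid_layer_attn)    # positive anchor mask
z_neg = zscore_mask(early_layer_attn)  # negative (noise) anchor mask
hat_attn = reanchor(deep_layer_attn, z_pos, z_neg)
```

In this toy row, token 1 is flagged as a positive anchor and token 3 as a negative anchor, so after re-anchoring the weight on token 1 rises and the weight on token 3 falls, while the row still sums to 1.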

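Experimental Results

Research Questions

RQ1: Why does deep-layer attention drift reduce factual grounding in MLLMs?
RQ2: Can a training-free cross-layer anchoring strategy mitigate hallucination across architectures?
RQ3: How effective is CLVA across diverse LVLM backbones and benchmarks?
RQ4: What computational and memory overhead does applying CLVA during decoding incur?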

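Key Findings

Deep-layer attention tends to drift back toward early-layer visual noise, which weakens factual grounding.
Intermediate layers produce Positive Visual Anchors that accurately localize task-relevant regions.
CLVA restores deep-layer grounding by reinforcing Positive Anchors and suppressing Negative Anchors.
CLVA improves results on hallucination benchmarks across multiple models and architectures with minimal overhead.
Ablation experiments show that both POS and NEG anchors are essential to the effect.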

This review was generated by AI and checked by a human editor.