QUICK REVIEW

[Paper Review] Do VLMs Perceive or Recall? Probing Visual Perception vs. Memory with Classic Visual Illusions

Xiaoxiao Sun, Yanyan Liu|arXiv (Cornell University)|2026. 01. 29.

Face Recognition and Perception|Citations: 0

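One-Line Summary

This paper introduces VI-Probe, a controllable visual-illusion framework that separates perception from memory in large VLMs and shows that the causes of response persistence under visual illusions differ by model family.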

ABSTRACT

Large Vision-Language Models (VLMs) often answer classic visual illusions "correctly" on original images, yet persist with the same responses when illusion factors are inverted, even though the visual change is obvious to humans. This raises a fundamental question: do VLMs perceive visual changes or merely recall memorized patterns? While several studies have noted this phenomenon, the underlying causes remain unclear. To move from observations to systematic understanding, this paper introduces VI-Probe, a controllable visual-illusion framework with graded perturbations and matched visual controls (without illusion inducer) that disentangles visually grounded perception from language-driven recall. Unlike prior work that focuses on averaged accuracy, we measure stability and sensitivity using Polarity-Flip Consistency, Template Fixation Index, and an illusion multiplier normalized against matched controls. Experiments across different families reveal that response persistence arises from heterogeneous causes rather than a single mechanism. For instance, GPT-5 exhibits memory override, Claude-Opus-4.1 shows perception-memory competition, while Qwen variants suggest visual-processing limits. Our findings challenge single-cause views and motivate probing-based evaluation that measures both knowledge and sensitivity to controlled visual change. Data and code are available at https://sites.google.com/view/vi-probe/.

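Motivation and Objectives

Motivate fine-grained, perturbation-based evaluation that distinguishes visual perception from language-driven recall in VLMs.
Build a controllable benchmark with graded visual perturbations and matched controls to isolate the illusion effect.
Develop metrics that go beyond static accuracy and quantify stability and sensitivity, separating perception from memory.
Characterize how reasoning differs across VLM families, exposing heterogeneous failure modes.
Provide design guidance for evaluation and model development that balances perception and memory.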

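Proposed Method

Design VI-Probe from 27 classic visual illusions, grouped by size, color, and orientation.
Generate Original, Perturbed, Visual Control, and optional Hinted images to isolate the illusion effect.
Pair each image with forward, reversed, and instructive prompts to probe language bias.
Introduce Polarity-Flip Consistency (PFC), the Template Fixation Index (TFI), and an illusion multiplier R to separate perception from memory.
Define R as the illusion effect divided by the control effect, which attributes performance degradation to memory or to visual processing.
Evaluate 15 VLM families through API-based evaluation of OpenAI, Anthropic, Google, and Qwen models.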
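
The ratio behind the illusion multiplier R can be sketched in a few lines. This is a minimal illustration rather than the paper's implementation: the review states only that R divides the illusion effect by the control effect, so measuring each effect as an accuracy drop, the function names, and the example numbers are all assumptions made here.

```python
from typing import Optional


def accuracy_drop(acc_before: float, acc_after: float) -> float:
    """Hypothetical effect measure: accuracy lost after a perturbation."""
    return acc_before - acc_after


def illusion_multiplier(illusion_effect: float,
                        control_effect: float,
                        eps: float = 1e-9) -> Optional[float]:
    """Return R = illusion_effect / control_effect.

    Returns None when the control effect is (near) zero, since the
    ratio is undefined in that case.
    """
    if abs(control_effect) < eps:
        return None
    return illusion_effect / control_effect


# Made-up accuracies, for illustration only (not results from the paper).
illusion_effect = accuracy_drop(0.90, 0.30)  # images with the illusion inducer
control_effect = accuracy_drop(0.90, 0.60)   # matched controls without it
r = illusion_multiplier(illusion_effect, control_effect)
print(f"R = {r:.2f}")
```

Under this reading, a value of R above 1 would mean the drop is larger when the illusion inducer is present than on the matched control, and a value below 1 would mean the opposite. How the paper actually measures the two effects may differ.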

Experimental Results

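Research Questions

RQ1: Do VLMs flip their predictions when visual evidence contradicts prior knowledge (flip sensitivity)?
RQ2: Under perturbation, are model responses driven by memorized templates or by actual visual perception?
RQ3: How do visual cues, language prompts, and model architecture affect the perception-memory balance?
RQ4: Can perturbation-based counterfactual evaluation reveal model-specific failure modes beyond average accuracy?
RQ5: What design directions would improve visual reasoning in VLMs?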

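Key Findings

Response persistence under illusions arises from heterogeneous causes: memory override, perception-memory competition, and visual-processing limits.
High Polarity-Flip Consistency (PFC) does not always guarantee high accuracy; a model can be linguistically robust yet visually wrong (CbW).
Model families exhibit different mechanisms: GPT-5 and Gemini-2.5-Flash show memory override, the Claude family shows perception-memory competition, and the Qwen family shows visual-processing limits.
Illusion effects are masked by image-level averages; the illusion multiplier R reveals the relative contributions of memory and perception across models (e.g., GPT-5 R=1.97; Qwen R<1; Haiku-4.5 shows perception-first behavior).
Smaller models can outperform larger ones on perception tasks, suggesting that architecture and training choices matter beyond scale.
Visual hints tend to raise Original (visual) accuracy but often worsen Perturbed accuracy, suggesting that hints reinforce template recall and hinder flexible perception.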
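
The finding that high polarity consistency does not imply high accuracy can be illustrated with a toy calculation. This is a sketch under stated assumptions, not the paper's definition of PFC: it assumes yes/no questions where a forward prompt and its polarity-reversed prompt should receive opposite answers, and the `Trial` record, function names, and data are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trial:
    forward_answer: bool  # model's answer to the forward prompt
    reverse_answer: bool  # model's answer to the polarity-reversed prompt
    truth_forward: bool   # ground-truth answer to the forward prompt


def polarity_flip_consistency(trials: List[Trial]) -> float:
    """Fraction of trials where the answer flips with the prompt polarity."""
    if not trials:
        raise ValueError("no trials")
    return sum(t.forward_answer != t.reverse_answer for t in trials) / len(trials)


def forward_accuracy(trials: List[Trial]) -> float:
    """Fraction of trials where the forward answer matches the ground truth."""
    if not trials:
        raise ValueError("no trials")
    return sum(t.forward_answer == t.truth_forward for t in trials) / len(trials)


# Made-up trials on perturbed images: the model keeps giving the textbook
# answer (True) even where the image now supports False.
trials = [
    Trial(True, False, False),  # consistent across prompts, but wrong
    Trial(True, False, False),  # consistent across prompts, but wrong
    Trial(True, False, True),   # consistent and correct
    Trial(True, True, True),    # inconsistent across prompts
]
pfc = polarity_flip_consistency(trials)
acc = forward_accuracy(trials)
print(f"PFC = {pfc:.2f}, accuracy = {acc:.2f}")
```

Here three of the four trials are polarity-consistent while only two are correct, so the consistency score exceeds accuracy: the answers track the wording of the prompt reliably without tracking the image.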

This review was generated by AI and reviewed by a human editor.