[Paper Review] Do VLMs Perceive or Recall? Probing Visual Perception vs. Memory with Classic Visual Illusions
The paper introduces VI-Probe, a controllable visual-illusion framework that disentangles perception from memory in large VLMs, revealing heterogeneous, model-family-specific causes of response persistence under visual illusions.
Large Vision-Language Models (VLMs) often answer classic visual illusions "correctly" on original images, yet persist with the same responses when illusion factors are inverted, even though the visual change is obvious to humans. This raises a fundamental question: do VLMs perceive visual changes or merely recall memorized patterns? While several studies have noted this phenomenon, the underlying causes remain unclear. To move from observations to systematic understanding, this paper introduces VI-Probe, a controllable visual-illusion framework with graded perturbations and matched visual controls (without illusion inducer) that disentangles visually grounded perception from language-driven recall. Unlike prior work that focuses on averaged accuracy, we measure stability and sensitivity using Polarity-Flip Consistency, Template Fixation Index, and an illusion multiplier normalized against matched controls. Experiments across different families reveal that response persistence arises from heterogeneous causes rather than a single mechanism. For instance, GPT-5 exhibits memory override, Claude-Opus-4.1 shows perception-memory competition, while Qwen variants suggest visual-processing limits. Our findings challenge single-cause views and motivate probing-based evaluation that measures both knowledge and sensitivity to controlled visual change. Data and code are available at https://sites.google.com/view/vi-probe/.
Research Motivation and Objectives
- Motivate a fine-grained, perturbation-based evaluation to distinguish visual perception from language-driven recall in VLMs.
- Create a controllable benchmark with graded visual perturbations and matched controls to isolate illusion effects.
- Develop metrics that quantify stability and sensitivity beyond static accuracy (perception vs. memory).
- Characterize how different VLM families reason under illusion, revealing heterogeneous failure modes.
- Provide design guidance for evaluation and model development to balance perception and memory.
Proposed Method
- Design VI-Probe with 27 classic visual illusions grouped by size, color, and orientation.
- Generate Original, Perturbed, Visual Control, and Optional Hinted images to isolate illusion effects.
- Pair each image with forward, reversed, and instructional prompts to probe language priors.
- Introduce metrics: Polarity-Flip Consistency (PFC), Template Fixation Index (TFI), and illusion multiplier R to separate perception from memory.
- Compute R as the illusion effect divided by the control effect to attribute degradation to memory vs. visual processing.
- Evaluate 15 VLMs from the OpenAI, Anthropic, Google, and Qwen families through their APIs.
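The ratio described in the list above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes that an "effect" is the accuracy drop from the Original to the Perturbed image, measured once on illusion images and once on the matched visual controls. The function names and the example accuracies are hypothetical.

```python
def accuracy_drop(acc_original: float, acc_perturbed: float) -> float:
    """Accuracy lost when moving from the Original to the Perturbed image."""
    return acc_original - acc_perturbed


def illusion_multiplier(illusion_drop: float, control_drop: float,
                        eps: float = 1e-9) -> float:
    """Ratio R of illusion effect to control effect (assumed definition)."""
    if abs(control_drop) < eps:
        # With no measurable control effect the ratio is undefined.
        raise ValueError("control effect is ~0; R is undefined")
    return illusion_drop / control_drop


# Hypothetical accuracies, invented for illustration only.
illusion_drop = accuracy_drop(acc_original=0.90, acc_perturbed=0.30)
control_drop = accuracy_drop(acc_original=0.90, acc_perturbed=0.60)
r = illusion_multiplier(illusion_drop, control_drop)
print(f"R = {r:.2f}")
```

Under this reading, R > 1 means the model degrades more when the illusion inducer is present than on the matched control, which is the pattern the review associates with memory-driven answers. R < 1 would point toward a general visual-processing limit instead.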
Experimental Results
Research Questions
- RQ1: Do VLMs flip their predictions when visual evidence contradicts prior knowledge (flip sensitivity)?
- RQ2: Are model responses under perturbations driven by memory templates or actual visual perception?
- RQ3: How do visual cues, linguistic prompts, and model architecture influence perception-memory balance?
- RQ4: Can perturbation-based, counterfactual evaluation reveal model-specific failure modes beyond average accuracy?
- RQ5: What design directions can improve perception-grounded visual reasoning in VLMs?
Key Findings
- Response persistence under illusions stems from heterogeneous causes: memory override, perception–memory competition, and visual-processing limits.
- High polarity-consistency (PFC) does not guarantee high accuracy; models can be linguistically robust yet visually wrong (CbW).
- Model families show distinct mechanisms: GPT-5 and Gemini-2.5-Flash exhibit memory override; Claude variants show perception–memory competition; Qwen variants show visual-processing limits.
- Illusion effects are masked by image-level averages; the illusion multiplier R separates memory from perception contributions across models (e.g., GPT-5 has R=1.97, Qwen variants have R<1, and Haiku-4.5 shows perception-first behavior).
- Smaller models can outperform larger ones on perception tasks, indicating architecture/training choices matter beyond scale.
- Visual hints tend to boost accuracy on Original images but often degrade accuracy on Perturbed images, suggesting that hints reinforce template retrieval rather than flexible perception.
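The finding that consistency across prompt polarity does not guarantee accuracy can be made concrete with a toy example. The score below is a simplified stand-in, not the paper's PFC definition: it assumes yes/no answers to a forward prompt and its reversed counterpart, and counts a trial as consistent when the two answers are opposite. The `Trial` type, the function names, and the data are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    forward_answer: bool   # model's answer to e.g. "Is A longer than B?"
    reversed_answer: bool  # model's answer to e.g. "Is B longer than A?"
    truth_forward: bool    # ground-truth answer to the forward prompt


def polarity_consistency(trials):
    """Share of trials whose forward and reversed answers are opposite."""
    return sum(t.forward_answer != t.reversed_answer for t in trials) / len(trials)


def forward_accuracy(trials):
    """Share of trials whose forward answer matches the ground truth."""
    return sum(t.forward_answer == t.truth_forward for t in trials) / len(trials)


# Hypothetical perturbed-image trials: the model keeps giving the
# template answer ("A is longer") although B is now longer.
trials = [Trial(forward_answer=True, reversed_answer=False, truth_forward=False)
          for _ in range(4)]

print(polarity_consistency(trials), forward_accuracy(trials))
```

In this toy case the model is perfectly consistent across prompt polarity yet wrong on every trial, which is why the review treats consistency and accuracy as separate measurements.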
This review was generated by AI and checked by a human editor.