QUICK REVIEW

[Paper Review] Do VLMs Perceive or Recall? Probing Visual Perception vs. Memory with Classic Visual Illusions

Xiaoxiao Sun, Yanyan Liu|arXiv (Cornell University)|Jan 29, 2026

Face Recognition and Perception | Cited by 0

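ONE-SENTENCE SUMMARY

This paper introduces VI-Probe, a controllable visual-illusion framework that disentangles perception from memory in large VLMs and shows that response persistence under visual illusions has heterogeneous, model-family-specific causes.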

ABSTRACT

Large Vision-Language Models (VLMs) often answer classic visual illusions "correctly" on original images, yet persist with the same responses when illusion factors are inverted, even though the visual change is obvious to humans. This raises a fundamental question: do VLMs perceive visual changes or merely recall memorized patterns? While several studies have noted this phenomenon, the underlying causes remain unclear. To move from observations to systematic understanding, this paper introduces VI-Probe, a controllable visual-illusion framework with graded perturbations and matched visual controls (without illusion inducer) that disentangles visually grounded perception from language-driven recall. Unlike prior work that focuses on averaged accuracy, we measure stability and sensitivity using Polarity-Flip Consistency, Template Fixation Index, and an illusion multiplier normalized against matched controls. Experiments across different families reveal that response persistence arises from heterogeneous causes rather than a single mechanism. For instance, GPT-5 exhibits memory override, Claude-Opus-4.1 shows perception-memory competition, while Qwen variants suggest visual-processing limits. Our findings challenge single-cause views and motivate probing-based evaluation that measures both knowledge and sensitivity to controlled visual change. Data and code are available at https://sites.google.com/view/vi-probe/.

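MOTIVATION AND GOALS

Motivate a fine-grained, perturbation-based evaluation that separates visual perception from language-driven recall in VLMs.
Build a controllable benchmark with graded visual perturbations and matched controls to isolate illusion effects.
Develop metrics that quantify stability and sensitivity (perception vs. memory) rather than static accuracy alone.
Characterize how different VLM families reason under illusions, exposing heterogeneous failure modes.
Offer design guidance for evaluation and model development that balances perception and memory.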

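PROPOSED METHOD

Design VI-Probe, covering 27 classic visual illusions grouped by size, color, and orientation.
Generate Original, Perturbed, Visual Control, and optional Hinted images to isolate illusion effects.
Pair each image with forward, reversed, and instructional prompts to probe language bias.
Introduce three metrics, Polarity-Flip Consistency (PFC), Template Fixation Index (TFI), and the illusion multiplier R, to separate perception from memory.
Compute R as the illusion effect divided by the control effect, so that degradation can be attributed to memory or to visual processing.
Run API-based evaluation on 15 VLMs from the OpenAI, Anthropic, Google, and Qwen families.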
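
The ratio behind the illusion multiplier R can be sketched as follows. This is a minimal sketch under stated assumptions, not the paper's implementation: an "effect" is assumed here to be the accuracy drop from Original to Perturbed images, measured once on illusion images and once on matched visual controls. The function names and all accuracy values are hypothetical.

```python
def accuracy_drop(acc_original: float, acc_perturbed: float) -> float:
    """Accuracy lost when moving from Original to Perturbed images.

    Assumption: this drop is what the summary calls an "effect".
    """
    return acc_original - acc_perturbed


def illusion_multiplier(illusion_effect: float, control_effect: float,
                        eps: float = 1e-9) -> float:
    """R = illusion effect / control effect, as described in the summary."""
    if abs(control_effect) < eps:
        # The ratio is undefined when the control shows no degradation.
        raise ValueError("control effect is ~0; R is undefined")
    return illusion_effect / control_effect


# Hypothetical accuracies, for illustration only (not results from the paper).
illusion_effect = accuracy_drop(acc_original=0.90, acc_perturbed=0.30)
control_effect = accuracy_drop(acc_original=0.85, acc_perturbed=0.55)
R = illusion_multiplier(illusion_effect, control_effect)
print(round(R, 2))
```

Under this reading, R above 1 means the model degrades more when the illusion inducer is present than on the matched control, which the summary associates with memory-driven responses; R below 1 points toward general visual-processing limits.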

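EXPERIMENTAL RESULTS

Research Questions

RQ1: When visual evidence contradicts prior knowledge, do VLMs flip their predictions (flip sensitivity)?
RQ2: Are model responses under perturbation driven by memorized templates or by actual visual perception?
RQ3: How do visual cues, language prompts, and model architecture affect the perception-memory balance?
RQ4: Can perturbation-based counterfactual evaluation reveal model-specific failure modes beyond average accuracy?
RQ5: Which design directions could improve perception-grounded visual reasoning in VLMs?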

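Key Findings

Response persistence under illusions stems from several causes: memory override, perception-memory competition, and limits in visual processing.
High Polarity-Flip Consistency (PFC) does not guarantee high accuracy; a model can be linguistically robust yet visually wrong (CbW).
Model families show different mechanisms: GPT-5 and Gemini-2.5-Flash exhibit memory override; Claude variants exhibit perception-memory competition; Qwen variants exhibit visual-processing limits.
Illusion effects are masked in image-level averages; the illusion multiplier R exposes the memory and perception contributions across models (e.g., GPT-5 R=1.97; Qwen R<1; Haiku-4.5 shows perception-first behavior).
Smaller models can outperform larger ones on perception tasks, suggesting that architecture and training choices matter beyond scale.
Visual hints usually raise accuracy on Original (visual) images but often lower accuracy on Perturbed images, suggesting that hints reinforce template retrieval rather than flexible perception.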
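
The point that high PFC does not imply high accuracy can be illustrated with a toy computation. This is a sketch under assumptions: PFC is taken here to be the share of trials in which the answers to the forward and reversed prompts agree once mapped into the same label space, which may differ from the paper's exact definition. The `Trial` type, the labels, and the data are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    truth: str           # ground-truth label for the image
    forward: str         # answer to the forward prompt
    reverse_mapped: str  # answer to the reversed prompt, mapped to forward labels


def pfc(trials: list[Trial]) -> float:
    """Assumed PFC: fraction of trials whose two prompt polarities agree."""
    return sum(t.forward == t.reverse_mapped for t in trials) / len(trials)


def accuracy(trials: list[Trial]) -> float:
    """Fraction of trials whose forward answer matches the ground truth."""
    return sum(t.forward == t.truth for t in trials) / len(trials)


# Toy Perturbed-image trials: the lines now really differ in length, but the
# model mostly keeps giving the memorized "equal" answer under both polarities.
trials = [
    Trial(truth="left_longer", forward="equal", reverse_mapped="equal"),
    Trial(truth="left_longer", forward="equal", reverse_mapped="equal"),
    Trial(truth="left_longer", forward="equal", reverse_mapped="equal"),
    Trial(truth="left_longer", forward="left_longer", reverse_mapped="left_longer"),
]

print(pfc(trials))       # 1.0
print(accuracy(trials))  # 0.25
```

In this toy case the model is perfectly consistent across prompt polarity yet wrong on three of four images, which is the "linguistically robust yet visually wrong" pattern described in the findings.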


This review was generated by AI and reviewed by a human editor.