QUICK REVIEW

[Paper Review] SMSP: A Plug-and-Play Strategy of Multi-Scale Perception for MLLMs to Perceive Visual Illusions

Jinzhe Tu, Ruilei Guo|arXiv (Cornell University)|Mar 24, 2026

Topic: Generative Adversarial Networks and Image Synthesis | Cited by: 0

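One-Sentence Summary

This paper introduces IlluChar, a character-based visual illusion dataset, identifies high-frequency attention bias as the failure mode of multimodal large language models (MLLMs) in illusion recognition, and proposes SMSP, a plug-and-play perception module combined with a multi-scale strategy, which improves illusion perception without retraining.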

ABSTRACT

Recent works have shown that Multimodal Large Language Models (MLLMs) are highly vulnerable to hidden-pattern visual illusions, where the hidden content is imperceptible to models but obvious to humans. This deficiency highlights a perceptual misalignment between current MLLMs and humans, and also introduces potential safety concerns. To systematically investigate this failure, we introduce IlluChar, a comprehensive and challenging illusion dataset, and uncover a key underlying mechanism for the models' failure: high-frequency attention bias, where the models are easily distracted by high-frequency background textures in illusion images, causing them to overlook hidden patterns. To address the issue, we propose the Strategy of Multi-Scale Perception (SMSP), a plug-and-play framework that aligns with human visual perceptual strategies. By suppressing distracting high-frequency backgrounds, SMSP generates images closer to human perception. Our experiments demonstrate that SMSP significantly improves the performance of all evaluated MLLMs on illusion images, for instance, increasing the accuracy of Qwen3-VL-8B-Instruct from 13.0% to 84.0%. Our work provides novel insights into MLLMs' visual perception, and offers a practical and robust solution to enhance it. Our code is publicly available at https://github.com/Tujz2023/SMSP.

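Research Motivation and Goals

Show that MLLMs are vulnerable to hidden-pattern visual illusions and identify the key failure mechanism (high-frequency attention bias).
Build IlluChar, a challenging character-based illusion dataset covering different scales and backgrounds.
Propose SMSP, which combines a perception module with a multi-scale strategy, and evaluate its effectiveness across models, backgrounds, and scales.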

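Proposed Method

Construct IlluChar: an illusion dataset that embeds characters (digits, letters, Chinese characters) in semantic and noise backgrounds.
Analyze illusion images to reveal high-frequency attention bias as the mechanism behind the failures.
Develop SMSP, which consists of a perception module (high-frequency filtering and spatial resampling) and a multi-scale strategy (K perceptually processed variants) that gives the model multiple cues.
Formulate the perception module as a two-step process: (i) low-pass filtering in the frequency domain; (ii) downsampling the result and centering it on a white canvas to simulate viewing from a distance.
Combine the processed variants with the original image to form I_SMSP and feed all inputs to the MLLM.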
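The two-step perception module and the multi-scale combination described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes a grayscale image stored as a NumPy array with values in [0, 1], uses an ideal (hard-cutoff) low-pass mask, and shrinks by block averaging. The function names, the `cutoff_ratio` and `factor` parameters, and the three default settings are all hypothetical choices made for this example.

```python
import numpy as np


def low_pass_filter(image, cutoff_ratio=0.1):
    """Ideal low-pass filter in the frequency domain.

    Keeps only frequencies within cutoff_ratio * min(H, W) of the DC term.
    Assumes a 2-D grayscale array with values in [0, 1].
    """
    h, w = image.shape
    # Move the DC component to the center of the spectrum.
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    yy, xx = np.ogrid[:h, :w]
    dist = np.sqrt((yy - h // 2) ** 2 + (xx - w // 2) ** 2)
    mask = dist <= cutoff_ratio * min(h, w)
    filtered = np.fft.ifft2(np.fft.ifftshift(spectrum * mask)).real
    return np.clip(filtered, 0.0, 1.0)


def downsample_on_canvas(image, factor):
    """Shrink by an integer factor and paste the result, centered,
    on a white canvas of the original size."""
    h, w = image.shape
    sh, sw = h // factor, w // factor
    # Block averaging: each factor x factor tile becomes one pixel.
    cropped = image[: sh * factor, : sw * factor]
    small = cropped.reshape(sh, factor, sw, factor).mean(axis=(1, 3))
    canvas = np.ones((h, w), dtype=float)  # 1.0 is white
    top, left = (h - sh) // 2, (w - sw) // 2
    canvas[top : top + sh, left : left + sw] = small
    return canvas


def smsp_inputs(image, settings=((0.2, 1), (0.1, 2), (0.05, 4))):
    """Return the original image followed by K processed variants.

    Each (cutoff_ratio, factor) pair is an illustrative setting,
    not a value taken from the paper.
    """
    variants = [
        downsample_on_canvas(low_pass_filter(image, cutoff), factor)
        for cutoff, factor in settings
    ]
    return [image] + variants
```

With the default settings this yields K = 3 variants plus the original image, all of which would then be passed to the MLLM alongside the prompt. A Gaussian mask or a different resampling method could be swapped in without changing the overall structure.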

Figure 1. Top : An illusion image with an emergency signal. The model’s attention is dispersed by the background and fails to detect it, while humans can identify it by adjusting their perception. Bottom : After processing the image to simulate such perceptual adjustments, the model can focus on the

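Experimental Results

Research Questions

RQ1: In terms of frequency content, how do illusion images differ from the original images in their visual characteristics?
RQ2: How does high-frequency background information affect MLLMs' attention and their recognition of hidden patterns?
RQ3: Can a perception-aware plug-and-play strategy improve illusion recognition without retraining?
RQ4: Does the proposed SMSP preserve performance on standard (non-illusion) tasks and generalize across different patterns and scales?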

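Key Findings

On IlluChar, the illusion-recognition accuracy of most MLLMs drops by more than 65% relative to clean character images.
Illusion backgrounds raise mid- and high-frequency energy; MLLM attention shifts from the hidden characters to the background (high-frequency attention bias).
SMSP improves illusion accuracy for the six evaluated MLLMs on both background types; for example, Qwen3-VL-8B-Instruct rises from 13.0% to 84.0% on IlluChar overall.
The perception module best restores the model's attention and recognition ability when high-frequency filtering and spatial resampling are combined (from 59.6% to 88.3%).
The multi-scale strategy (K variants) significantly improves accuracy on large, medium, and small hidden patterns; K=3 balances performance and computation.
SMSP maintains or improves performance on original non-illusion inputs and remains compatible with standard visual question answering tasks.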
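The finding about raised mid- and high-frequency energy can be made concrete with a simple spectral measurement. The sketch below is one possible way to quantify it, not the analysis used in the paper: it computes the share of (mean-removed) spectral power lying outside a circular low-frequency region. The function name and the `cutoff_ratio` threshold are assumptions made for this example.

```python
import numpy as np


def high_freq_energy_ratio(image, cutoff_ratio=0.1):
    """Fraction of spectral power beyond cutoff_ratio * min(H, W) from DC.

    The image mean is removed first so overall brightness does not
    dominate the total. Expects a 2-D grayscale array.
    """
    h, w = image.shape
    centered = image - image.mean()
    power = np.abs(np.fft.fftshift(np.fft.fft2(centered))) ** 2
    yy, xx = np.ogrid[:h, :w]
    dist = np.sqrt((yy - h // 2) ** 2 + (xx - w // 2) ** 2)
    total = power.sum()
    if total == 0:
        return 0.0  # constant image: no spectral content to compare
    return float(power[dist > cutoff_ratio * min(h, w)].sum() / total)
```

Under this measure, a smooth image scores near 0, while the same image overlaid with a fine texture scores noticeably higher, which matches the direction of the effect reported for illusion backgrounds.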

Figure 2. Examples across different categories in IlluChar.


This review was generated by AI and checked by a human editor.