QUICK REVIEW

[Paper Review] When Visual Evidence is Ambiguous: Pareidolia as a Diagnostic Probe for Vision Models

Qianpu Chen, Derya Soydaner|arXiv (Cornell University)|Mar 4, 2026
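Face Recognition and Perception | Cited by 0
One-Sentence Summary

The paper proposes a unified pareidolia-based diagnostic framework that analyzes the detection, localization, uncertainty, and bias of six models, spanning four representational regimes, on ambiguous face-like stimuli from the FacesInThings dataset.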

ABSTRACT

When visual evidence is ambiguous, vision models must decide whether to interpret face-like patterns as meaningful. Face pareidolia, the perception of faces in non-face objects, provides a controlled probe of this behavior. We introduce a representation-level diagnostic framework that analyzes detection, localization, uncertainty, and bias across class, difficulty, and emotion in face pareidolia images. Under a unified protocol, we evaluate six models spanning four representational regimes: vision-language models (VLMs; CLIP-B/32, CLIP-L/14, LLaVA-1.5-7B), pure vision classification (ViT), general object detection (YOLOv8), and face detection (RetinaFace). Our analysis reveals three mechanisms of interpretation under ambiguity. VLMs exhibit semantic overactivation, systematically pulling ambiguous non-human regions toward the Human concept, with LLaVA-1.5-7B producing the strongest and most confident over-calls, especially for negative emotions. ViT instead follows an uncertainty-as-abstention strategy, remaining diffuse yet largely unbiased. Detection-based models achieve low bias through conservative priors that suppress pareidolia responses even when localization is controlled. These results show that behavior under ambiguity is governed more by representational choices than score thresholds, and that uncertainty and bias are decoupled: low uncertainty can signal either safe suppression, as in detectors, or extreme over-interpretation, as in VLMs. Pareidolia therefore provides a compact diagnostic and a source of ambiguity-aware hard negatives for probing and improving the semantic robustness of vision-language systems. Code will be released upon publication.

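Motivation and Objectives

  • Introduce a compact pareidolia-based diagnostic pipeline for studying detection, localization, uncertainty, and bias under ambiguity.
  • Apply the diagnostic framework to six models on the FacesInThings dataset, covering four representational regimes.
  • Characterize how ambiguity, emotion, and difficulty modulate model behavior and bias.
  • Show that uncertainty and bias are decoupled, and that behavior depends on representational priors rather than score thresholds.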

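Proposed Method

  • Use FacesInThings as the pareidolia stimulus set, with human-annotated face-like regions grouped into five coarse classes (Human, Animal, Cartoon, Alien, Other).
  • Evaluate six models spanning four regimes: CLIP-B/32, CLIP-L/14, LLaVA-1.5-7B, ViT-B/16, YOLOv8, and RetinaFace.
  • Map model predictions into a common five-class space, and align predictions with ground-truth regions using a relaxed IoU criterion (≥ 0.2) or a center-containment rule.
  • Compute the core metrics: Detection Rate, Primary Pareidolia Detection Rate (PPDR), Representation Ambiguity Index (RAI), False Bias Score (FBS), and image-level and box-level bias measures.
  • Evaluate under ground-truth-box (GT-box) control to separate localization from semantic gating in the detectors.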
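The alignment rule in the method list (relaxed IoU ≥ 0.2 or center containment) can be sketched as follows. This is a minimal illustration, not the authors' code: the `(x1, y1, x2, y2)` box format, the function names, and the reading of "center containment" as "the predicted box's center lies inside the ground-truth box" are all assumptions made for this example.

```python
from typing import Tuple

# Assumed box format: (x1, y1, x2, y2) in pixel coordinates.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def center_inside(pred: Box, gt: Box) -> bool:
    """Assumed reading of the rule: the prediction's center falls in the GT box."""
    cx = (pred[0] + pred[2]) / 2
    cy = (pred[1] + pred[3]) / 2
    return gt[0] <= cx <= gt[2] and gt[1] <= cy <= gt[3]


def is_match(pred: Box, gt: Box, iou_threshold: float = 0.2) -> bool:
    """A prediction aligns with a GT region if either criterion holds."""
    return iou(pred, gt) >= iou_threshold or center_inside(pred, gt)


gt_box = (0.0, 0.0, 10.0, 10.0)
small_pred = (4.0, 4.0, 6.0, 6.0)    # IoU is only 0.04, but the center is inside
far_pred = (8.0, 8.0, 30.0, 30.0)    # tiny overlap and the center is outside
print(is_match(small_pred, gt_box), is_match(far_pred, gt_box))
```

The center rule matters for small predictions inside a large face-like region: their IoU can fall well below 0.2 even though they clearly point at the annotated region.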
Figure 1: Face pareidolia in an electrical outlet. The visual input is unchanged, yet observers may perceive a face, illustrating how interpretation emerges under ambiguity.

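Experimental Results

Research Questions

  • RQ1: How do different model families allocate semantic evidence when faced with ambiguous pareidolia stimuli?
  • RQ2: How do the mechanisms that drive pareidolia responses (bias, uncertainty, priors) differ across vision-language models, pure vision models, and detectors?
  • RQ3: How do emotion and difficulty affect each model's pareidolia bias?
  • RQ4: Under ambiguity, is uncertainty a reliable predictor of semantic safety across model types?
  • RQ5: Can pareidolia serve as a diagnostic tool for improving the semantic robustness of vision and vision-language systems?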

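Key Findings

  • Vision-language models show strong semantic activation toward Human on non-human pareidolia regions; LLaVA-1.5-7B produces the strongest and most confident over-calls, especially for negative emotions.
  • The pure vision model (ViT) follows an uncertainty-as-abstention strategy: under ambiguity its predictions remain diffuse yet largely unbiased.
  • The detectors (YOLOv8, RetinaFace) achieve low bias through strong, conservative priors that suppress pareidolia responses even when localization is controlled.
  • Uncertainty and bias are decoupled: uncertainty alone does not indicate safety, because low uncertainty can accompany either extreme over-interpretation (as in LLaVA) or safe suppression (as in the detectors).
  • Emotion modulates VLM bias (negative emotions increase Human over-calls), whereas emotion effects are weaker for the detectors and the pure vision model.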
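The decoupling of uncertainty and bias can be made concrete with two simple proxies. This is a hedged sketch only: the review does not give the formulas for RAI or FBS, so the code below does not reproduce them. It assumes normalized Shannon entropy over the five classes as a stand-in for uncertainty and the fraction of non-Human regions labeled Human as a stand-in for bias. The function names and the toy probabilities are invented for illustration and are not results from the paper.

```python
import math
from typing import Sequence

CLASSES = ["Human", "Animal", "Cartoon", "Alien", "Other"]


def normalized_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy divided by log(K): 0 means certain, 1 means uniform."""
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(probs))


def human_overcall_rate(pred_labels: Sequence[str], gt_labels: Sequence[str]) -> float:
    """Fraction of regions whose ground truth is not Human but are predicted Human."""
    non_human = [p for p, g in zip(pred_labels, gt_labels) if g != "Human"]
    if not non_human:
        return 0.0
    return sum(p == "Human" for p in non_human) / len(non_human)


# Toy distributions (illustrative values only).
confident = [0.96, 0.01, 0.01, 0.01, 0.01]   # sharply peaked on Human
diffuse = [0.2, 0.2, 0.2, 0.2, 0.2]          # spread evenly over all classes

print(round(normalized_entropy(confident), 3), normalized_entropy(diffuse))

gt = ["Animal", "Cartoon", "Other", "Human"]
over_caller = ["Human", "Human", "Other", "Human"]
print(human_overcall_rate(over_caller, gt))
```

A low entropy value says nothing by itself about the over-call rate: a model that confidently answers Human on non-human regions and a model that confidently rejects them both score near zero on the first proxy while sitting at opposite ends of the second.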
Figure 2: Example images from the FacesInThings dataset [hamilton2024seeing]. Red bounding boxes indicate face-like regions perceived by human observers in otherwise inanimate objects.


This review was generated by AI and reviewed by human editors.