QUICK REVIEW

[Paper Review] ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness

Robert Geirhos, Patricia Rubisch|arXiv (Cornell University)|Nov 29, 2018

Face Recognition and Perception | Cited by 664

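One-Sentence Summary

ImageNet-trained convolutional neural networks rely on texture rather than shape; training on Stylized-ImageNet induces a shape bias that improves accuracy and robustness, showing the benefits of moving beyond texture-based representations.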

ABSTRACT

Convolutional Neural Networks (CNNs) are commonly thought to recognise objects by learning increasingly complex representations of object shapes. Some recent studies suggest a more important role of image textures. We here put these conflicting hypotheses to a quantitative test by evaluating CNNs and human observers on images with a texture-shape cue conflict. We show that ImageNet-trained CNNs are strongly biased towards recognising textures rather than shapes, which is in stark contrast to human behavioural evidence and reveals fundamentally different classification strategies. We then demonstrate that the same standard architecture (ResNet-50) that learns a texture-based representation on ImageNet is able to learn a shape-based representation instead when trained on "Stylized-ImageNet", a stylized version of ImageNet. This provides a much better fit for human behavioural performance in our well-controlled psychophysical lab setting (nine experiments totalling 48,560 psychophysical trials across 97 observers) and comes with a number of unexpected emergent benefits such as improved object detection performance and previously unseen robustness towards a wide range of image distortions, highlighting advantages of a shape-based representation.

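Motivation and Objectives

Assess whether ImageNet-trained CNNs rely more on texture or on shape when recognising objects.
Quantitatively compare the texture and shape biases of humans and CNNs using texture-shape cue-conflict stimuli.
Investigate whether training on Stylized-ImageNet can shift CNNs towards a shape-based representation.
Evaluate how shape bias affects classification performance, robustness to distortions, and transfer to other tasks.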

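Proposed Method

Generate texture-shape cue-conflict images with style transfer and compare human and CNN classifications on the same stimuli.
Train CNNs (ResNet-50, among others) on ImageNet (IN) and on Stylized-ImageNet (SIN) to measure the shift in bias.
Evaluate performance on original, greyscale, silhouette, edge, texture, and cue-conflict images.
Test joint training on SIN and IN to obtain a shape-biased model (Shape-ResNet); the architecture itself is unchanged.
Evaluate robustness to common distortions and corruptions, including ImageNet-C-style perturbations.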
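
The shape-bias figures quoted in this review are computed on cue-conflict images, where each stimulus carries one category in its shape and a different one in its texture. A minimal sketch of that metric follows, assuming shape bias is the fraction of shape decisions among all trials in which the observer chose either the shape or the texture category. The function name, the trial tuple layout, and the toy trials are illustrative assumptions, not the authors' code or data.

```python
from typing import Iterable, Tuple

# Hypothetical trial layout: (predicted category, shape category, texture category)
Trial = Tuple[str, str, str]


def shape_bias(trials: Iterable[Trial]) -> float:
    """Fraction of shape decisions among trials decided for shape or texture."""
    shape_hits = 0
    texture_hits = 0
    for predicted, shape_label, texture_label in trials:
        if shape_label == texture_label:
            continue  # not a cue conflict, so it says nothing about bias
        if predicted == shape_label:
            shape_hits += 1
        elif predicted == texture_label:
            texture_hits += 1
        # predictions matching neither cue are ignored
    decided = shape_hits + texture_hits
    if decided == 0:
        raise ValueError("no trial was decided for shape or texture")
    return shape_hits / decided


# Toy trials, invented for illustration only
toy_trials = [
    ("cat", "cat", "elephant"),       # shape decision
    ("elephant", "cat", "elephant"),  # texture decision
    ("elephant", "car", "elephant"),  # texture decision
    ("dog", "bottle", "clock"),       # neither cue, ignored
    ("clock", "bottle", "clock"),     # texture decision
]
print(shape_bias(toy_trials))
```

Trials answered with a category that matches neither cue are left out of the denominator, so the metric describes which cue wins when one of them does, not overall accuracy.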

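Experimental Results

Research Questions

RQ1: Do ImageNet-trained CNNs show a texture bias while human observers prefer shape?
RQ2: Does training on Stylized-ImageNet shift CNNs towards a shape-based representation and reduce texture bias?
RQ3: Do shape-biased models improve object detection and robustness to distortions compared with texture-biased models?
RQ4: Does combining SIN and IN data further improve accuracy and robustness, and how does it transfer to downstream tasks?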

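Key Findings

On cue-conflict images, ImageNet-trained CNNs show a strong texture bias, whereas humans rely mainly on shape.
Training on Stylized-ImageNet substantially increases the shape bias of CNNs (for ResNet-50, from 22% to 81%).
ImageNet-trained models generalise poorly to SIN, whereas features learned on SIN transfer well to ImageNet, which points to the benefit of a shape-focused representation.
Shape-ResNet (trained on SIN+IN, then fine-tuned on IN) reaches higher ImageNet top-1 and top-5 accuracy than the vanilla ResNet and performs better on object detection (Pascal VOC and MS COCO).
SIN-trained networks are more robust to a wide range of distortions, approaching or exceeding human-level performance on many of them.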
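
The robustness finding above compares accuracy on clean images with accuracy on distorted versions of the same images. The sketch below shows one simple way such a comparison could be tabulated, assuming one predicted label per image and per distortion. The helper names, distortion names, and toy labels are hypothetical and do not reproduce the paper's evaluation protocol or results.

```python
from typing import Dict, List


def accuracy(predictions: List[str], labels: List[str]) -> float:
    """Share of predictions that match the ground-truth labels."""
    if not labels or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and aligned")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def robustness_report(
    labels: List[str],
    clean_predictions: List[str],
    distorted_predictions: Dict[str, List[str]],
) -> Dict[str, Dict[str, float]]:
    """Per-distortion accuracy and its drop relative to clean accuracy."""
    clean_acc = accuracy(clean_predictions, labels)
    report = {}
    for name, predictions in distorted_predictions.items():
        acc = accuracy(predictions, labels)
        report[name] = {"accuracy": acc, "drop": clean_acc - acc}
    return report


# Toy labels and predictions, invented for illustration only
labels = ["a", "b", "c", "d"]
clean = ["a", "b", "c", "d"]
distorted = {
    "uniform_noise": ["a", "b", "x", "x"],
    "low_pass": ["a", "b", "c", "x"],
}
report = robustness_report(labels, clean, distorted)
for name, row in report.items():
    print(name, row)
```

Running the same report for an IN-trained and a SIN-trained model on identical stimuli would make the per-distortion differences directly comparable.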

This review was generated by AI and reviewed by a human editor.