QUICK REVIEW

[Paper Review] ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness

Robert Geirhos, Patricia Rubisch|arXiv (Cornell University)|Nov 29, 2018

Face Recognition and Perception | Citations: 664

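One-line summary

ImageNet-trained CNNs rely on texture rather than shape. Training on Stylized-ImageNet induces a shape bias, which improves accuracy and robustness and shows the advantages of a shape-based representation over a texture-based one.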

ABSTRACT

Convolutional Neural Networks (CNNs) are commonly thought to recognise objects by learning increasingly complex representations of object shapes. Some recent studies suggest a more important role of image textures. We here put these conflicting hypotheses to a quantitative test by evaluating CNNs and human observers on images with a texture-shape cue conflict. We show that ImageNet-trained CNNs are strongly biased towards recognising textures rather than shapes, which is in stark contrast to human behavioural evidence and reveals fundamentally different classification strategies. We then demonstrate that the same standard architecture (ResNet-50) that learns a texture-based representation on ImageNet is able to learn a shape-based representation instead when trained on "Stylized-ImageNet", a stylized version of ImageNet. This provides a much better fit for human behavioural performance in our well-controlled psychophysical lab setting (nine experiments totalling 48,560 psychophysical trials across 97 observers) and comes with a number of unexpected emergent benefits such as improved object detection performance and previously unseen robustness towards a wide range of image distortions, highlighting advantages of a shape-based representation.

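Motivation and objectives

Evaluate whether ImageNet-trained CNNs rely on shape or on texture when recognising objects.
Quantitatively compare the texture and shape biases of humans and CNNs using texture-shape cue-conflict stimuli.
Investigate whether training on Stylized-ImageNet shifts CNNs towards a shape-based representation.
Assess how shape bias affects classification performance, robustness to distortions, and transfer to other tasks.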

Proposed method

Use texture-shape cue-conflict images generated by style transfer to compare human and CNN classifications of identical stimuli.
Train CNNs (e.g. ResNet-50) on ImageNet (IN) and on Stylized-ImageNet (SIN), and evaluate how the bias changes.
Evaluate performance on original, greyscale, silhouette, edge, texture, and cue-conflict images.
Test joint training on SIN and IN to obtain a shape-enhanced model (Shape-ResNet).
Evaluate robustness to common distortions and corruptions, including ImageNet-C-style perturbations.
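
The bias comparison on cue-conflict images can be illustrated with a small Python sketch. It assumes shape bias is computed as the share of shape decisions among the trials in which the response matched either the shape category or the texture category. The trial tuple layout, the function name `shape_bias`, and the toy responses are hypothetical and are not taken from the paper's code or data.

```python
from typing import Iterable, Tuple

# Each trial is (predicted_label, shape_label, texture_label) for one
# cue-conflict image. This tuple layout is an assumption of the sketch.
Trial = Tuple[str, str, str]


def shape_bias(trials: Iterable[Trial]) -> float:
    """Fraction of shape decisions among trials where the prediction
    matched either the shape or the texture category."""
    shape_hits = 0
    texture_hits = 0
    for predicted, shape_label, texture_label in trials:
        if shape_label == texture_label:
            # Not a cue conflict: the response cannot separate the two cues.
            continue
        if predicted == shape_label:
            shape_hits += 1
        elif predicted == texture_label:
            texture_hits += 1
        # Predictions matching neither cue are ignored.
    decided = shape_hits + texture_hits
    if decided == 0:
        raise ValueError("no trial matched the shape or texture category")
    return shape_hits / decided


# Toy responses, made up for illustration (not data from the paper).
example_trials = [
    ("cat", "cat", "elephant"),       # shape decision
    ("elephant", "cat", "elephant"),  # texture decision
    ("clock", "bottle", "clock"),     # texture decision
    ("bear", "bear", "bicycle"),      # shape decision
    ("dog", "car", "boat"),           # neither cue, ignored
]
print(shape_bias(example_trials))  # 0.5
```

The same function can be applied to human responses and to model predictions on the same stimuli, which makes the two sets of results directly comparable.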

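Experimental results

Research questions

RQ1: Do ImageNet-trained CNNs show a texture bias, whereas human observers prefer shape?
RQ2: Can training on Stylized-ImageNet shift CNNs towards a shape-based representation and reduce their texture bias?
RQ3: Do shape-biased models improve object detection and robustness to distortions compared with texture-biased models?
RQ4: Does combining SIN and IN data further improve accuracy and robustness, and how does it transfer to downstream tasks?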

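Key findings

ImageNet-trained CNNs show a strong texture bias on cue-conflict images, whereas humans rely mainly on shape.
Training on Stylized-ImageNet dramatically increases the shape bias of CNNs (e.g. the shape bias of ResNet-50 rises from 22% to 81%).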
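A model trained on IN generalises poorly to SIN, whereas features learned on SIN transfer well to ImageNet, which points to the benefit of a shape-oriented representation.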
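Shape-ResNet (joint training on SIN and IN followed by fine-tuning on IN) achieves higher ImageNet top-1/top-5 accuracy than a standard ResNet and also improves object detection performance (Pascal VOC and MS COCO).
SIN-trained networks are more robust to a wide range of distortions, approaching or exceeding human-level robustness for many perturbations.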
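
A robustness comparison of this kind can be sketched as an accuracy-versus-distortion-strength curve. The Python sketch below uses additive uniform noise purely as an illustrative distortion. The helper names, the flat-list image format, and the toy brightness classifier are assumptions made for the example and do not reproduce the paper's evaluation pipeline.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

Image = List[float]  # flat list of pixel intensities in [0, 1] (assumed format)


def add_uniform_noise(image: Sequence[float], width: float,
                      rng: random.Random) -> Image:
    """Add noise drawn uniformly from [-width, width], then clip to [0, 1]."""
    return [min(1.0, max(0.0, p + rng.uniform(-width, width))) for p in image]


def accuracy_by_noise_width(
    predict: Callable[[Image], int],
    dataset: Sequence[Tuple[Image, int]],
    widths: Sequence[float],
    seed: int = 0,
) -> Dict[float, float]:
    """Top-1 accuracy of `predict` at each noise width."""
    if not dataset:
        raise ValueError("dataset must not be empty")
    rng = random.Random(seed)
    results: Dict[float, float] = {}
    for width in widths:
        correct = sum(
            predict(add_uniform_noise(image, width, rng)) == label
            for image, label in dataset
        )
        results[width] = correct / len(dataset)
    return results


# Toy stand-in classifier: "bright" (1) vs "dark" (0) by mean intensity.
def toy_predict(image: Image) -> int:
    return int(sum(image) / len(image) > 0.5)


toy_dataset = [([0.9] * 16, 1), ([0.1] * 16, 0),
               ([0.8] * 16, 1), ([0.2] * 16, 0)]
curve = accuracy_by_noise_width(toy_predict, toy_dataset,
                                widths=[0.0, 0.1, 0.9])
print(curve)
```

Running the same function with an IN-trained and a SIN-trained model in place of `toy_predict` would give two curves that can be compared level by level.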


This review was generated by AI and checked by a human editor.