QUICK REVIEW

[Paper Review] ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness

Robert Geirhos, Patricia Rubisch|arXiv (Cornell University)|Nov 29, 2018

Face Recognition and Perception | Cited by 818

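One-sentence summary

This paper shows that CNNs trained on ImageNet rely on texture more than on shape, introduces Stylized-ImageNet to induce shape-based representations, and demonstrates gains in accuracy and robustness, including improved object detection under transfer learning.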

ABSTRACT

Convolutional Neural Networks (CNNs) are commonly thought to recognise objects by learning increasingly complex representations of object shapes. Some recent studies suggest a more important role of image textures. We here put these conflicting hypotheses to a quantitative test by evaluating CNNs and human observers on images with a texture-shape cue conflict. We show that ImageNet-trained CNNs are strongly biased towards recognising textures rather than shapes, which is in stark contrast to human behavioural evidence and reveals fundamentally different classification strategies. We then demonstrate that the same standard architecture (ResNet-50) that learns a texture-based representation on ImageNet is able to learn a shape-based representation instead when trained on "Stylized-ImageNet", a stylized version of ImageNet. This provides a much better fit for human behavioural performance in our well-controlled psychophysical lab setting (nine experiments totalling 48,560 psychophysical trials across 97 observers) and comes with a number of unexpected emergent benefits such as improved object detection performance and previously unseen robustness towards a wide range of image distortions, highlighting advantages of a shape-based representation.

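Motivation and objectives

Quantify the texture and shape biases of CNNs and human observers using images in which texture and shape cues conflict.
Show that training on Stylized-ImageNet can shift CNNs towards shape-based representations.
Evaluate the robustness and transfer performance of shape-biased models across a broad range of tasks and distortions.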

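Proposed method

Create texture-shape cue-conflict images with style transfer and compare how humans and CNNs classify them.
Train CNNs on Stylized-ImageNet, which suppresses texture cues and encourages shape-based representations.
Evaluate several architectures on cue-conflict images to measure their shape versus texture bias.
Compare models trained on ImageNet (IN), models trained on Stylized-ImageNet (SIN), and the Shape-ResNet variants to test robustness to a variety of distortions and degradations.
Analyse transfer performance on Pascal VOC 2007 and MS COCO, using the trained networks as backbones for Faster R-CNN.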
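The shape versus texture bias measured on cue-conflict images can be illustrated with a short sketch. It assumes the common definition of shape bias as the fraction of shape decisions among all trials where the prediction matched either the shape category or the texture category; predictions matching neither are ignored. The function name `shape_bias` and the label format are hypothetical and are not taken from the paper's code.

```python
from typing import Sequence


def shape_bias(
    predictions: Sequence[str],
    shape_labels: Sequence[str],
    texture_labels: Sequence[str],
) -> float:
    """Fraction of shape decisions among shape-or-texture decisions.

    Assumption: each trial is a cue-conflict image whose shape and
    texture come from different categories. Trials where both cues
    agree are skipped, as are predictions matching neither cue.
    """
    shape_hits = 0
    texture_hits = 0
    for pred, shape, texture in zip(predictions, shape_labels, texture_labels):
        if shape == texture:
            continue  # not a cue conflict, so it says nothing about bias
        if pred == shape:
            shape_hits += 1
        elif pred == texture:
            texture_hits += 1
    decided = shape_hits + texture_hits
    if decided == 0:
        raise ValueError("no prediction matched a shape or texture label")
    return shape_hits / decided


if __name__ == "__main__":
    # Four cat-shaped images rendered with elephant texture (toy data).
    preds = ["cat", "elephant", "dog", "cat"]
    shapes = ["cat"] * 4
    textures = ["elephant"] * 4
    print(f"shape bias: {shape_bias(preds, shapes, textures):.2f}")
```

Under this definition, a value near 1 means the classifier mostly follows shape and a value near 0 means it mostly follows texture, so human observers and CNNs can be placed on the same scale.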

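Experimental results

Research questions

RQ1: Do ImageNet-trained CNNs rely on texture in preference to shape, compared with humans?
RQ2: Can training on Stylized-ImageNet shift a CNN's representation from texture towards shape?
RQ3: Does a shape-based representation improve robustness to distortions and transfer performance in object detection?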

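Key findings

On cue-conflict images, humans show a shape bias, whereas CNNs show a strong texture bias.
A ResNet-50 trained on Stylized-ImageNet shifts substantially towards shape bias (up to 81%), approaching human-level bias in many categories.
SIN-trained models are more robust on the distortion and corruption benchmarks, and in some conditions match or exceed human performance.
Incorporating SIN into training (as in Shape-ResNet) improves ImageNet top-1/top-5 accuracy and raises object detection mAP50 on Pascal VOC 2007 and MS COCO.
Joint training on SIN and IN, with optional fine-tuning on IN, gives the best overall detection results (75.1 mAP50 on Pascal VOC 2007; 55.2 mAP50 on MS COCO).
Shape-based representations learned on SIN generalise well to ImageNet and improve transfer across datasets.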


This review was generated by AI and checked by a human editor.