QUICK REVIEW

[Paper Review] ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness

Robert Geirhos, Patricia Rubisch|arXiv (Cornell University)|2018. 11. 29.

Face Recognition and Perception | Citations: 818

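One-line summary

This paper shows that ImageNet-trained CNNs rely on texture rather than shape, introduces Stylized-ImageNet to induce a shape-based representation, and demonstrates gains in accuracy and robustness, including object detection via transfer learning.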

ABSTRACT

Convolutional Neural Networks (CNNs) are commonly thought to recognise objects by learning increasingly complex representations of object shapes. Some recent studies suggest a more important role of image textures. We here put these conflicting hypotheses to a quantitative test by evaluating CNNs and human observers on images with a texture-shape cue conflict. We show that ImageNet-trained CNNs are strongly biased towards recognising textures rather than shapes, which is in stark contrast to human behavioural evidence and reveals fundamentally different classification strategies. We then demonstrate that the same standard architecture (ResNet-50) that learns a texture-based representation on ImageNet is able to learn a shape-based representation instead when trained on "Stylized-ImageNet", a stylized version of ImageNet. This provides a much better fit for human behavioural performance in our well-controlled psychophysical lab setting (nine experiments totalling 48,560 psychophysical trials across 97 observers) and comes with a number of unexpected emergent benefits such as improved object detection performance and previously unseen robustness towards a wide range of image distortions, highlighting advantages of a shape-based representation.

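Motivation and objectives

Quantify the texture versus shape bias of CNNs and humans using images whose texture and shape cues conflict.
Show that Stylized-ImageNet can shift CNNs towards a shape-based representation.
Evaluate the robustness and transfer performance of shape-biased models across tasks and distortions.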

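Proposed method

Create texture-shape cue-conflict images via style transfer and compare human and CNN classifications on them.
Train CNNs on Stylized-ImageNet to suppress texture cues and promote a shape-based representation.
Evaluate cue-conflict performance across several architectures to measure shape versus texture bias.
Compare models trained on ImageNet (IN), on Stylized-ImageNet (SIN), and the Shape-ResNet variant to test robustness to distortions and corruptions.
Use the trained networks as the backbone of Faster R-CNN to analyse transfer performance on Pascal VOC 2007 and MS COCO.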
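
The shape versus texture bias measured on cue-conflict images can be sketched as follows. This is a minimal illustration, assuming shape bias is the fraction of shape decisions among all trials in which the prediction matched either the shape category or the texture category. The function name `shape_bias` and the trial tuple layout are hypothetical and are not taken from the authors' code.

```python
from typing import Iterable, Tuple

# One trial: (predicted label, shape category, texture category).
Trial = Tuple[str, str, str]


def shape_bias(trials: Iterable[Trial]) -> float:
    """Fraction of shape decisions among shape-or-texture decisions.

    Assumption: trials where the prediction matches neither cue are
    ignored, and images whose two cues agree are not cue conflicts.
    """
    shape_hits = 0
    texture_hits = 0
    for predicted, shape_label, texture_label in trials:
        if shape_label == texture_label:
            continue  # no conflict between cues, so the trial is uninformative
        if predicted == shape_label:
            shape_hits += 1
        elif predicted == texture_label:
            texture_hits += 1
    decided = shape_hits + texture_hits
    if decided == 0:
        raise ValueError("no trial matched the shape or the texture category")
    return shape_hits / decided


# Toy data: a cat-shaped image rendered with elephant texture, and so on.
example_trials = [
    ("cat", "cat", "elephant"),       # shape decision
    ("elephant", "cat", "elephant"),  # texture decision
    ("clock", "dog", "clock"),        # texture decision
    ("bottle", "car", "bottle"),      # texture decision
    ("chair", "dog", "clock"),        # neither cue, ignored
]
example_bias = shape_bias(example_trials)
```

On the toy data above, one of four decided trials follows shape, so `example_bias` is 0.25. A texture-biased model yields values near 0 and a shape-biased one values near 1.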

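Experimental results

Research questions

RQ1: Do ImageNet-trained CNNs favour texture over shape, in contrast to humans?
RQ2: Can training on Stylized-ImageNet shift CNN representations from texture to shape?
RQ3: Does a shape-based representation improve robustness to distortions and transfer performance in object detection?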

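Key results

Humans show a shape bias on cue-conflict images, whereas CNNs show a strong texture bias.
A ResNet-50 trained on Stylized-ImageNet shifts strongly towards shape (up to 81% shape bias) and approaches human-level bias for many categories.
Models trained on SIN are more robust on distortion and corruption benchmarks, and in some conditions match or exceed human performance.
Incorporating SIN into training (as in Shape-ResNet) improves ImageNet top-1/top-5 accuracy and raises object detection mAP50 on Pascal VOC 2007 and MS COCO.
Joint training on SIN and IN, followed by fine-tuning on IN, gives the best overall detection performance (Pascal VOC 2007: 75.1 mAP50; MS COCO: 55.2 mAP50).
Shape-based representations learned on SIN generalise to ImageNet and improve cross-dataset transfer performance.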
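
A robustness comparison of the kind summarised above can be sketched as accuracy measured at increasing distortion strength. This is only an illustration, assuming additive uniform noise on pixel intensities in [0, 1] with clipping. The helper names, the toy brightness classifier, and the noise widths are hypothetical and do not reproduce the paper's exact protocol.

```python
import random
from typing import Callable, Dict, List, Sequence

Image = List[float]  # flat list of pixel intensities in [0, 1]


def add_uniform_noise(image: Image, width: float, rng: random.Random) -> Image:
    """Add noise drawn from [-width, width] to each pixel, clipped to [0, 1]."""
    return [min(1.0, max(0.0, p + rng.uniform(-width, width))) for p in image]


def accuracy_by_noise_level(
    classify: Callable[[Image], str],
    images: Sequence[Image],
    labels: Sequence[str],
    widths: Sequence[float],
    seed: int = 0,
) -> Dict[float, float]:
    """Return classification accuracy for each noise width."""
    if not images or len(images) != len(labels):
        raise ValueError("images and labels must be non-empty and equal in length")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    results: Dict[float, float] = {}
    for width in widths:
        correct = sum(
            classify(add_uniform_noise(image, width, rng)) == label
            for image, label in zip(images, labels)
        )
        results[width] = correct / len(images)
    return results


def toy_classifier(image: Image) -> str:
    """Hypothetical stand-in for a trained network: thresholds mean brightness."""
    return "bright" if sum(image) / len(image) > 0.5 else "dark"


toy_images = [[0.9] * 4, [0.1] * 4]
toy_labels = ["bright", "dark"]
curve = accuracy_by_noise_level(toy_classifier, toy_images, toy_labels, [0.0, 0.2])
```

In practice `classify` would wrap an IN-trained or SIN-trained network, and the resulting accuracy curves for the two models would be compared at each noise width.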

This review was generated by AI and checked by a human editor.