QUICK REVIEW

[Paper Review] Partial success in closing the gap between human and machine vision

Robert Geirhos, Kantharaju Narayanappa|arXiv (Cornell University)|2021. 06. 14.

Domain Adaptation and Few-Shot Learning | References: 94 | Citations: 63

One-Line Summary

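This study shows that modern models, particularly data-rich and transformer-based ones, increasingly match or exceed human robustness to OOD distortions, yet humans and machines still differ in their image-level error patterns. A large-scale psychophysical benchmark comprising 17 OOD datasets and 85,120 trials evaluates diverse model families to quantify progress toward human-like vision.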

ABSTRACT

A few years ago, the first CNN surpassed human performance on ImageNet. However, it soon became clear that machines lack robustness on more challenging test cases, a major obstacle towards deploying machines "in the wild" and towards obtaining better computational models of human visual perception. Here we ask: Are we making progress in closing the gap between human and machine vision? To answer this question, we tested human observers on a broad range of out-of-distribution (OOD) datasets, recording 85,120 psychophysical trials across 90 participants. We then investigated a range of promising machine learning developments that crucially deviate from standard supervised CNNs along three axes: objective function (self-supervised, adversarially trained, CLIP language-image training), architecture (e.g. vision transformers), and dataset size (ranging from 1M to 1B). Our findings are threefold. (1.) The longstanding distortion robustness gap between humans and CNNs is closing, with the best models now exceeding human feedforward performance on most of the investigated OOD datasets. (2.) There is still a substantial image-level consistency gap, meaning that humans make different errors than models. In contrast, most models systematically agree in their categorisation errors, even substantially different ones like contrastive self-supervised vs. standard supervised models. (3.) In many cases, human-to-model consistency improves when training dataset size is increased by one to three orders of magnitude. Our results give reason for cautious optimism: While there is still much room for improvement, the behavioural difference between human and machine vision is narrowing. In order to measure future progress, 17 OOD datasets with image-level human behavioural data and evaluation code are provided as a toolbox and benchmark at: https://github.com/bethgelab/model-vs-human/

Motivation and Objectives

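Assess whether the robustness gap between human and machine vision is narrowing on out-of-distribution data.
Evaluate how various ML developments (objective function, architecture, dataset size) affect human-machine alignment.
Provide a benchmark toolbox and datasets for tracking future progress in this area.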

Proposed Method

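Collected 85,120 psychophysical trials from 90 human observers on 17 OOD datasets designed to test distortion robustness.
Compared 52 models, including standard CNNs, self-supervised models, adversarially trained models, vision transformers, and models trained on large-scale or noisy-label data.
Evaluated models on OOD accuracy and three alignment metrics: accuracy difference A(m), observed consistency O(m), and error consistency E(m).
Released the model-vs-human benchmark toolbox so that new models can be compared against the human data.
Mapped the 1,000 ImageNet classes to 16 categories via the WordNet hierarchy so that human and model responses are directly comparable.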
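The consistency metrics named above can be illustrated with a minimal sketch. It assumes that each observer's responses on a shared set of images are reduced to a correct/incorrect flag per trial, and that error consistency is Cohen's kappa computed on those flags (observed agreement corrected for the agreement expected from the two accuracies alone). The function names and toy data are hypothetical and are not taken from the model-vs-human toolbox; the accuracy-difference function is a single-condition simplification, since this review does not state how the paper aggregates over datasets and conditions.

```python
from typing import Sequence


def _check(a: Sequence[int], b: Sequence[int]) -> None:
    # Both observers must have responded to the same, non-empty set of trials.
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("inputs must be non-empty and of equal length")


def squared_accuracy_difference(a: Sequence[int], b: Sequence[int]) -> float:
    """Squared difference in accuracy on one condition (simplified A(m))."""
    _check(a, b)
    return (sum(a) / len(a) - sum(b) / len(b)) ** 2


def observed_consistency(a: Sequence[int], b: Sequence[int]) -> float:
    """Fraction of trials where both are correct or both are wrong."""
    _check(a, b)
    return sum(x == y for x, y in zip(a, b)) / len(a)


def error_consistency(a: Sequence[int], b: Sequence[int]) -> float:
    """Cohen's kappa on correct/incorrect flags."""
    _check(a, b)
    p_a = sum(a) / len(a)  # accuracy of observer a
    p_b = sum(b) / len(b)  # accuracy of observer b
    # Agreement expected if the two erred independently at these accuracies.
    c_exp = p_a * p_b + (1 - p_a) * (1 - p_b)
    c_obs = observed_consistency(a, b)
    if c_exp == 1.0:
        # Kappa is undefined here; returning 0.0 is a choice made for this sketch.
        return 0.0
    return (c_obs - c_exp) / (1 - c_exp)


# Toy data (hypothetical): 1 = correct, 0 = incorrect, same eight images.
human = [1, 1, 0, 0, 1, 0, 1, 1]
model = [1, 1, 0, 1, 0, 0, 1, 1]

print(squared_accuracy_difference(human, model))
print(observed_consistency(human, model))
print(error_consistency(human, model))
```

In this toy case both observers have the same accuracy, so the accuracy difference is zero, yet they err on partly different images. That is the distinction between matching accuracy and matching image-level error patterns that the review draws.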

Experimental Results

Research Questions

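RQ1: Are modern ML models closing the distortion robustness gap with humans across a broad range of OOD conditions?
RQ2: How do objective function, architecture, and training dataset size affect human-machine alignment across images?
RQ3: Under OOD conditions, do machines and humans share error patterns on individual images, or do they differ?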

Key Findings

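The best models trained on large-scale data match or exceed human feedforward accuracy on most of the OOD datasets.
A substantial image-level consistency gap remains: models and humans often err on different images, although data-rich models can narrow this gap on some datasets.
Self-supervised models show limited robustness gains over supervised baselines, and the notable improvements are attributable mainly to the choice of data augmentation.
Adversarially trained models gain robustness but may become more vulnerable to non-adversarial perturbations and may show a stronger texture bias.
Vision transformers and large-scale data substantially improve OOD performance, and CLIP achieves near-human error patterns on some metrics.
The paper provides 17 OOD datasets and a toolbox for benchmarking future progress and quantifying human-machine behavioural alignment.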

This review was generated by AI and reviewed by a human editor.