QUICK REVIEW

[Paper Review] Revisiting Unreasonable Effectiveness of Data in Deep Learning Era

Chen Sun, Abhinav Shrivastava | arXiv (Cornell University) | 2017-07-10

Advanced Neural Network Applications | References: 40 | Citations: 303

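One-line summary

The paper scales pre-training data to 300 million images (JFT-300M) to study how pre-training dataset size affects visual representations. It shows that performance improves logarithmically with data volume, that higher-capacity models benefit more, and that the resulting models set new state-of-the-art results on several tasks.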

ABSTRACT

The success of deep learning in vision can be attributed to: (a) models with high capacity; (b) increased computational power; and (c) availability of large-scale labeled data. Since 2012, there have been significant advances in representation capabilities of the models and computational capabilities of GPUs. But the size of the biggest dataset has surprisingly remained constant. What will happen if we increase the dataset size by 10x or 100x? This paper takes a step towards clearing the clouds of mystery surrounding the relationship between `enormous data' and visual deep learning. By exploiting the JFT-300M dataset which has more than 375M noisy labels for 300M images, we investigate how the performance of current vision tasks would change if this data was used for representation learning. Our paper delivers some surprising (and some expected) findings. First, we find that the performance on vision tasks increases logarithmically based on volume of training data size. Second, we show that representation learning (or pre-training) still holds a lot of promise. One can improve performance on many vision tasks by just training a better base model. Finally, as expected, we present new state-of-the-art results for different vision tasks including image classification, object detection, semantic segmentation and human pose estimation. Our sincere hope is that this inspires vision community to not undervalue the data and develop collective efforts in building larger datasets.

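Motivation and goals

Evaluate how increasing pre-training data size affects visual representation learning across tasks (classification, detection, segmentation, pose estimation).
Characterize the relationship between data volume and performance, including when higher-capacity models are used.
Demonstrate that pre-training on large, noisy web-crawled data achieves state-of-the-art results.
Analyze how model capacity, number of classes, and data quality affect transfer learning performance.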

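Proposed method

Train a 101-layer ResNet (ResNet-101) on JFT-300M, which has 18,291 labels and roughly 20% label noise.
Pre-train on JFT-300M, then fine-tune or evaluate the representations on the ImageNet, COCO, PASCAL VOC, and COCO Pose benchmarks.
Because the data is multi-label, use a per-label logistic loss and introduce a label hierarchy to fill in missing labels.
Evaluate representations both as frozen feature extractors and by fine-tuning from JFT-300M initialization.
Compare against ImageNet baselines and run ablations over data size, number of classes, and model capacity.
Train asynchronously across 50 GPUs using Downpour SGD and parameter servers.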
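The per-label logistic loss and hierarchy-based label fill-in mentioned above can be illustrated with a small sketch. This is not the paper's implementation: the `PARENT` hierarchy, the label names, and both function names are hypothetical, and the sketch assumes that "filling in missing labels" means adding every ancestor of an annotated label as an extra positive.

```python
import math

# Hypothetical child -> parent label hierarchy (illustrative names only).
PARENT = {"siamese_cat": "cat", "cat": "animal", "dog": "animal"}

def expand_with_ancestors(labels, parent=PARENT):
    """Return the label set plus every ancestor reachable through `parent`."""
    expanded = set(labels)
    for label in labels:
        node = label
        while node in parent:
            node = parent[node]
            expanded.add(node)
    return expanded

def per_label_logistic_loss(logits, positives):
    """Sum of independent binary cross-entropy terms, one per label.

    logits: dict mapping label name -> raw score.
    positives: set of labels treated as present; all others are negatives.
    """
    total = 0.0
    for label, z in logits.items():
        y = 1.0 if label in positives else 0.0
        # Numerically stable form of -[y*log(s(z)) + (1-y)*log(1-s(z))].
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total

# Usage: one image annotated only with the most specific label.
logits = {"siamese_cat": 2.0, "cat": 1.0, "animal": 0.5, "dog": -1.5}
positives = expand_with_ancestors({"siamese_cat"})
loss = per_label_logistic_loss(logits, positives)
```

Each label gets its own sigmoid term, so an image can carry several positives at once, which a softmax over mutually exclusive classes would not allow.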

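Experimental results

Research questions

RQ1: Does increasing pre-training data size improve performance on vision tasks when high-capacity models are used?
RQ2: Does representation quality grow logarithmically or linearly with data volume, and how does it scale with model capacity?
RQ3: How do the number of classes and label noise affect transfer learning performance?
RQ4: Do larger base models benefit more from large-scale datasets?
RQ5: Which contributes more to downstream task performance: data quality (noise) or data quantity?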

Key results

Method	mAP@0.5	mAP@[0.5,0.95]
He et al. [16]	53.3	32.2
ImageNet	53.6	34.3
300M	56.9	36.7
ImageNet+300M	58.0	37.4
Inception ResNet [38]	56.3	35.5
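To make the size of the gains easier to read, the rows of the table above can be differenced directly. The snippet below is only arithmetic on the reported numbers; the `results` dictionary and the `gain_over` helper are names introduced here for illustration.

```python
# Rows copied from the table above: (mAP@0.5, mAP@[0.5,0.95]).
results = {
    "He et al. [16]": (53.3, 32.2),
    "ImageNet": (53.6, 34.3),
    "300M": (56.9, 36.7),
    "ImageNet+300M": (58.0, 37.4),
    "Inception ResNet [38]": (56.3, 35.5),
}

def gain_over(baseline, method, table=results):
    """Absolute improvement of `method` over `baseline` on both metrics, rounded to 0.1."""
    return tuple(round(m - b, 1) for m, b in zip(table[method], table[baseline]))

print(gain_over("ImageNet", "300M"))           # (3.3, 2.4)
print(gain_over("ImageNet", "ImageNet+300M"))  # (4.4, 3.1)
```

Relative to the ImageNet row, the 300M row is higher by 3.3 points at mAP@0.5 and 2.4 points at mAP@[0.5,0.95], and the ImageNet+300M row is higher by 4.4 and 3.1 points.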

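Performance on vision tasks improves as pre-training data grows, increasing logarithmically with dataset size.
Better representation learning on large-scale data substantially improves downstream tasks such as detection, segmentation, and pose estimation.
Model capacity matters: higher-capacity models (e.g., ResNet-152) benefit more from the 300M data.
Training on long-tailed data does not hinder convergence and still improves accuracy.
JFT-300M pre-training sets new state-of-the-art results on COCO detection, PASCAL VOC, semantic segmentation, and human pose estimation.
Fine-tuning from JFT-300M initialization outperforms ImageNet initialization on several benchmarks.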
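A logarithmic relationship between data size and performance means that each 10x increase in data adds a roughly constant number of points. The sketch below shows one way to check for such a trend by fitting score = a + b * log10(size) with ordinary least squares. The data points are synthetic, generated from a known formula, and are not measurements from the paper; `fit_log_linear` is a hypothetical helper name.

```python
import math

def fit_log_linear(sizes, scores):
    """Least-squares fit of score = a + b * log10(size); returns (a, b)."""
    xs = [math.log10(n) for n in sizes]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

# Synthetic points generated from score = 10 + 3 * log10(size); NOT values from the paper.
sizes = [10_000_000, 30_000_000, 100_000_000, 300_000_000]
scores = [10 + 3 * math.log10(n) for n in sizes]

a, b = fit_log_linear(sizes, scores)
# b is the gain in score per 10x increase in dataset size.
```

With real measurements, a good fit of this form (and a poor fit of a straight line in raw size) would be consistent with the logarithmic trend described above.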

This review was generated by AI and reviewed by a human editor.