QUICK REVIEW

[Paper Review] The Lottery Tickets Hypothesis for Supervised and Self-supervised Pre-training in Computer Vision Models

Tianlong Chen, Jonathan Frankle | arXiv (Cornell University) | December 12, 2020

Advanced Neural Network Applications | References: 87 | Citations: 31

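One-line summary

This paper investigates whether matching subnetworks exist inside pre-trained computer vision models (supervised and self-supervised) and can transfer to diverse downstream tasks without loss of performance. It demonstrates universally transferable tickets at high sparsity across classification, detection, and segmentation.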

ABSTRACT

The computer vision world has been re-gaining enthusiasm in various pre-trained models, including both classical ImageNet supervised pre-training and recently emerged self-supervised pre-training such as simCLR and MoCo. Pre-trained weights often boost a wide range of downstream tasks including classification, detection, and segmentation. Latest studies suggest that pre-training benefits from gigantic model capacity. We are hereby curious and ask: after pre-training, does a pre-trained model indeed have to stay large for its downstream transferability? In this paper, we examine supervised and self-supervised pre-trained models through the lens of the lottery ticket hypothesis (LTH). LTH identifies highly sparse matching subnetworks that can be trained in isolation from (nearly) scratch yet still reach the full models' performance. We extend the scope of LTH and question whether matching subnetworks still exist in pre-trained computer vision models, that enjoy the same downstream transfer performance. Our extensive experiments convey an overall positive message: from all pre-trained weights obtained by ImageNet classification, simCLR, and MoCo, we are consistently able to locate such matching subnetworks at 59.04% to 96.48% sparsity that transfer universally to multiple downstream tasks, whose performance see no degradation compared to using full pre-trained weights. Further analyses reveal that subnetworks found from different pre-training tend to yield diverse mask structures and perturbation sensitivities. We conclude that the core LTH observations remain generally relevant in the pre-training paradigm of computer vision, but more delicate discussions are needed in some cases. Codes and pre-trained models will be made available at: https://github.com/VITA-Group/CV_LTH_Pre-training.

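Research motivation and goals

Assess whether pre-trained CV models contain matching subnetworks that preserve downstream transfer performance.
Determine whether universal subnetworks exist that transfer across diverse downstream tasks (classification, detection, segmentation).
Compare subnetworks obtained from supervised and self-supervised pre-training in terms of transferability and structural sensitivity.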

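Proposed method

Treat the pre-trained weights as the initialization of the subnetwork.
Apply iterative magnitude pruning (IMP) to identify matching subnetworks.
Define a matching subnetwork as one whose transfer performance, under the same training recipe, equals or exceeds that of the full pre-trained model.
Evaluate subnetwork transferability across multiple downstream tasks and datasets (classification, detection, segmentation).
Analyze mask diversity and perturbation sensitivity across pre-training types (ImageNet, simCLR, MoCo).
Explore how larger pre-trained models and temperature settings affect transferability.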

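Experimental results

Research questions

RQ1: Do winning tickets found on the pre-training task also act as winning tickets on downstream tasks?
RQ2: Do universal, transferable subnetworks exist across diverse downstream tasks when initialized from different pre-training schemes?
RQ3: How do subnetworks from supervised and self-supervised pre-training compare in transferability and mask structure?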

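Key results

Winning tickets exist at 67.23%, 59.04%, and 95.60% sparsity for supervised ImageNet, simCLR, and MoCo pre-training, respectively.
Subnetworks obtained from pre-training transfer universally to downstream classification tasks on CIFAR-10, CIFAR-100, SVHN, and Fashion-MNIST at roughly 86.58%–91.41% sparsity, whereas VisDA2017 requires more capacity (roughly 59.04%–67.23% sparsity).
Subnetworks transferred from pre-training can outperform subnetworks found directly on the downstream task (e.g., at 95.60%/93.13%/97.75% sparsity for detection and segmentation).
Subnetworks from MoCo pre-training transfer best to detection and segmentation, while ImageNet and simCLR show different strengths depending on the downstream task and sparsity.
Subnetworks identified from different pre-training types show diverse mask structures and perturbation sensitivities: after five rounds of IMP, mask overlap between different types is below 6.55%.
Pruning a larger pre-trained model yields better transferable subnetworks for self-supervised pre-training (simCLR), as shown by the ResNet-50 versus ResNet-152 comparison on CIFAR-100.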
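
The sparsity figures quoted above are consistent with an IMP schedule in which every round removes a fixed fraction of the weights that are still alive, so that sparsity after k rounds is 1 - (1 - rate)^k. The short sketch below checks this under the assumption of a 20% rate; that rate is not stated in this review and is used here only because it reproduces the reported numbers. The helper names `sparsity_after` and `rounds_for` are hypothetical.

```python
def sparsity_after(rounds, rate=0.2):
    """Fraction of weights removed after `rounds` prunings, each removing
    `rate` of the weights still alive (rate=0.2 is an assumption)."""
    return 1.0 - (1.0 - rate) ** rounds

def rounds_for(target_pct, rate=0.2, max_rounds=50):
    """Smallest round count whose sparsity matches target_pct to two
    decimal places (in percent); None if no round matches."""
    for k in range(1, max_rounds + 1):
        if abs(100.0 * sparsity_after(k, rate) - target_pct) < 0.005:
            return k
    return None

# Sparsity levels mentioned in the abstract and the key results.
for pct in (59.04, 67.23, 86.58, 91.41, 93.13, 95.60, 96.48, 97.75):
    print(f"{pct:.2f}% sparsity -> round {rounds_for(pct)}")
```

Under this assumption, 67.23% corresponds to five pruning rounds, which lines up with the "five rounds of IMP" mentioned for the mask-overlap measurement.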

This review was generated by AI and reviewed by a human editor.