QUICK REVIEW

[Paper Review] TinyViT: Fast Pretraining Distillation for Small Vision Transformers

Kan Wu, Jinnian Zhang | arXiv (Cornell University) | 2022-07-21

Advanced Neural Network Applications | Citations: 22

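One-Line Summary

TinyViT introduces a distillation framework that quickly pretrains small vision transformers by transferring knowledge from a large pretrained teacher, achieving strong performance on ImageNet and downstream tasks with far fewer parameters.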

ABSTRACT

Vision transformer (ViT) recently has drawn great attention in computer vision due to its remarkable model capability. However, most prevailing ViT models suffer from huge number of parameters, restricting their applicability on devices with limited resources. To alleviate this issue, we propose TinyViT, a new family of tiny and efficient small vision transformers pretrained on large-scale datasets with our proposed fast distillation framework. The central idea is to transfer knowledge from large pretrained models to small ones, while enabling small models to get the dividends of massive pretraining data. More specifically, we apply distillation during pretraining for knowledge transfer. The logits of large teacher models are sparsified and stored in disk in advance to save the memory cost and computation overheads. The tiny student transformers are automatically scaled down from a large pretrained model with computation and parameter constraints. Comprehensive experiments demonstrate the efficacy of TinyViT. It achieves a top-1 accuracy of 84.8% on ImageNet-1k with only 21M parameters, being comparable to Swin-B pretrained on ImageNet-21k while using 4.2 times fewer parameters. Moreover, increasing image resolutions, TinyViT can reach 86.5% accuracy, being slightly better than Swin-L while using only 11% parameters. Last but not the least, we demonstrate a good transfer ability of TinyViT on various downstream tasks. Code and models are available at https://github.com/microsoft/Cream/tree/main/TinyViT.

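Motivation and Goals

Motivate the development of efficient small vision transformers for resource-constrained devices.
Let small ViTs benefit from large-scale pretraining data through distillation.
Reduce the memory and compute cost of training with pretraining distillation.
Propose a scalable framework for pretraining and compressing small ViTs while preserving transfer performance.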

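Proposed Method

To make pretraining distillation fast, remove the repeated teacher forward passes by storing sparse teacher logits and data augmentation metadata on disk.
Train the small student ViT with a distillation loss, using sparse soft labels recovered from the stored teacher outputs.
Use a label-free distillation setting: the student is supervised by the teacher's soft predictions rather than by ground-truth labels.
Generate the TinyViT model family by progressively contracting a large seed ViT under parameter and throughput constraints.
Adopt a hierarchical, Swin-like architecture with window attention and MBConv blocks to balance accuracy and efficiency.
Pretrain on ImageNet-21k and fine-tune on ImageNet-1k, optionally fine-tuning at a higher resolution to improve accuracy.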
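
The store-then-recover idea in the first three items can be illustrated with a minimal sketch in plain Python. It assumes that the top-k teacher probabilities are what gets stored, that the leftover probability mass is spread evenly over the remaining classes at recovery time, and that augmentation is replayed from a single stored seed. All function names and the augmentation parameters are hypothetical and are not taken from the TinyViT codebase.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sparsify(teacher_logits, k):
    """Keep the k largest teacher probabilities as (class_index, prob) pairs.

    Assumption: probabilities are stored; the exact on-disk format is not
    described in this review.
    """
    probs = softmax(teacher_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return [(i, probs[i]) for i in top]


def recover(sparse, num_classes):
    """Rebuild a dense soft label from the stored pairs.

    Assumption: leftover probability mass is shared equally by the classes
    that were not stored.
    """
    kept = dict(sparse)
    others = num_classes - len(kept)
    fill = (1.0 - sum(kept.values())) / others if others > 0 else 0.0
    return [kept.get(c, fill) for c in range(num_classes)]


def distill_loss(student_logits, soft_label):
    """Cross-entropy of the student prediction against the teacher soft label."""
    student = softmax(student_logits)
    return -sum(t * math.log(s) for t, s in zip(soft_label, student) if t > 0)


def replay_augmentation(seed):
    """Re-create augmentation parameters from one stored integer seed.

    Hypothetical parameters: a horizontal-flip flag and a crop scale.
    """
    rng = random.Random(seed)
    return {"flip": rng.random() < 0.5, "crop_scale": rng.uniform(0.08, 1.0)}


teacher_logits = [2.0, 0.5, 0.1, -1.0, 3.0]
stored = sparsify(teacher_logits, k=2)          # what would be written to disk
soft_label = recover(stored, num_classes=5)     # what the student trains against
loss = distill_loss([1.5, 0.2, 0.0, -0.5, 2.5], soft_label)
```

Because the student only needs the stored pairs and the seed, the teacher never has to be loaded during student training. No ground-truth label appears anywhere in the loss.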

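Experimental Results

Research Questions

RQ1: Can small vision transformers achieve competitive performance by distilling knowledge from large pretrained models during pretraining?
RQ2: What distillation method makes learning from a large teacher fast and scalable, without the memory and time overhead?
RQ3: How does pretraining distillation affect TinyViT's transferability to downstream tasks?
RQ4: How does progressive model contraction affect TinyViT's accuracy/efficiency trade-off?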

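Key Results

TinyViT-21M reaches 84.8% top-1 accuracy on ImageNet-1k after IN-21k pretraining followed by 30 epochs of IN-1k fine-tuning.
At higher input resolution, TinyViT reaches 86.5% top-1, slightly surpassing Swin-L while using only about 11% of its parameters.
When pretrained with distillation on IN-21k, TinyViT-21M transfers well to downstream tasks: it reaches 50.2 AP on COCO object detection, 2.1 points higher than Swin-T, which has 28M parameters.
The fast pretraining distillation framework stores sparse teacher logits and encodes the data augmentation, which enables large-batch distillation without loading the teacher during training.
Using higher-quality teachers (e.g., Florence, CLIP-ViT-L/14) further improves TinyViT's performance, while disk-based logits keep the training cost practical.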
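
To give a feel for why sparse logits keep disk usage practical, the following back-of-the-envelope sketch compares dense and top-k storage. Every number in it is an illustrative assumption (image count, epoch count, k, and byte widths), not a figure reported in the paper. It also assumes one stored entry per image per epoch, since each epoch sees a different augmented view.

```python
def dense_bytes(num_images, num_classes, epochs, bytes_per_value=4):
    """Storage for one full logit vector per image per epoch."""
    return num_images * num_classes * epochs * bytes_per_value


def sparse_bytes(num_images, k, epochs, bytes_per_value=2, bytes_per_index=2):
    """Storage for k (index, value) pairs per image per epoch."""
    return num_images * k * epochs * (bytes_per_value + bytes_per_index)


# Illustrative settings, not figures from the paper.
NUM_IMAGES = 13_000_000   # assumed rough size of a large pretraining set
NUM_CLASSES = 21_841      # assumed class count; fits in a 2-byte index
EPOCHS = 90               # assumed
TOP_K = 100               # assumed

dense = dense_bytes(NUM_IMAGES, NUM_CLASSES, EPOCHS)
sparse = sparse_bytes(NUM_IMAGES, TOP_K, EPOCHS)
print(f"dense:  {dense / 1e12:.1f} TB")
print(f"sparse: {sparse / 1e9:.1f} GB")
print(f"ratio:  {dense / sparse:.1f}x")
```

With equal bytes per stored entry, the saving is simply the class count divided by k, so the benefit grows with the size of the label space.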

This review was generated by AI and checked by a human editor.