QUICK REVIEW

[Paper Review] ViTKD: Practical Guidelines for ViT feature knowledge distillation

Zhendong Yang, Zhe Li | arXiv (Cornell University) | 2022-09-06

Advanced Neural Network Applications | Citations: 23

One-line summary

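This paper studies feature-based knowledge distillation for Vision Transformers (ViT), derives three practical guidelines from controlled experiments, proposes ViTKD on that basis, and reports consistent improvements on ImageNet-1k that are complementary to logit-based KD.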

ABSTRACT

Knowledge Distillation (KD) for Convolutional Neural Network (CNN) is extensively studied as a way to boost the performance of a small model. Recently, Vision Transformer (ViT) has achieved great success on many computer vision tasks and KD for ViT is also desired. However, besides the output logit-based KD, other feature-based KD methods for CNNs cannot be directly applied to ViT due to the huge structure gap. In this paper, we explore the way of feature-based distillation for ViT. Based on the nature of feature maps in ViT, we design a series of controlled experiments and derive three practical guidelines for ViT's feature distillation. Some of our findings are even opposite to the practices in the CNN era. Based on the three guidelines, we propose our feature-based method ViTKD which brings consistent and considerable improvement to the student. On ImageNet-1k, we boost DeiT-Tiny from 74.42% to 76.06%, DeiT-Small from 80.55% to 81.95%, and DeiT-Base from 81.76% to 83.46%. Moreover, ViTKD and the logit-based KD method are complementary and can be applied together directly. This combination can further improve the performance of the student. Specifically, the student DeiT-Tiny, Small, and Base achieve 77.78%, 83.59%, and 85.41%, respectively. The code is available at https://github.com/yzd-v/cls_KD.

Motivation and objectives

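Motivate and understand feature-based knowledge distillation for ViT models, whose attention-based structure differs from that of CNNs.
Identify effective strategies for ViT feature distillation across different layers and modules.
Develop a ViT-specific distillation method that yields consistent improvements on ImageNet-1k.
Show that ViTKD is complementary to logit-based KD methods and benefits downstream tasks.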

Proposed method

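Analyze layer-wise ViT feature maps and attention behavior to design the distillation guidelines.
Investigate mimicking for shallow layers (alignment through a linear layer, or correlation matrices).
Investigate generation-based distillation for deep layers (masking tokens and regenerating features with a generative block such as cross-attention, self-attention, or a convolutional projector).
Define ViTKD as the combination of mimicking on shallow layers and generation on deep layers, with total loss L = L_ori + alpha L_lr + beta L_gen.
Use L2-based distillation losses on the features and the generation targets, with an adaptation layer where needed.
Provide implementation details, including mask ratio lambda = 0.5 and hyperparameters alpha = 3e-5, beta = 3e-6 for the ImageNet-1k experiments.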
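
The loss composition listed above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation (see the linked repository for that): the helper names (`l2_loss`, `linear`, `mask_tokens`, `vitkd_style_loss`) are hypothetical, features are plain token-by-channel lists, the L2 loss is taken as a summed squared error, and a single linear map stands in for the generative block. Only alpha, beta, and the mask ratio come from the review.

```python
import random

# Values reported in the review for the ImageNet-1k experiments.
ALPHA, BETA, MASK_RATIO = 3e-5, 3e-6, 0.5

def l2_loss(pred, target):
    """Summed squared error between two token-by-channel feature maps (assumed reduction)."""
    return sum((p - t) ** 2
               for p_row, t_row in zip(pred, target)
               for p, t in zip(p_row, t_row))

def linear(tokens, weight):
    """Map each token (length d_in) through a d_in x d_out weight matrix."""
    return [[sum(x * w for x, w in zip(tok, col)) for col in zip(*weight)]
            for tok in tokens]

def mask_tokens(tokens, ratio, mask_token, rng):
    """Replace a random `ratio` share of the tokens with a mask token."""
    n_masked = int(len(tokens) * ratio)
    masked_idx = set(rng.sample(range(len(tokens)), n_masked))
    return [list(mask_token) if i in masked_idx else list(tok)
            for i, tok in enumerate(tokens)]

def vitkd_style_loss(task_loss, s_shallow, t_shallow, s_deep, t_deep,
                     align_w, gen_w, mask_token, rng):
    """Return L_ori + alpha * L_lr + beta * L_gen for one sample."""
    # Mimicking (shallow layer): align student features to the teacher width, then L2.
    loss_lr = l2_loss(linear(s_shallow, align_w), t_shallow)
    # Generation (deep layer): mask student tokens, regenerate, compare to the teacher.
    masked = mask_tokens(s_deep, MASK_RATIO, mask_token, rng)
    generated = linear(masked, gen_w)  # stand-in for the generative block
    loss_gen = l2_loss(generated, t_deep)
    return task_loss + ALPHA * loss_lr + BETA * loss_gen

if __name__ == "__main__":
    rng = random.Random(0)
    feats = [[1.0, 0.0] for _ in range(4)]   # 4 tokens, 2 channels
    identity = [[1.0, 0.0], [0.0, 1.0]]
    print(vitkd_style_loss(1.0, feats, feats, feats, feats,
                           identity, identity, [0.0, 0.0], rng))
```

In the toy call, student and teacher features are identical, so the mimicking term is zero and the only distillation signal comes from the two masked tokens that the stand-in generator fails to restore. The small values of alpha and beta are consistent with a summed (not averaged) feature loss, but that reading is an assumption here.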

Experimental results

Research questions

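RQ1: Can ViT-specific feature distillation outperform CNN-based feature distillation when transferring knowledge to small ViT students?
RQ2: Which layers (shallow vs. deep) and which distillation mechanisms (mimicking vs. generation) benefit ViT feature distillation the most?
RQ3: Is ViTKD complementary to logit-based KD methods, and does combining them improve performance further?
RQ4: How does the distillation strategy transfer to downstream tasks beyond image classification, e.g., object detection?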

Key results

Teacher	Student	Type	Top-1 accuracy (%)	Top-5 accuracy (%)
DeiT-Small (80.69)	DeiT-Tiny	-	74.42	92.29
DeiT-Small (80.69)	DeiT-Tiny	Ours (feature)	75.40	92.66
DeiT-Small (80.69)	DeiT-Tiny	Ours+NKD (feature+logit)	76.18	93.14
DeiT III-Small* (82.76)	DeiT-Tiny	-	74.42	92.29
DeiT III-Small* (82.76)	DeiT-Tiny	Ours (feature)	76.06	93.16
DeiT III-Small* (82.76)	DeiT-Tiny	Ours+NKD (feature+logit)	77.78	93.97
DeiT III-Base* (85.48)	DeiT-Small	-	80.55	95.12
DeiT III-Base* (85.48)	DeiT-Small	Ours (feature)	81.95	95.64
DeiT III-Base* (85.48)	DeiT-Small	Ours+NKD (feature+logit)	83.59	96.69
DeiT III-Large* (86.81)	DeiT-Base	-	81.76	95.81
DeiT III-Large* (86.81)	DeiT-Base	Ours (feature)	83.46	96.41
DeiT III-Large* (86.81)	DeiT-Base	Ours+NKD (feature+logit)	85.41	97.39

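Three practical guidelines are derived: use mimicking for shallow layers and generation for deep layers; between FFN-out and MHA-out features, FFN-out is the better target for distillation; and shallow-layer knowledge is especially beneficial for ViT distillation.
ViTKD improves DeiT-Tiny from 74.42% to 76.06% Top-1, DeiT-Small from 80.55% to 81.95%, and DeiT-Base from 81.76% to 83.46%.
Combined with logit-based KD (NKD), ViTKD yields further gains, reaching 77.78%, 83.59%, and 85.41% Top-1 for Tiny/Small/Base, respectively.
Models trained with ViTKD also improve on downstream tasks; for example, COCO AP box and AP mask improve when the model is used with Mask-RCNN.
ViTKD provides better guidance when the teacher and student share the same architecture; a cross-architecture teacher can degrade performance.
ViTKD is robust to the hyperparameters α and β and shows gains complementary to NKD across various teacher-student pairs.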
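
The absolute Top-1 gains implied by the results table can be recomputed with a short script. The numbers are copied from the table; the row layout and the `gains` helper are illustrative choices, not part of the paper.

```python
# (teacher, student, baseline Top-1, ViTKD Top-1, ViTKD+NKD Top-1), copied from the table.
ROWS = [
    ("DeiT-Small", "DeiT-Tiny", 74.42, 75.40, 76.18),
    ("DeiT III-Small*", "DeiT-Tiny", 74.42, 76.06, 77.78),
    ("DeiT III-Base*", "DeiT-Small", 80.55, 81.95, 83.59),
    ("DeiT III-Large*", "DeiT-Base", 81.76, 83.46, 85.41),
]

def gains(rows):
    """Return {(teacher, student): (feature gain, feature+logit gain)} in Top-1 points."""
    return {(teacher, student): (round(feat - base, 2), round(both - base, 2))
            for teacher, student, base, feat, both in rows}

if __name__ == "__main__":
    for (teacher, student), (feat_gain, both_gain) in gains(ROWS).items():
        print(f"{teacher} -> {student}: "
              f"+{feat_gain:.2f} (feature), +{both_gain:.2f} (feature+logit)")
```

With the DeiT III teachers, feature distillation alone adds 1.64, 1.40, and 1.70 points for Tiny, Small, and Base, and the combination with NKD adds 3.36, 3.04, and 3.65 points over the same baselines.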

This review was generated by AI and reviewed by a human editor.