QUICK REVIEW

[Paper Review] ViTKD: Practical Guidelines for ViT feature knowledge distillation

Zhendong Yang, Zhe Li | arXiv (Cornell University) | Sep 6, 2022

Advanced Neural Network Applications | Cited by 23

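One-sentence summary

This paper studies feature-based knowledge distillation for Vision Transformers (ViT), derives three practical guidelines, proposes ViTKD, shows consistent improvements on ImageNet-1k, and demonstrates that the method complements logit-based KD.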

ABSTRACT

Knowledge Distillation (KD) for Convolutional Neural Network (CNN) is extensively studied as a way to boost the performance of a small model. Recently, Vision Transformer (ViT) has achieved great success on many computer vision tasks and KD for ViT is also desired. However, besides the output logit-based KD, other feature-based KD methods for CNNs cannot be directly applied to ViT due to the huge structure gap. In this paper, we explore the way of feature-based distillation for ViT. Based on the nature of feature maps in ViT, we design a series of controlled experiments and derive three practical guidelines for ViT's feature distillation. Some of our findings are even opposite to the practices in the CNN era. Based on the three guidelines, we propose our feature-based method ViTKD which brings consistent and considerable improvement to the student. On ImageNet-1k, we boost DeiT-Tiny from 74.42% to 76.06%, DeiT-Small from 80.55% to 81.95%, and DeiT-Base from 81.76% to 83.46%. Moreover, ViTKD and the logit-based KD method are complementary and can be applied together directly. This combination can further improve the performance of the student. Specifically, the student DeiT-Tiny, Small, and Base achieve 77.78%, 83.59%, and 85.41%, respectively. The code is available at https://github.com/yzd-v/cls_KD.

Motivation and objectives

Understand feature-based knowledge distillation for ViT models, whose attention-based structure differs from that of CNNs.
Identify effective strategies for distilling ViT features across different layers and modules.
Develop a ViT-specific distillation method that yields consistent improvements on ImageNet-1k.
Demonstrate that ViTKD can be complementary to logit-based KD methods and beneficial for downstream tasks.

Proposed method

Analyze ViT feature maps and attention behaviors across layers to design distillation guidelines.
Investigate mimicking (linear layer alignment and correlation matrix) for shallow layers.
Investigate generation-based distillation (masking tokens and using generative blocks such as cross-attention, self-attention, or convolutional projector) for deep layers.
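Define ViTKD as the combination of shallow-layer mimicking and deep-layer generation, with total loss L = L_ori + alpha * L_lr + beta * L_gen, where L_ori is the original training loss, L_lr the shallow-layer mimicking loss, and L_gen the deep-layer generation loss.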
Use L2-based distillation losses on features and generated targets, with an adaptation layer where needed.
Provide implementation details including mask ratio lambda = 0.5 and hyperparameters alpha = 3e-5, beta = 3e-6 for ImageNet-1k experiments.
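
The loss described in the list above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names, the `toy_generator` stand-in for the generative block, the use of a sum-of-squares L2 loss over all tokens, and the random toy shapes are all assumptions made for the example. Only the weights alpha = 3e-5, beta = 3e-6 and the mask ratio 0.5 come from the review.

```python
import numpy as np

ALPHA = 3e-5       # weight of the shallow-layer mimicking loss (value from the review)
BETA = 3e-6        # weight of the deep-layer generation loss (value from the review)
MASK_RATIO = 0.5   # fraction of student tokens replaced by a mask token


def mimic_loss(student_feat, teacher_feat, proj):
    """Shallow layers: linearly align student tokens to the teacher width, then L2."""
    aligned = student_feat @ proj  # (N, D_s) @ (D_s, D_t) -> (N, D_t)
    return float(np.sum((aligned - teacher_feat) ** 2))


def mask_tokens(feat, mask_token, mask_ratio, rng):
    """Replace a random subset of token rows with mask_token; return copy and indices."""
    num_tokens = feat.shape[0]
    num_masked = int(mask_ratio * num_tokens)
    idx = rng.choice(num_tokens, size=num_masked, replace=False)
    masked = feat.copy()
    masked[idx] = mask_token
    return masked, idx


def toy_generator(tokens):
    """Hypothetical stand-in for the generative block: mix each token with the token mean."""
    return 0.5 * tokens + 0.5 * tokens.mean(axis=0, keepdims=True)


def generation_loss(student_feat, teacher_feat, proj, mask_token, generator,
                    mask_ratio, rng):
    """Deep layers: align, mask tokens, regenerate the full feature, then L2 to the teacher."""
    aligned = student_feat @ proj
    masked, _ = mask_tokens(aligned, mask_token, mask_ratio, rng)
    generated = generator(masked)
    return float(np.sum((generated - teacher_feat) ** 2))


def vitkd_total_loss(task_loss, l_lr, l_gen, alpha=ALPHA, beta=BETA):
    """L = L_ori + alpha * L_lr + beta * L_gen."""
    return task_loss + alpha * l_lr + beta * l_gen


# Toy usage with random features: 16 tokens, student width 8, teacher width 12.
rng = np.random.default_rng(0)
student = rng.normal(size=(16, 8))
teacher = rng.normal(size=(16, 12))
proj = rng.normal(size=(8, 12))    # adaptation layer (learned in practice)
mask_token = np.zeros(12)          # learned in practice; zeros here for simplicity

l_lr = mimic_loss(student, teacher, proj)
l_gen = generation_loss(student, teacher, proj, mask_token, toy_generator,
                        MASK_RATIO, rng)
total = vitkd_total_loss(task_loss=1.0, l_lr=l_lr, l_gen=l_gen)
```

The small values of alpha and beta make sense in this form because a summed L2 loss over all tokens and channels is much larger than a typical classification loss. In a real setup the generator would be one of the blocks listed above (cross-attention, self-attention, or a convolutional projector) rather than the token-mean mixer used here.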

Experimental results

Research questions

RQ1: Can ViT-specific feature distillation outperform CNN-based feature distillation when transferring knowledge to a smaller ViT student?
RQ2: Which layers (shallow vs deep) and which distillation mechanisms (mimicking vs generation) yield the most benefit for ViT feature distillation?
RQ3: Is ViTKD complementary to logit-based KD methods, and can their combination further improve performance?
RQ4: How do distillation strategies transfer to downstream tasks beyond image classification (e.g., object detection)?

Key findings

Teacher (Top-1 %)	Student	Type	Top-1 accuracy (%)	Top-5 accuracy (%)
DeiT-Small (80.69)	DeiT-Tiny	-	74.42	92.29
DeiT-Small (80.69)	DeiT-Tiny	Ours (feature)	75.40	92.66
DeiT-Small (80.69)	DeiT-Tiny	Ours+NKD (feature+logit)	76.18	93.14
DeiT III-Small* (82.76)	DeiT-Tiny	-	74.42	92.29
DeiT III-Small* (82.76)	DeiT-Tiny	Ours (feature)	76.06	93.16
DeiT III-Small* (82.76)	DeiT-Tiny	Ours+NKD (feature+logit)	77.78	93.97
DeiT III-Base* (85.48)	DeiT-Small	-	80.55	95.12
DeiT III-Base* (85.48)	DeiT-Small	Ours (feature)	81.95	95.64
DeiT III-Base* (85.48)	DeiT-Small	Ours+NKD (feature+logit)	83.59	96.69
DeiT III-Large* (86.81)	DeiT-Base	-	81.76	95.81
DeiT III-Large* (86.81)	DeiT-Base	Ours (feature)	83.46	96.41
DeiT III-Large* (86.81)	DeiT-Base	Ours+NKD (feature+logit)	85.41	97.39

Three practical guidelines emerge: (1) use mimicking for shallow layers and generation for deep layers; (2) between FFN-out and MHA-out features, FFN-out is the better distillation target; (3) shallow-layer knowledge is particularly beneficial for ViT distillation.
ViTKD improves DeiT-Tiny from 74.42% to 76.06% Top-1 on ImageNet-1k, DeiT-Small from 80.55% to 81.95%, and DeiT-Base from 81.76% to 83.46%.
When combined with logit-based KD (NKD), ViTKD yields further gains to 77.78%, 83.59%, and 85.41% Top-1 respectively for Tiny/Small/Base.
ViTKD-trained models also improve downstream tasks; for example, using a ViTKD-trained backbone with Mask R-CNN improves both box AP and mask AP on COCO.
Teachers with the same architecture as the student provide better guidance for ViTKD; cross-architecture teachers can degrade performance.
ViTKD demonstrates robustness to hyper-parameters alpha and beta and shows complementary gains with NKD across varied teacher-student pairs.
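
To make the size of these improvements easier to compare, the short script below computes the Top-1 gain over the baseline for each student. The accuracies are copied from the DeiT III teacher rows of the table above; the dictionary layout and the `gains` helper are illustrative names chosen for this example.

```python
# Top-1 accuracies (%) copied from the table above (DeiT III teachers), keyed by student.
results = {
    "DeiT-Tiny": {"baseline": 74.42, "feature": 76.06, "feature+logit": 77.78},
    "DeiT-Small": {"baseline": 80.55, "feature": 81.95, "feature+logit": 83.59},
    "DeiT-Base": {"baseline": 81.76, "feature": 83.46, "feature+logit": 85.41},
}


def gains(row):
    """Return the Top-1 gain in percentage points of each method over the baseline."""
    base = row["baseline"]
    return {name: round(acc - base, 2) for name, acc in row.items() if name != "baseline"}


for student, row in results.items():
    print(student, gains(row))
```

Feature distillation alone adds 1.64, 1.40, and 1.70 points for Tiny, Small, and Base; adding NKD raises the totals to 3.36, 3.04, and 3.65 points.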

This review was generated by AI and checked by a human editor.