QUICK REVIEW

[Paper Review] Channel Distillation: Channel-Wise Attention for Knowledge Distillation

Zaida Zhou, Chaoran Zhuge|arXiv (Cornell University)|Jun 2, 2020

Advanced Neural Network Applications|References: 36|Citations: 40

One-sentence summary

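This paper introduces Channel Distillation (CD), which transfers channel-wise attention from the teacher to the student; Guided Knowledge Distillation (GKD), which distills only from teacher outputs that are correctly predicted; and the Early Decay Teacher (EDT) strategy, which gradually decays the influence of distillation. The authors report state-of-the-art results on ImageNet and a strong improvement on CIFAR100.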

ABSTRACT

Knowledge distillation is to transfer the knowledge from the data learned by the teacher network to the student network, so that the student has the advantage of less parameters and less calculations, and the accuracy is close to the teacher. In this paper, we propose a new distillation method, which contains two transfer distillation strategies and a loss decay strategy. The first transfer strategy is based on channel-wise attention, called Channel Distillation (CD). CD transfers the channel information from the teacher to the student. The second is Guided Knowledge Distillation (GKD). Unlike Knowledge Distillation (KD), which allows the student to mimic each sample's prediction distribution of the teacher, GKD only enables the student to mimic the correct output of the teacher. The last part is Early Decay Teacher (EDT). During the training process, we gradually decay the weight of the distillation loss. The purpose is to enable the student to gradually control the optimization rather than the teacher. Our proposed method is evaluated on ImageNet and CIFAR100. On ImageNet, we achieve 27.68% of top-1 error with ResNet18, which outperforms state-of-the-art methods. On CIFAR100, we achieve surprising result that the student outperforms the teacher. Code is available at https://github.com/zhouzaida/channel-distillation.

Motivation and objectives

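Improve knowledge distillation by transferring channel-wise attention rather than raw feature representations.
Propose Channel Distillation (CD), which transfers channel attention from the teacher to the student.
Introduce Guided Knowledge Distillation (GKD), which uses only correctly predicted teacher outputs as guidance.
Incorporate Early Decay Teacher (EDT), which gradually decays the influence of distillation during training.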

Proposed method

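Channel Distillation (CD): transfers channel attention by aligning the teacher's and student's channel weights, using a loss based on per-channel weights w_c computed from the feature maps.
Guided Knowledge Distillation (GKD): applies the KD loss only to samples the teacher predicts correctly, ignoring incorrect teacher predictions.
Early Decay Teacher (EDT): gradually decays the weight of the distillation loss as training progresses.
Loss formulation: Loss(s,t) = EDT(alpha) * CD(s,t) + GKD(s,t) + CE(s,y), where s and t denote the student and teacher, y is the ground-truth label, CE is the cross-entropy loss, and EDT(alpha) is the decayed weight on the CD term.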
Ablation and evaluation on ImageNet (ResNet34→ResNet18) and CIFAR100 (ResNet152→ResNet50) against KD, FitNets, AT, RKD, and more.
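The loss terms listed above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the review only says that CD uses a w_c-based loss, so computing w_c as the spatial average of each channel and comparing teacher and student with a mean squared difference are assumptions here, as are the temperature value, the choice to average GKD over the retained samples, and all function names (`channel_weights`, `cd_loss`, `gkd_loss`, `ce_loss`, `total_loss`).

```python
import math

def channel_weights(feature_map):
    """Average each channel of a C x H x W nested-list feature map (assumed pooling)."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]

def cd_loss(student_maps, teacher_maps):
    """Mean squared gap between student and teacher channel weights over a batch."""
    gaps = [
        (w_s - w_t) ** 2
        for s_map, t_map in zip(student_maps, teacher_maps)
        for w_s, w_t in zip(channel_weights(s_map), channel_weights(t_map))
    ]
    return sum(gaps) / len(gaps)

def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in scaled))
    return [z - log_norm for z in scaled]

def gkd_loss(student_logits, teacher_logits, labels, temperature=4.0):
    """KL(teacher || student), averaged over samples the teacher classifies correctly."""
    total, kept = 0.0, 0
    for s, t, y in zip(student_logits, teacher_logits, labels):
        teacher_pred = max(range(len(t)), key=lambda k: t[k])
        if teacher_pred != y:
            continue  # ignore samples where the teacher is wrong
        log_p_t = log_softmax(t, temperature)
        log_p_s = log_softmax(s, temperature)
        total += sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_p_t, log_p_s))
        kept += 1
    return total / kept if kept else 0.0

def ce_loss(student_logits, labels):
    """Mean cross-entropy between student predictions and ground-truth labels."""
    return -sum(log_softmax(s)[y] for s, y in zip(student_logits, labels)) / len(labels)

def total_loss(cd, gkd, ce, alpha):
    """Combine the terms as in the review's formula; alpha is the decayed CD weight."""
    return alpha * cd + gkd + ce
```

In this sketch a batch whose teacher predictions are all wrong contributes nothing to the GKD term, which is the filtering behaviour the review describes; a real training setup would compute the same quantities on tensors with automatic differentiation.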

Experimental results

Research questions

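RQ1: Does transferring channel-wise attention with CD improve student performance over conventional KD?
RQ2: Can GKD, which distills only from correctly predicted teacher outputs, suppress negative transfer?
RQ3: Does gradually decaying the influence of distillation affect the local optimum reached and the final accuracy?
RQ4: How does the proposed CD+GKD+EDT compare with existing distillation methods on a large-scale dataset (ImageNet) and a small-scale dataset (CIFAR100)?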

Key findings

Method	Model	Top-1 error (%)	Top-5 error (%)	GFLOPs
Teacher	ResNet34	26.73	8.74	3.672
Student	ResNet18	30.43	10.76	1.820
KD	ResNet34-ResNet18	29.50	9.52	1.820
FitNets	ResNet34-ResNet18	29.34	10.77	1.820
AT	ResNet34-ResNet18	29.30	10.00	1.820
RKD	ResNet34-ResNet18	28.46	9.74	1.820
CD+GKD+EDT (ours)	ResNet34-ResNet18	27.61	9.20	1.820
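To make the size of the improvement concrete, the short script below computes absolute top-1 error reductions from the table. It uses only the numbers listed above; the dictionary layout and the helper name `top1_gain_over` are illustrative choices.

```python
# Top-1 / top-5 error (%) copied from the table above.
results = {
    "Teacher": (26.73, 8.74),
    "Student": (30.43, 10.76),
    "KD": (29.50, 9.52),
    "FitNets": (29.34, 10.77),
    "AT": (29.30, 10.00),
    "RKD": (28.46, 9.74),
    "CD+GKD+EDT": (27.61, 9.20),
}

def top1_gain_over(baseline, method="CD+GKD+EDT"):
    """Absolute top-1 error reduction (percentage points) of `method` vs `baseline`."""
    return round(results[baseline][0] - results[method][0], 2)

for name in ("Student", "KD", "RKD"):
    print(f"vs {name}: {top1_gain_over(name):.2f} points lower top-1 error")
print(f"remaining gap to teacher: {-top1_gain_over('Teacher'):.2f} points")
```

By these figures, CD+GKD+EDT lowers top-1 error by 2.82 points relative to the undistilled student, 1.89 points relative to KD, and 0.85 points relative to RKD, leaving a 0.88-point gap to the ResNet34 teacher.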

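On ImageNet, CD is reported to outperform standard KD; the abstract cites a top-1 error of 27.68% for ResNet18.
Adding GKD and EDT on top of CD improves the result further: CD+GKD+EDT reaches 27.61% top-1 error and 9.20% top-5 error, the best among the ImageNet distillation methods listed in the table.
On CIFAR100, a ResNet50 student distilled with CD can, under certain conditions, outperform its ResNet152 teacher.
GKD reduces negative transfer by taking in knowledge only from the teacher's correct predictions.
EDT gradually decays the distillation weight, so the student increasingly refines its own optimization.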
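The review does not state the exact decay schedule that EDT uses. As a rough sketch only, the snippet below assumes a per-epoch exponential decay; the function name `edt_weight` and the values `initial_weight=1.0` and `decay_rate=0.95` are hypothetical and chosen purely for illustration.

```python
def edt_weight(epoch, initial_weight=1.0, decay_rate=0.95):
    """Distillation-loss weight at `epoch` under an assumed exponential decay."""
    return initial_weight * decay_rate ** epoch

# The weight shrinks monotonically, so the supervised loss dominates later in training.
for epoch in (0, 10, 30, 60):
    print(f"epoch {epoch:2d}: distillation weight = {edt_weight(epoch):.4f}")
```

Any monotonically decreasing schedule would show the same qualitative behaviour: the teacher's signal steers early optimization, and the student's own loss takes over as the weight approaches zero.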


This review was generated by AI and checked by a human editor.