QUICK REVIEW

[Paper Review] Meta Knowledge Distillation

Jihao Liu, Boxiao Liu|arXiv (Cornell University)|Feb 16, 2022

Advanced Neural Network Applications | Cited by 20

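One-Line Summary

Meta Knowledge Distillation (MKD) meta-learns the distillation temperatures of the teacher and the student, which mitigates the degradation of knowledge distillation and improves ViT performance on ImageNet-1K without additional data.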

ABSTRACT

Recent studies pointed out that knowledge distillation (KD) suffers from two degradation problems, the teacher-student gap and the incompatibility with strong data augmentations, making it not applicable to training state-of-the-art models, which are trained with advanced augmentations. However, we observe that a key factor, i.e., the temperatures in the softmax functions for generating probabilities of both the teacher and student models, was mostly overlooked in previous methods. With properly tuned temperatures, such degradation problems of KD can be much mitigated. However, instead of relying on a naive grid search, which shows poor transferability, we propose Meta Knowledge Distillation (MKD) to meta-learn the distillation with learnable meta temperature parameters. The meta parameters are adaptively adjusted during training according to the gradients of the learning objective. We validate that MKD is robust to different dataset scales, different teacher/student architectures, and different types of data augmentation. With MKD, we achieve the best performance with popular ViT architectures among compared methods that use only ImageNet-1K as training data, ranging from tiny to large models. With ViT-L, we achieve 86.5% with 600 epochs of training, 0.6% better than MAE that trains for 1,650 epochs.
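To make the role of the softmax temperature concrete, here is a minimal sketch in plain Python. The function names and the example logits are illustrative assumptions, not taken from the paper. It shows only that dividing logits by a larger temperature yields a flatter probability distribution, which is the quantity MKD tunes for both teacher and student.

```python
import math

def softmax_with_temperature(logits, tau):
    """Turn logits into probabilities after dividing by a temperature tau > 0."""
    if tau <= 0:
        raise ValueError("tau must be positive")
    scaled = [z / tau for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; higher means a flatter distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

logits = [4.0, 1.0, 0.5]  # illustrative values, not from the paper
sharp = softmax_with_temperature(logits, tau=0.5)  # low temperature: peaked
soft = softmax_with_temperature(logits, tau=4.0)   # high temperature: flat

print("tau=0.5:", [round(p, 4) for p in sharp], "entropy", round(entropy(sharp), 4))
print("tau=4.0:", [round(p, 4) for p in soft], "entropy", round(entropy(soft), 4))
```

A teacher with a high temperature exposes more of its ranking over non-target classes, while a low temperature approaches a one-hot label; which setting helps depends on the student and the augmentation, which is why the paper learns the temperatures instead of fixing them.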

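Motivation and Objectives

Identify why standard KD degrades when strong data augmentation and large teachers are used.
Propose a meta-learning framework that adaptively sets the distillation temperatures of the teacher and the student.
Demonstrate the robustness of MKD to dataset scale, architecture, and augmentation.
Demonstrate the effectiveness of MKD for Vision Transformers (ViT) compared with prior methods that use ImageNet-1K.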

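Proposed Method

Formulate KD with separate temperatures for the teacher and the student (tau_t, tau_s).
Introduce meta parameters that optimize these temperatures online through a meta objective on a validation set.
Perform a one-step look-ahead update of the student, then backpropagate the validation loss to update the meta parameters.
Update the student with the newly learned temperatures.
Optionally model the temperatures with a small network (a temperature prediction network) for faster adaptation.
Provide an alternative meta objective that focuses on misclassified samples.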
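The look-ahead procedure listed above can be sketched on a toy problem. This is a rough illustration under stated assumptions, not the authors' implementation: the student is a tiny linear classifier, the teacher is represented by fixed logits, the temperatures are plain scalars clamped to a minimum of 0.1, and the gradient of the validation loss with respect to the temperatures is approximated by central finite differences instead of backpropagation through the look-ahead step. All function names, data, and hyperparameters are hypothetical. The temperature prediction network and the misclassified-sample objective are not shown.

```python
import math

def softmax(logits, tau=1.0):
    scaled = [z / tau for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def student_logits(w, x):
    # w holds one weight row per class; x is a feature list.
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def kd_grad(w, batch, tau_s, tau_t):
    """Gradient w.r.t. w of the mean cross-entropy between
    softmax(z_t / tau_t) and softmax(z_s / tau_s)."""
    grad = [[0.0] * len(w[0]) for _ in w]
    for x, z_t in batch:
        p_t = softmax(z_t, tau_t)
        q_s = softmax(student_logits(w, x), tau_s)
        for c in range(len(w)):
            g = (q_s[c] - p_t[c]) / tau_s  # d loss / d z_s[c]
            for d in range(len(x)):
                grad[c][d] += g * x[d] / len(batch)
    return grad

def sgd_step(w, grad, lr):
    return [[wi - lr * gi for wi, gi in zip(rw, rg)] for rw, rg in zip(w, grad)]

def val_loss(w, val_batch):
    """Mean hard-label cross-entropy at temperature 1 (the meta objective here)."""
    losses = [-math.log(softmax(student_logits(w, x))[y]) for x, y in val_batch]
    return sum(losses) / len(losses)

def mkd_step(w, train_batch, val_batch, tau_s, tau_t,
             lr=0.5, meta_lr=0.05, eps=1e-4):
    # Validation loss after a one-step look-ahead update of the student.
    def lookahead(ts, tt):
        w_next = sgd_step(w, kd_grad(w, train_batch, ts, tt), lr)
        return val_loss(w_next, val_batch)
    # Assumption: finite differences stand in for backprop through the step.
    g_s = (lookahead(tau_s + eps, tau_t) - lookahead(tau_s - eps, tau_t)) / (2 * eps)
    g_t = (lookahead(tau_s, tau_t + eps) - lookahead(tau_s, tau_t - eps)) / (2 * eps)
    tau_s = max(0.1, tau_s - meta_lr * g_s)
    tau_t = max(0.1, tau_t - meta_lr * g_t)
    # Real student update with the newly adjusted temperatures.
    w = sgd_step(w, kd_grad(w, train_batch, tau_s, tau_t), lr)
    return w, tau_s, tau_t

# Toy data: (features, teacher logits) for training, (features, label) for validation.
train_batch = [([1.0, 0.0], [3.0, 0.5, -1.0]),
               ([0.0, 1.0], [-1.0, 0.5, 3.0]),
               ([1.0, 1.0], [0.0, 2.0, 0.0])]
val_batch = [([1.0, 0.1], 0), ([0.1, 1.0], 2), ([0.9, 1.1], 1)]

w = [[0.0, 0.0] for _ in range(3)]
tau_s, tau_t = 1.0, 1.0
initial_loss = val_loss(w, val_batch)
for _ in range(20):
    w, tau_s, tau_t = mkd_step(w, train_batch, val_batch, tau_s, tau_t)
final_loss = val_loss(w, val_batch)
print("tau_s", round(tau_s, 3), "tau_t", round(tau_t, 3),
      "val loss", round(initial_loss, 3), "->", round(final_loss, 3))
```

In a real setting the hypergradient would be obtained by automatic differentiation through the look-ahead update, and the conventional tau-squared scaling of the KD loss, omitted here for brevity, may matter for the step size.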

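Experimental Results

Research Questions

RQ1: Can adaptive teacher and student temperatures in KD mitigate the teacher-student gap and the incompatibility with strong augmentation?
RQ2: Does MKD improve ViTs and other architectures on ImageNet-1K using only the standard data?
RQ3: Are separate temperatures for the teacher and the student better than shared or grid-searched values?
RQ4: How robust is MKD to dataset size, teacher/student architecture, and augmentation type?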

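Key Findings

Properly tuned temperatures can substantially mitigate the KD degradation caused by strong augmentation and the capacity gap.
MKD outperforms grid-searched temperatures and standard KD on the CIFAR-100 and ImageNet-1K benchmarks.
For ViT architectures trained from scratch on ImageNet-1K, MKD reaches 86.5% top-1 with ViT-L (versus a previously reported 85.15%).
Across student sizes, MKD improves on previous ViT distillation methods by 2.0 to 4.2 points.
Using the temperature prediction network improves both adaptation speed and final performance.
Learning tau_s and tau_t separately and jointly gives the best results among the meta-learning configurations tested.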

This review was generated by AI and checked by a human editor.