QUICK REVIEW

[Paper Review] Meta Knowledge Distillation

Jihao Liu, Boxiao Liu|arXiv (Cornell University)|Feb 16, 2022

Advanced Neural Network Applications | Cited by 20

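One-Sentence Summary

TL;DR: Meta Knowledge Distillation (MKD) meta-learns the distillation temperatures of both the teacher and the student to mitigate the degradation of knowledge distillation, improving ViT performance on ImageNet-1K without extra data.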

ABSTRACT

Recent studies pointed out that knowledge distillation (KD) suffers from two degradation problems, the teacher-student gap and the incompatibility with strong data augmentations, making it not applicable to training state-of-the-art models, which are trained with advanced augmentations. However, we observe that a key factor, i.e., the temperatures in the softmax functions for generating probabilities of both the teacher and student models, was mostly overlooked in previous methods. With properly tuned temperatures, such degradation problems of KD can be much mitigated. However, instead of relying on a naive grid search, which shows poor transferability, we propose Meta Knowledge Distillation (MKD) to meta-learn the distillation with learnable meta temperature parameters. The meta parameters are adaptively adjusted during training according to the gradients of the learning objective. We validate that MKD is robust to different dataset scales, different teacher/student architectures, and different types of data augmentation. With MKD, we achieve the best performance with popular ViT architectures among compared methods that use only ImageNet-1K as training data, ranging from tiny to large models. With ViT-L, we achieve 86.5% with 600 epochs of training, 0.6% better than MAE that trains for 1,650 epochs.
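
For readers who want the temperatures made concrete, one common way to write a distillation loss with separate teacher and student temperatures is given below. This is a sketch under assumptions: the symbols z^t and z^s (teacher and student logits) and the plain KL form are illustrative, and the paper's exact loss, weighting, and any scaling factors may differ.

```latex
p^{t}_{i} = \frac{\exp\left(z^{t}_{i} / \tau_t\right)}{\sum_{j} \exp\left(z^{t}_{j} / \tau_t\right)},
\qquad
p^{s}_{i} = \frac{\exp\left(z^{s}_{i} / \tau_s\right)}{\sum_{j} \exp\left(z^{s}_{j} / \tau_s\right)},
\qquad
\mathcal{L}_{\mathrm{KD}}(\tau_t, \tau_s) = \sum_{i} p^{t}_{i} \log \frac{p^{t}_{i}}{p^{s}_{i}}
```

A larger tau_t flattens the teacher distribution and a larger tau_s flattens the student distribution, which is why the two values can be tuned, or learned, independently.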

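Motivation and Objectives

Identify why standard KD degrades when strong data augmentation and higher-capacity teacher models are used.
Propose a meta-learning framework that adaptively sets the distillation temperatures of the teacher and the student.
Demonstrate MKD's robustness across dataset scales, architectures, and data augmentations.
Show MKD's effectiveness on Vision Transformers (ViT) relative to prior methods trained on ImageNet-1K.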

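Proposed Method

Formulate KD with separate temperatures for the teacher and the student (tau_t, tau_s).
Introduce meta parameters that optimize these temperatures online through a meta objective on a validation set.
Perform one pre-update of the student, then update the meta parameters by backpropagating the validation loss.
Update the student using the newly learned temperatures.
Optionally model the temperatures with a small network (temperature prediction network) for faster adaptation.
Provide an alternative meta objective that focuses on misclassified samples.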
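
The pre-update, meta-update, and student-update steps above can be sketched in a few lines of dependency-free Python. This is a toy illustration under stated assumptions, not the authors' implementation: the student is a hypothetical linear classifier, the KD loss is a plain KL divergence, and the meta gradient is approximated with central finite differences instead of backpropagation through the pre-update. All function names, the clamp at 0.1, and the hyperparameters are made up for the example.

```python
import math

def softmax(logits, tau=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / tau for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, tau_s, tau_t):
    """KL(teacher || student) with separate temperatures (assumed loss form)."""
    p_t = softmax(teacher_logits, tau_t)
    p_s = softmax(student_logits, tau_s)
    return sum(pt * (math.log(pt) - math.log(ps))
               for pt, ps in zip(p_t, p_s) if pt > 0.0)

def forward(W, x):
    """Toy linear student: logits = W x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def sgd_step(W, train_batch, tau_s, tau_t, lr):
    """One SGD step on the batch-averaged KD loss; returns new weights."""
    grad = [[0.0] * len(W[0]) for _ in W]
    for x, teacher_logits in train_batch:
        p_s = softmax(forward(W, x), tau_s)
        p_t = softmax(teacher_logits, tau_t)
        for c in range(len(W)):
            # d KL / d z_s[c] = (p_s[c] - p_t[c]) / tau_s
            g_logit = (p_s[c] - p_t[c]) / tau_s
            for d in range(len(x)):
                grad[c][d] += g_logit * x[d] / len(train_batch)
    return [[w - lr * g for w, g in zip(row, g_row)]
            for row, g_row in zip(W, grad)]

def val_loss(W, val_batch):
    """Mean cross-entropy of the student on labeled validation samples."""
    return sum(-math.log(softmax(forward(W, x))[y])
               for x, y in val_batch) / len(val_batch)

def meta_step(W, train_batch, val_batch, tau_s, tau_t, lr, meta_lr, eps=1e-4):
    """Pre-update, meta-update of the temperatures, then the real student update."""
    def lookahead(ts, tt):
        # Validation loss of the student after one trial (pre-)update.
        return val_loss(sgd_step(W, train_batch, ts, tt, lr), val_batch)

    # ASSUMPTION: finite differences stand in for backprop through the pre-update.
    g_s = (lookahead(tau_s + eps, tau_t) - lookahead(tau_s - eps, tau_t)) / (2 * eps)
    g_t = (lookahead(tau_s, tau_t + eps) - lookahead(tau_s, tau_t - eps)) / (2 * eps)

    # Gradient step on the temperatures, clamped to stay positive (arbitrary floor).
    tau_s = max(tau_s - meta_lr * g_s, 0.1)
    tau_t = max(tau_t - meta_lr * g_t, 0.1)

    # Update the student with the newly learned temperatures.
    return sgd_step(W, train_batch, tau_s, tau_t, lr), tau_s, tau_t
```

In this sketch, `train_batch` is a list of `(features, teacher_logits)` pairs and `val_batch` is a list of `(features, label)` pairs. Calling `meta_step` repeatedly alternates temperature updates and student updates. With an autodiff framework, the finite-difference lines would be replaced by differentiating the validation loss through the pre-update.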

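Experimental Results

Research Questions

RQ1: Do adaptive teacher and student temperatures mitigate the teacher-student gap and the augmentation incompatibility in KD?
RQ2: Does MKD improve ViT and other architectures when trained only on standard ImageNet-1K data?
RQ3: Do separate teacher and student temperatures outperform shared or grid-searched values?
RQ4: How robust is MKD to dataset size, teacher/student architecture, and data augmentation type?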

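Key Findings

Properly tuned temperatures substantially mitigate the KD degradation caused by strong data augmentation and the capacity gap.
MKD outperforms grid-searched temperatures and standard KD on the CIFAR-100 and ImageNet-1K benchmarks.
On ViT architectures trained from scratch on ImageNet-1K, MKD reaches 86.5% top-1 with ViT-L (compared with a previously reported 85.15%).
Across student sizes, MKD improves on prior ViT distillation methods by 2.0–4.2 percentage points.
Using the temperature prediction network improves both adaptation speed and final performance.
Jointly learning separate tau_s and tau_t gives the best results among the meta-learning settings tested.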


This review was generated by AI and checked by a human editor.