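[Paper Walkthrough] Revisit Knowledge Distillation: a Teacher-free Framework
This paper proposes a teacher-free knowledge distillation method (Tf-KD) in which the student model distills knowledge from itself or from a manually designed regularization distribution, so no pre-trained teacher model is needed. Without any extra computation cost, the method achieves performance comparable to conventional knowledge distillation from a superior teacher and improves ImageNet accuracy by up to 0.65% over established baselines.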
Knowledge Distillation (KD) aims to distill the knowledge of a cumbersome teacher model into a lightweight student model. Its success is generally attributed to the privileged information on similarities among categories provided by the teacher model, and in this sense, only strong teacher models are deployed to teach weaker students in practice. In this work, we challenge this common belief by following experimental observations: 1) beyond the acknowledgment that the teacher can improve the student, the student can also enhance the teacher significantly by reversing the KD procedure; 2) a poorly-trained teacher with much lower accuracy than the student can still improve the latter significantly. To explain these observations, we provide a theoretical analysis of the relationships between KD and label smoothing regularization. We prove that 1) KD is a type of learned label smoothing regularization and 2) label smoothing regularization provides a virtual teacher model for KD. From these results, we argue that the success of KD is not fully due to the similarity information between categories, but also to the regularization of soft targets, which is equally or even more important. Based on these analyses, we further propose a novel Teacher-free Knowledge Distillation (Tf-KD) framework, where a student model learns from itself or manually-designed regularization distribution. The Tf-KD achieves comparable performance with normal KD from a superior teacher, which is well applied when teacher model is unavailable. Meanwhile, Tf-KD is generic and can be directly deployed for training deep neural networks. Without any extra computation cost, Tf-KD achieves up to 0.65% improvement on ImageNet over well-established baseline models, which is superior to label smoothing regularization. The codes are in: this https URL
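Research Motivation and Objectives
- Challenge the common belief that only a strong teacher model can make knowledge distillation effective.
- Investigate whether the success of knowledge distillation comes mainly from the regularization effect of soft targets rather than from similarity information among categories.
- Develop a generic framework that enables distillation without a pre-trained teacher model and without extra computational overhead.
- Show that self-distillation or a manually designed regularization distribution can match the performance of teacher-student knowledge distillation.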
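Proposed Method
- The method formalizes knowledge distillation as a learned form of label smoothing regularization.
- It establishes a theoretical connection between knowledge distillation and label smoothing, showing that label smoothing regularization in effect supplies a virtual teacher model for knowledge distillation.
- During training, the student model's own predictions serve as pseudo soft labels, which yields self-distillation.
- When self-distillation is not sufficient, a manually designed regularization distribution can be used instead.
- The method is generic and can be applied directly to training deep neural networks without architectural changes.
- It requires no additional inference or training computation beyond standard training.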
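The two teacher-free variants described in the list above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`softmax`, `virtual_teacher`, `kl_divergence`, `soft_target_loss`), the weight `alpha`, the temperature `tau`, and the value `correct_prob=0.9` are all hypothetical choices made for this example, and details a real implementation might include (such as softening the hand-designed distribution or rescaling the soft term by the temperature) are left out.

```python
import math

def softmax(logits, tau=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / tau for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def virtual_teacher(num_classes, label, correct_prob=0.9):
    """Hand-designed 'teacher' distribution (assumed form): probability
    correct_prob on the true class, the remainder spread uniformly."""
    rest = (1.0 - correct_prob) / (num_classes - 1)
    return [correct_prob if k == label else rest for k in range(num_classes)]

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; terms with p_k == 0 are skipped."""
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0)

def soft_target_loss(student_logits, label, teacher_probs, alpha=0.5, tau=1.0):
    """(1 - alpha) * cross-entropy with the hard label
    + alpha * KL(teacher || temperature-softened student)."""
    hard_ce = -math.log(softmax(student_logits)[label])
    soft_student = softmax(student_logits, tau)
    return (1 - alpha) * hard_ce + alpha * kl_divergence(teacher_probs, soft_student)

# Usage on a single 3-class example with true label 0.
logits = [2.0, 0.5, -1.0]

# Self-distillation: the soft targets come from a copy of the student itself
# (here a made-up set of logits standing in for an earlier checkpoint).
earlier_logits = [1.5, 0.7, -0.5]
loss_self = soft_target_loss(logits, 0, softmax(earlier_logits, tau=2.0), alpha=0.5, tau=2.0)

# Manually designed regularization distribution in place of any teacher.
loss_reg = soft_target_loss(logits, 0, virtual_teacher(3, 0), alpha=0.5, tau=2.0)
```

Both variants share one loss function and differ only in where `teacher_probs` comes from, which is why neither needs a separately trained, larger teacher network.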
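Experimental Results
Research Questions
- RQ1: Can knowledge distillation remain effective without a pre-trained teacher model?
- RQ2: Does the success of knowledge distillation come mainly from similarity information among categories or from soft-target regularization?
- RQ3: Can a student model distill from itself using soft targets and still improve its performance?
- RQ4: How does knowledge distillation compare with label smoothing regularization in terms of performance?
- RQ5: Can a manually designed regularization distribution match the results of knowledge distillation from a strong teacher model?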
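Main Findings
- Tf-KD improves ImageNet top-1 accuracy by up to 0.65% over well-established baseline models.
- Tf-KD performs comparably to conventional knowledge distillation from a superior teacher model.
- Self-distillation from the student's own predictions yields accuracy gains. Relatedly, a poorly trained teacher with much lower accuracy than the student can still improve the student significantly.
- Label smoothing regularization can be viewed as a special case of knowledge distillation with a fixed, hand-set virtual teacher, while knowledge distillation offers a more flexible, learned form of regularization; on ImageNet, Tf-KD outperforms label smoothing regularization.
- The framework is generic and can be applied directly to training deep neural networks at no extra computation cost.
- The theoretical analysis shows that knowledge distillation can be regarded as a learned label smoothing regularization, and that label smoothing corresponds to distillation from a virtual teacher model.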
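The link between label smoothing and distillation stated in the findings above can be made concrete with a short derivation. The notation here is assumed for illustration rather than taken from the summary: $q$ is the one-hot label distribution, $p$ the model's output distribution, $u$ the uniform distribution over $K$ classes, $\alpha$ the smoothing (or distillation) weight, and $p^{t}_{\tau}$, $p_{\tau}$ the teacher and student outputs softened with temperature $\tau$.

```latex
\begin{align}
q'(k) &= (1-\alpha)\,q(k) + \alpha\,u(k), \qquad u(k) = \tfrac{1}{K} \\
H(q', p) &= (1-\alpha)\,H(q, p) + \alpha\,H(u, p) \\
         &= (1-\alpha)\,H(q, p) + \alpha\,\bigl[D_{\mathrm{KL}}(u \,\|\, p) + H(u)\bigr] \\
L_{\mathrm{KD}} &= (1-\alpha)\,H(q, p) + \alpha\,D_{\mathrm{KL}}\bigl(p^{t}_{\tau} \,\|\, p_{\tau}\bigr)
\end{align}
```

Since $H(u)$ is a constant that does not depend on the model, the label smoothing objective has the same form as the distillation loss with the uniform distribution $u$ acting as a fixed teacher at $\tau = 1$. Read in the other direction, distillation replaces the fixed $u$ with a learned distribution $p^{t}_{\tau}$, which is the sense in which it behaves like a learned label smoothing regularization.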
This walkthrough was generated by AI and reviewed by a human editor.