QUICK REVIEW

[Paper Review] Improved Knowledge Distillation via Teacher Assistant: Bridging the Gap Between Student and Teacher

Seyed Iman Mirzadeh, Mehrdad Farajtabar|arXiv (Cornell University)|Feb 9, 2019

Advanced Neural Network Applications | References: 26 | Citations: 128

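One-Sentence Summary

This paper proposes multi-step knowledge distillation with a teacher assistant to bridge the performance gap between a large teacher network and a small student network. Inserting an intermediate-sized teacher assistant improves knowledge transfer, especially when the size gap between student and teacher is large, and raises student accuracy on CIFAR-10 and CIFAR-100 with both ResNet and plain CNN architectures.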

ABSTRACT

Despite the fact that deep neural networks are powerful models and achieve appealing results on many tasks, they are too gigantic to be deployed on edge devices like smart-phones or embedded sensor nodes. There has been efforts to compress these networks, and a popular method is knowledge distillation, where a large (a.k.a. teacher) pre-trained network is used to train a smaller (a.k.a. student) network. However, in this paper, we show that the student network performance degrades when the gap between student and teacher is large. Given a fixed student network, one cannot employ an arbitrarily large teacher, or in other words, a teacher can effectively transfer its knowledge to students up to a certain size, not smaller. To alleviate this shortcoming, we introduce multi-step knowledge distillation which employs an intermediate-sized network (a.k.a. teacher assistant) to bridge the gap between the student and the teacher. We study the effect of teacher assistant size and extend the framework to multi-step distillation. Moreover, empirical and theoretical analysis are conducted to analyze the teacher assistant knowledge distillation framework. Extensive experiments on CIFAR-10 and CIFAR-100 datasets and plain CNN and ResNet architectures substantiate the effectiveness of our proposed approach.

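Research Motivation and Objectives

Address the performance degradation in knowledge distillation that occurs when the student network is much smaller than the teacher network.
Overcome the limitation that a large teacher cannot transfer its knowledge effectively to a very small student when the gap between them is too large.
Propose a multi-step distillation framework that uses an intermediate-sized teacher assistant as a bridge for knowledge transfer.
Study the effect of teacher assistant size and extend the framework to multi-step distillation for further gains.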

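Proposed Method

Insert an intermediate-sized teacher assistant model between the student and the original teacher in the distillation pipeline, to act as a bridge.
Transfer knowledge in two steps: the teacher assistant first distills knowledge from the large teacher, then passes it on to the smaller student.
Apply knowledge distillation in both steps: first from teacher to teacher assistant, then from teacher assistant to student.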
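Train each step on soft labels, with a loss that combines cross-entropy on the ground-truth labels and a KL-divergence term against the larger model's temperature-softened outputs.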
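Vary the teacher assistant size systematically to analyze its effect on student performance.
Extend the framework to multi-step distillation by chaining several intermediate models, narrowing the gap between student and teacher step by step.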
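
The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation. The function names, the weighting parameter `lam`, the default `temperature`, the KL direction (teacher to student) and the `train_step` callback are all hypothetical choices made for the example. The loss follows the common soft-label formulation: a weighted sum of cross-entropy on the true label and a temperature-scaled KL divergence.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Cross-entropy of the temperature-1 prediction against a hard label index."""
    return -math.log(softmax(logits)[label])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(student_logits, teacher_logits, label,
                      lam=0.5, temperature=4.0):
    """Assumed form: (1 - lam) * CE + lam * T^2 * KL(teacher || student)."""
    hard = cross_entropy(student_logits, label)
    soft = kl_divergence(softmax(teacher_logits, temperature),
                         softmax(student_logits, temperature))
    return (1 - lam) * hard + lam * temperature ** 2 * soft


def distill_chain(models, train_step):
    """Distill along a chain of models ordered from largest to smallest.

    models[0] is the pre-trained teacher. train_step(student, teacher) is a
    hypothetical callback that trains `student` against `teacher`, for
    example by minimizing distillation_loss over a dataset.
    """
    teacher = models[0]
    for student in models[1:]:
        train_step(student, teacher)
        teacher = student  # the freshly trained model teaches the next one
    return teacher
```

With `models = [teacher, assistant, student]` the loop performs exactly the two steps described above. A longer list gives the multi-step variant without changing the code.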

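Experimental Results

Research Questions

RQ1: Can a teacher assistant effectively bridge the performance gap between a large teacher and a small student?
RQ2: How does the size of the teacher assistant affect the accuracy of the final student model?
RQ3: When the size gap between student and teacher is large, does multi-step distillation outperform single-step distillation?
RQ4: What theoretical and empirical evidence supports using intermediate models to improve knowledge transfer?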

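Key Findings

Compared with standard knowledge distillation, the proposed teacher assistant framework improves student accuracy when the size gap between student and teacher is large.
When the teacher is too large relative to the student, the student's performance degrades, which confirms a practical upper limit on effective knowledge transfer.
The best teacher assistant size lies between the student and the original teacher, with performance peaking at intermediate sizes.
Multi-step distillation with several teacher assistants further improves accuracy on CIFAR-10 and CIFAR-100, particularly on deeper architectures such as ResNet.
The gains are consistent across both plain CNN and ResNet models on CIFAR-10 and CIFAR-100.
The theoretical and empirical analyses suggest that a teacher assistant reduces distribution shift and improves feature alignment during knowledge transfer.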
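
To compare single-step and multi-step distillation, as the findings above do, one has to decide which intermediate sizes to include in the chain. The sketch below enumerates every candidate path between a teacher and a student. It is an illustration only: the function name and the integer "sizes" (read them as layer counts, for instance) are hypothetical and are not taken from the paper's experimental setup.

```python
from itertools import combinations


def distillation_paths(teacher_size, student_size, intermediate_sizes):
    """List every large-to-small path from teacher to student.

    Each path starts at teacher_size, ends at student_size, and visits any
    subset of the intermediate sizes in decreasing order. Sizes outside the
    open interval (student_size, teacher_size) are ignored.
    """
    middle = sorted(
        (s for s in intermediate_sizes if student_size < s < teacher_size),
        reverse=True,
    )
    paths = []
    for count in range(len(middle) + 1):
        # combinations() preserves input order, so each subset stays descending
        for subset in combinations(middle, count):
            paths.append([teacher_size, *subset, student_size])
    return paths


if __name__ == "__main__":
    # Illustrative sizes only.
    for path in distillation_paths(10, 2, [8, 6, 4]):
        print(" -> ".join(str(size) for size in path))
```

With three candidate assistants there are eight paths, from direct distillation (`10 -> 2`) to the full chain (`10 -> 8 -> 6 -> 4 -> 2`). Each path would then be trained and evaluated separately to see where accuracy peaks.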

This review was generated by AI and reviewed by human editors.