QUICK REVIEW

[Paper Review] Be Your Own Teacher: Improve the Performance of Convolutional Neural Networks via Self Distillation

Linfeng Zhang, Jiebo Song|arXiv (Cornell University)|May 17, 2019

Advanced Neural Network Applications | References: 40 | Cited by: 83

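One-Sentence Summary

Introduces self distillation, a training framework in which the deeper sections of a network teach the shallower sections of the same model, improving accuracy without adding inference cost. It raises accuracy on CIFAR100 by about 2.65% on average and enables depth-adaptive inference.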

ABSTRACT

Convolutional neural networks have been widely deployed in various application scenarios. In order to extend the applications' boundaries to some accuracy-crucial domains, researchers have been investigating approaches to boost accuracy through either deeper or wider network structures, which brings with them the exponential increment of the computational and storage cost, delaying the responding time. In this paper, we propose a general training framework named self distillation, which notably enhances the performance (accuracy) of convolutional neural networks through shrinking the size of the network rather than aggrandizing it. Different from traditional knowledge distillation - a knowledge transformation methodology among networks, which forces student neural networks to approximate the softmax layer outputs of pre-trained teacher neural networks, the proposed self distillation framework distills knowledge within network itself. The networks are firstly divided into several sections. Then the knowledge in the deeper portion of the networks is squeezed into the shallow ones. Experiments further prove the generalization of the proposed self distillation framework: enhancement of accuracy at average level is 2.65%, varying from 0.61% in ResNeXt as minimum to 4.07% in VGG19 as maximum. In addition, it can also provide flexibility of depth-wise scalable inference on resource-limited edge devices. Our codes will be released on github soon.

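Motivation and Objectives

Improve CNN accuracy while reducing computation, motivated by applications where accuracy is critical.
Propose a self distillation framework that distills knowledge within a single network by dividing it into shallow sections, each with its own classifier.
Show that self distillation improves accuracy across multiple architectures and datasets without adding inference cost.
Demonstrate the method's benefit for scalable, depth-aware inference on resource-constrained devices.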

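Proposed Method

Divide the target CNN by depth into several shallow sections.
Attach a bottleneck layer and a fully connected classifier after each section (used only for training).
Train all shallow classifiers as students, distilling from the deepest classifier, which acts as the teacher.
Supervise each shallow classifier with three losses: (1) cross-entropy against the labels, (2) KL divergence between the shallow classifier's output and the deepest classifier's output, and (3) an L2 hint loss that aligns the shallow feature maps with the deepest feature maps through the bottleneck layers.
Optimize the sum of the per-classifier losses, with alpha and lambda balancing the three supervision signals; the deepest classifier is supervised by the labels alone.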
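
The loss described in this section can be sketched in plain Python for a single training sample. This is a minimal illustration, not the authors' implementation. The following are assumptions made for the example: the `(1 - alpha)` and `alpha` weighting of the two output-level terms, the KL direction (teacher relative to student), the placeholder defaults for `alpha` and `lam`, and every function name. The feature vectors stand in for bottleneck outputs that have already been brought to a common size.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Cross-entropy of predicted probabilities against an integer class label."""
    return -math.log(probs[label])


def kl_divergence(teacher, student):
    """KL(teacher || student) for two probability vectors (direction is an assumption)."""
    return sum(t * math.log(t / s) for t, s in zip(teacher, student) if t > 0)


def l2_hint(student_feat, teacher_feat):
    """Squared L2 distance between two flattened feature vectors of equal length."""
    return sum((a - b) ** 2 for a, b in zip(student_feat, teacher_feat))


def self_distillation_loss(logits, feats, label, alpha=0.3, lam=0.03):
    """Sum of per-classifier losses; the last entry of each list is the deepest classifier.

    alpha and lam are placeholder values chosen for illustration only.
    """
    teacher_probs = softmax(logits[-1])
    teacher_feat = feats[-1]
    # Deepest classifier: supervised by the label alone.
    total = cross_entropy(teacher_probs, label)
    # Shallow classifiers: label loss, KL to the teacher output, L2 hint to the teacher features.
    for z, f in zip(logits[:-1], feats[:-1]):
        probs = softmax(z)
        ce = cross_entropy(probs, label)
        kl = kl_divergence(teacher_probs, probs)
        hint = l2_hint(f, teacher_feat)
        total += (1 - alpha) * ce + alpha * kl + lam * hint
    return total


# Toy example: two shallow classifiers plus the deepest one, three classes.
example_logits = [[2.0, 1.0, 0.1], [0.5, 2.5, 0.2], [3.0, 0.2, 0.1]]
example_feats = [[0.1, 0.2], [0.3, 0.1], [0.3, 0.1]]
print(self_distillation_loss(example_logits, example_feats, label=0))
```

When a shallow classifier already matches the deepest one exactly, its KL and hint terms vanish and only the label term remains, which is a quick sanity check for any implementation of this loss.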

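Experimental Results

Research Questions

RQ1: Does self distillation improve accuracy across different CNN architectures and datasets without adding inference cost?
RQ2: Do the shallow classifiers benefit from distillation from the deepest classifier, and how does this affect overall model performance and training efficiency?
RQ3: How does self distillation compare with traditional distillation and deeply supervised networks in accuracy, training time, and practicality on edge devices?
RQ4: Does the method support scalable, depth-aware inference suited to resource-constrained environments?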

Key Findings

Accuracy (%) on CIFAR100:
Neural network	Baseline	Classifier 1/4	Classifier 2/4	Classifier 3/4	Classifier 4/4	Ensemble
VGG19(BN)	64.47	63.59	67.04	68.03	67.73	68.54
ResNet18	77.09	67.85	74.57	78.23	78.64	79.67
ResNet50	77.68	68.23	74.21	75.23	80.56	81.04
ResNet101	77.98	69.45	77.29	81.17	81.23	82.03
ResNet152	79.21	68.84	78.72	81.43	81.61	82.29
ResNeXt29-8	81.29	71.15	79.00	81.48	81.51	81.90
WideResNet20-8	79.76	68.85	78.15	80.98	80.92	81.38
WideResNet44-8	79.93	72.54	81.15	81.96	82.09	82.61
WideResNet28-12	80.07	71.21	80.86	81.58	81.59	82.09
PyramidNet101-240	81.12	69.23	78.15	80.98	82.30	83.51
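
The headline figures quoted in this review (an average gain of 2.65%, from 0.61% to 4.07%) can be reproduced from the table above if the gain is read as the ensemble column minus the baseline column. That reading is inferred from the numbers rather than stated in the review, so the short check below should be treated as a consistency check under that assumption. The variable names are illustrative.

```python
# Baseline and ensemble accuracy (%) copied from the table above.
results = {
    "VGG19(BN)": (64.47, 68.54),
    "ResNet18": (77.09, 79.67),
    "ResNet50": (77.68, 81.04),
    "ResNet101": (77.98, 82.03),
    "ResNet152": (79.21, 82.29),
    "ResNeXt29-8": (81.29, 81.90),
    "WideResNet20-8": (79.76, 81.38),
    "WideResNet44-8": (79.93, 82.61),
    "WideResNet28-12": (80.07, 82.09),
    "PyramidNet101-240": (81.12, 83.51),
}

# Gain in percentage points, assuming gain = ensemble - baseline.
gains = {name: round(ens - base, 2) for name, (base, ens) in results.items()}

mean_gain = round(sum(gains.values()) / len(gains), 2)
smallest = min(gains, key=gains.get)
largest = max(gains, key=gains.get)

print(f"mean gain: {mean_gain:.2f}")                     # mean gain: 2.65
print(f"smallest: {smallest} {gains[smallest]:.2f}")     # smallest: ResNeXt29-8 0.61
print(f"largest: {largest} {gains[largest]:.2f}")        # largest: VGG19(BN) 4.07
```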

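On CIFAR100, self distillation improves accuracy by 2.65% on average across the tested networks, ranging from 0.61% (ResNeXt) to 4.07% (VGG19).
On ImageNet, the average accuracy improvement across the evaluated networks is 2.02%.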
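Deeper networks tend to benefit more from self distillation (for example, comparing the ensemble and baseline columns of the table, ResNet101 and ResNet152 gain 4.05 and 3.08 points, against 2.58 for ResNet18).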
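Self distillation supports depth-wise scalable inference: using a shallow classifier at inference time yields a meaningful speedup at a modest cost in accuracy.
Compared with traditional distillation, self distillation generally delivers an equal or better accuracy improvement, needs no separate teacher model, and trains faster (for example, 4.6x faster in the CIFAR100 experiments).
In every reported case, shallow classifiers trained with self distillation outperform shallow classifiers trained with deep supervision.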
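
Depth-wise scalable inference can be pictured as choosing which classifier to stop at for a given compute budget. The sketch below is one hypothetical selection rule, not a procedure taken from the paper. The accuracies are the ResNet18 row of the table, while the `relative_cost` values are made-up placeholders (not measured speedups), and the names `Exit` and `pick_exit` are invented for the example.

```python
from typing import NamedTuple, Optional, Sequence


class Exit(NamedTuple):
    name: str
    relative_cost: float  # HYPOTHETICAL fraction of full-network compute, not measured
    accuracy: float       # accuracy (%) from the ResNet18 row of the table


RESNET18_EXITS = [
    Exit("classifier 1/4", 0.25, 67.85),
    Exit("classifier 2/4", 0.50, 74.57),
    Exit("classifier 3/4", 0.75, 78.23),
    Exit("classifier 4/4", 1.00, 78.64),
]


def pick_exit(exits: Sequence[Exit], budget: float) -> Optional[Exit]:
    """Return the most accurate exit whose cost fits the budget, or None if none fits."""
    affordable = [e for e in exits if e.relative_cost <= budget]
    if not affordable:
        return None
    return max(affordable, key=lambda e: e.accuracy)


# A device that can afford about 60% of the full network's compute.
choice = pick_exit(RESNET18_EXITS, budget=0.6)
print(choice.name if choice else "no exit fits the budget")
```

With real measurements, the placeholder costs would be replaced by the latency or FLOPs recorded for each exit on the target device.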

This review was generated by AI and reviewed by a human editor.