QUICK REVIEW

[Paper Review] Knowledge Distillation in Generations: More Tolerant Teachers Educate Better Students

Chenglin Yang, Lingxi Xie|arXiv (Cornell University)|May 15, 2018

Online Learning and Analytics|References: 41|Cited by: 67

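ONE-SENTENCE SUMMARY

This paper proposes training neural networks in generations with a tolerant teacher, which uses a top-score difference loss to assign soft confidence to secondary classes so that the student can learn inter-class similarity, and reports higher accuracy than the baselines on CIFAR100 and ILSVRC2012.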

ABSTRACT

We focus on the problem of training a deep neural network in generations. The flowchart is that, in order to optimize the target network (student), another network (teacher) with the same architecture is first trained, and used to provide part of supervision signals in the next stage. While this strategy leads to a higher accuracy, many aspects (e.g., why teacher-student optimization helps) still need further explorations. This paper studies this problem from a perspective of controlling the strictness in training the teacher network. Existing approaches mostly used a hard distribution (e.g., one-hot vectors) in training, leading to a strict teacher which itself has a high accuracy, but we argue that the teacher needs to be more tolerant, although this often implies a lower accuracy. The implementation is very easy, with merely an extra loss term added to the teacher network, facilitating a few secondary classes to emerge and complement to the primary class. Consequently, the teacher provides a milder supervision signal (a less peaked distribution), and makes it possible for the student to learn from inter-class similarity and potentially lower the risk of over-fitting. Experiments are performed on standard image classification tasks (CIFAR100 and ILSVRC2012). Although the teacher network behaves less powerful, the students show a persistent ability growth and eventually achieve higher classification accuracies than other competitors. Model ensemble and transfer feature extraction also verify the effectiveness of our approach.

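MOTIVATION AND GOALS

Explain why teacher-student optimization helps, beyond what the teacher's own accuracy would suggest.
Introduce a tolerant teacher whose supervision signal preserves inter-class similarity.
Propose and evaluate the top-score-difference (TSD) loss as a way to produce useful secondary information.
Quantify how secondary information affects learning dynamics using the Dist^C and Dist^S metrics.
Show that training in generations improves student performance on standard datasets (CIFAR100 and ILSVRC2012).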

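PROPOSED METHOD

Frame optimization as training in generations, with a first-generation (grandfather) teacher followed by successive students.
Train each student with a hybrid supervision loss that combines ground-truth labels with teacher guidance (Eq. 2).
Introduce a tolerant-teacher objective that preserves secondary information by softening the output distribution through a scheme over the top K classes (Eq. 3).
Parameterize the method by K, u(η), and λ; set K = 5 and use u(η) in place of η for stability (Eq. 3.4).
Compare baseline one-hot training, label smoothing, and the confidence penalty against top-score-difference (TSD) variants on CIFAR100 and ILSVRC2012.
Assess the quality of secondary information with the Dist^C and Dist^S metrics and relate it to final accuracy.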
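
The two losses described above can be sketched in plain Python. This review does not reproduce Eq. 2 or Eq. 3, so the forms below are assumptions: the teacher loss adds to the cross-entropy a penalty on the gap between the top score and the mean of the next K-1 scores, and the student loss mixes cross-entropy with a KL term toward the teacher. The function names and the defaults `eta=0.6` and `lam=0.6` are hypothetical, and the stabilizing u(η) substitution is left out.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def tsd_teacher_loss(logits, label, eta=0.6, k=5):
    """Assumed tolerant-teacher loss: cross-entropy + eta * top-score gap.

    The gap is the highest class score minus the mean of the next k-1
    highest scores, so a peaked distribution is penalized more.
    """
    probs = softmax(logits)
    cross_entropy = -math.log(probs[label])
    top = sorted(probs, reverse=True)[:k]
    if len(top) < 2:
        return cross_entropy  # no secondary classes to compare against
    gap = top[0] - sum(top[1:]) / (len(top) - 1)
    return cross_entropy + eta * gap


def student_loss(student_logits, teacher_logits, label, lam=0.6):
    """Assumed hybrid loss: lam * cross-entropy + (1 - lam) * KL(teacher || student)."""
    p_s = softmax(student_logits)
    p_t = softmax(teacher_logits)
    cross_entropy = -math.log(p_s[label])
    kl = sum(t * math.log(t / s) for t, s in zip(p_t, p_s) if t > 0)
    return lam * cross_entropy + (1 - lam) * kl


# Illustrative logits only, not values from the paper.
peaked = [8.0, 0.0, 0.0, 0.0, 0.0, 0.0]
tolerant = [2.0, 1.5, 1.0, 0.0, 0.0, 0.0]
penalty_peaked = tsd_teacher_loss(peaked, 0, eta=1.0) - tsd_teacher_loss(peaked, 0, eta=0.0)
penalty_tolerant = tsd_teacher_loss(tolerant, 0, eta=1.0) - tsd_teacher_loss(tolerant, 0, eta=0.0)
```

With these inputs the penalty term is larger for the peaked logits than for the flatter ones, which is the behaviour the tolerant teacher is meant to encourage. Extreme logits could underflow to a zero probability and break the logarithms; a production version would clamp the probabilities.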

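EXPERIMENTAL RESULTS

Research Questions

RQ1: Does training with a tolerant teacher that preserves secondary information improve student accuracy when training in generations?
RQ2: How should the teacher's softened distribution be designed (which secondary classes to emphasize) to maximize the student's gain?
RQ3: Which quantitative metrics (Dist^C, Dist^S) correlate with better student performance when training in generations?
RQ4: Does the generation-based, tolerant-teacher approach transfer from CIFAR-like settings to large-scale datasets such as ILSVRC2012?
RQ5: Which hyperparameters (K, η via u(η), λ) yield the largest gain across architectures?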

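Key Findings

A tolerant teacher that preserves secondary information yields consistent student gains across generations, usually surpassing the grandfather (first-generation) baseline.
On CIFAR100, the largest gain comes from TSD-0.6, which reaches a higher final test accuracy than the baseline and the other compared methods; deeper CNNs benefit similarly.
On CIFAR100, the best reported accuracy of a tolerant-teacher variant is 73.72% (against reported baselines of roughly 71.5%–72.5%), and ensembling improves the results further.
On ILSVRC2012 (ResNet-18), the tolerant-teacher variant D(0.6,0.6) in its best generation reduces the top-1 error from about 30.50% to 29.60% and the top-5 error from 11.07% to 10.11%; ensembling brings further gains.
DenseNets (100/190 layers) trained with D(0.6,0.6) or D(0.7,0.6) gain 1–2% as single models and more than 5% as ensembles, approaching the state of the art with no extra inference-latency cost.
The study introduces Dist^C and Dist^S as measures of class discrimination at the coarse-grained and semantic levels, associating a higher Dist^S with better coarse-grained learning and a lower Dist^C with meaningful secondary information.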
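
The ensemble results mentioned above combine several trained models. This review does not say how the paper forms its ensembles, so the sketch below assumes the simplest option: averaging the class distributions of models from different generations and taking the highest-scoring class. The function name `ensemble_predict` and the probabilities are hypothetical, made up for illustration.

```python
def ensemble_predict(prob_lists):
    """Average per-model class distributions; return (average, predicted class)."""
    num_models = len(prob_lists)
    num_classes = len(prob_lists[0])
    avg = [sum(p[c] for p in prob_lists) / num_models for c in range(num_classes)]
    predicted = max(range(num_classes), key=lambda c: avg[c])
    return avg, predicted


# Made-up outputs of three generations for one image with three classes.
gen_outputs = [
    [0.50, 0.40, 0.10],  # generation 1 prefers class 0
    [0.30, 0.60, 0.10],  # generation 2 prefers class 1
    [0.35, 0.55, 0.10],  # generation 3 prefers class 1
]
avg, predicted = ensemble_predict(gen_outputs)
```

Here the averaged distribution favours class 1 even though the first model alone would have picked class 0, which shows how disagreeing generations can correct one another.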


This review was generated by AI and checked by a human editor.