QUICK REVIEW

[Paper Review] Knowledge Distillation via Route Constrained Optimization

Jin Xiao, Baoyun Peng|arXiv (Cornell University)|Apr 19, 2019

Advanced Neural Network Applications | References: 34 | Cited by: 36

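One-Sentence Summary

RCO trains the student against a sequence of teacher checkpoints taken along the teacher's training route, ordered from easy to hard, and outperforms standard knowledge distillation on CIFAR-100, ImageNet, and MegaFace. Guiding the student through these anchor points lowers the congruence loss that the student can reach.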

ABSTRACT

Distillation-based learning boosts the performance of the miniaturized neural network based on the hypothesis that the representation of a teacher model can be used as structured and relatively weak supervision, and thus would be easily learned by a miniaturized model. However, we find that the representation of a converged heavy model is still a strong constraint for training a small student model, which leads to a high lower bound of congruence loss. In this work, inspired by curriculum learning we consider the knowledge distillation from the perspective of curriculum learning by routing. Instead of supervising the student model with a converged teacher model, we supervised it with some anchor points selected from the route in parameter space that the teacher model passed by, as we called route constrained optimization (RCO). We experimentally demonstrate this simple operation greatly reduces the lower bound of congruence loss for knowledge distillation, hint and mimicking learning. On close-set classification tasks like CIFAR100 and ImageNet, RCO improves knowledge distillation by 2.14% and 1.5% respectively. For the sake of evaluating the generalization, we also test RCO on the open-set face recognition task MegaFace.

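Motivation and Objectives

Rethink knowledge distillation from the perspective of the teacher's optimization route.
Propose route constrained optimization (RCO), which guides the student with anchor points on the teacher's trajectory.
Show that RCO outperforms standard KD and state-of-the-art methods on classification and face recognition tasks.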

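Proposed Method

Define the teacher (φ_t) and the student (φ_s), and perform knowledge distillation with softmax outputs and a KL or MSE loss.
Show that supervision from a converged teacher can be harder for the student to fit because of the capacity gap between the two models.
Introduce anchor points C_1,...,C_n on the teacher's training route and have the student fit φ_t(·;W_{C_i}) at each step, in order.
Provide two anchor-selection strategies: Equal Epoch Interval (EEI), and Greedy Search (GS), which selects informative anchors based on KL divergence.
Formulate the loss at step i as Loss_i = H(y, φ_s(x;W_s)) + λ H(φ_s(x;W_s), φ_t(x;W_{C_i})), where H denotes cross-entropy and λ weights the distillation term.
Discuss the curriculum-learning rationale and the easy-to-hard sequence created by the teacher's trajectory.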
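
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the student is a hypothetical linear softmax classifier, the "teacher checkpoints" are synthetic logits, and the helper names (`eei_anchor_epochs`, `rco_stage_loss`, `train_student_along_route`) are invented for this example. The soft term treats the anchor's distribution as the target, which is the common KD convention; the argument order in the formula above is written the other way round, so that choice is also an assumption. Appending the converged teacher as the last EEI anchor is assumed as well.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted for numerical stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(target_probs, pred_probs, eps=1e-12):
    """Mean cross-entropy H(target, pred) over the batch."""
    return float(-(target_probs * np.log(pred_probs + eps)).sum(axis=1).mean())

def eei_anchor_epochs(total_epochs, interval):
    """Equal Epoch Interval: keep one teacher checkpoint every `interval` epochs."""
    if interval <= 0:
        raise ValueError("interval must be positive")
    epochs = list(range(interval, total_epochs + 1, interval))
    if not epochs or epochs[-1] != total_epochs:
        epochs.append(total_epochs)  # assumption: always finish on the converged teacher
    return epochs

def rco_stage_loss(labels_onehot, student_logits, anchor_logits, lam=1.0):
    """Hard-label term plus a lambda-weighted soft term against anchor C_i."""
    p_s = softmax(student_logits)
    p_t = softmax(anchor_logits)
    hard = cross_entropy(labels_onehot, p_s)
    soft = cross_entropy(p_t, p_s)  # assumption: anchor output is the target
    return hard + lam * soft

def train_student_along_route(x, labels_onehot, anchor_logits_list,
                              lam=1.0, lr=0.1, steps=50):
    """Fit a linear softmax student to each anchor in turn, warm-starting each stage."""
    w = np.zeros((x.shape[1], labels_onehot.shape[1]))
    history = []
    for anchor_logits in anchor_logits_list:  # ordered from early checkpoint to converged
        p_t = softmax(anchor_logits)
        for _ in range(steps):
            p_s = softmax(x @ w)
            # Gradient of the stage loss with respect to the student logits.
            grad_logits = (p_s - labels_onehot) + lam * (p_s - p_t)
            w -= lr * (x.T @ grad_logits) / x.shape[0]
        history.append(rco_stage_loss(labels_onehot, x @ w, anchor_logits, lam))
    return w, history

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(64, 5))
    final_logits = x @ rng.normal(size=(5, 3))  # stand-in for the converged teacher
    labels = np.eye(3)[final_logits.argmax(axis=1)]
    # Stand-ins for checkpoints: earlier anchors are softer versions of the final teacher.
    anchors = [scale * final_logits for scale in (0.25, 0.5, 1.0)]
    print(eei_anchor_epochs(total_epochs=200, interval=50))
    _, history = train_student_along_route(x, labels, anchors)
    print(history)
```

The key design point is that the student's weights carry over from one anchor to the next, so each stage starts from a solution that already matches an easier target.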

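Experimental Results

Research Questions

RQ1: Compared with using only the final converged teacher, does supervising with intermediate states along the teacher's training route lower the congruence loss of a small model?
RQ2: Do the anchor-selection strategies (EEI, GS) create a useful learning progression that improves student performance and training efficiency?
RQ3: Are RCO's gains consistent across CIFAR-100, ImageNet-1K, and MegaFace face recognition?
RQ4: Can RCO be combined with existing knowledge transfer methods to further improve performance?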

Key Findings

Dataset (Method)	Model	Top-1 (%)	Top-5 (%)
CIFAR-100 (KD)	S-KD (MobileNetV2)	68.71	-
CIFAR-100 (RCO)	S-RCO (MobileNetV2)	70.85	-
ImageNet (KD)	Student-KD (MobileNetV2)	66.75	87.30
ImageNet (RCO)	Student-RCO (MobileNetV2)	68.21	88.04

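RCO improves Top-1 accuracy over standard KD by about 2.14 percentage points on CIFAR-100 and about 1.5 percentage points on ImageNet-1K.
On the MegaFace face recognition task, RCO reaches 84.3% accuracy with only 0.8M parameters, surpassing previous methods.
RCO outperforms KD and other SOTA methods on all reported metrics across CIFAR-100, ImageNet, and MegaFace.
The greedy search anchor strategy performs best among the evaluated strategies; EEI and GS offer a trade-off between training time and accuracy.
RCO can be combined with previous knowledge transfer methods to further improve performance.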
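
As a quick check of the headline gains, the short snippet below recomputes them from the accuracies in the table. The dictionary layout and the `gain` helper are illustrative only. The ImageNet Top-1 difference comes out at 1.46 points, which matches the reported figure of roughly 1.5.

```python
# Top-1 / Top-5 accuracies (%) copied from the results table.
results = {
    "CIFAR-100": {"KD": {"top1": 68.71}, "RCO": {"top1": 70.85}},
    "ImageNet": {
        "KD": {"top1": 66.75, "top5": 87.30},
        "RCO": {"top1": 68.21, "top5": 88.04},
    },
}

def gain(dataset, metric):
    """Absolute improvement of RCO over KD, in percentage points."""
    return round(results[dataset]["RCO"][metric] - results[dataset]["KD"][metric], 2)

for dataset, metric in [("CIFAR-100", "top1"), ("ImageNet", "top1"), ("ImageNet", "top5")]:
    print(f"{dataset} {metric}: +{gain(dataset, metric):.2f} points")
```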

This review was generated by AI and reviewed by human editors.