QUICK REVIEW

[Paper Review] Feature Matters: A Stage-by-Stage Approach for Knowledge Transfer.

Mengya Gao, Yujun Shen|arXiv (Cornell University)|Dec 5, 2018

Advanced Neural Network Applications | References: 21 | Cited by: 3

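One-Sentence Summary

This paper proposes Stage-by-Stage Knowledge Distillation (SSKD), a two-stage training method that first transfers the teacher's feature-representation knowledge to the student and then trains only the task-specific head. By decoupling feature transfer from head training, SSKD removes the need for a hand-tuned loss weight; it outperforms state-of-the-art approaches on CIFAR-100 and ImageNet and generalizes to face recognition on IJB-A and object detection on COCO.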

ABSTRACT

Knowledge Distillation (KD) aims at improving the performance of a low-capacity student model by inheriting knowledge from a high-capacity teacher model. Previous KD methods typically train a student by minimizing a task-related loss and the KD loss simultaneously, using a pre-defined loss weight to balance these two terms. In this work, we propose to first transfer the backbone knowledge from a teacher to the student, and then only learn the task-head of the student network. Such a decomposition of the training process circumvents the need of choosing an appropriate loss weight, which is often difficult in practice, and thus makes it easier to apply to different datasets and tasks. Importantly, the decomposition permits the core of our method, Stage-by-Stage Knowledge Distillation (SSKD), which facilitates progressive feature mimicking from teacher to student. Extensive experiments on CIFAR-100 and ImageNet suggest that SSKD significantly narrows down the performance gap between student and teacher, outperforming state-of-the-art approaches. We also demonstrate the generalization ability of SSKD on other challenging benchmarks, including face recognition on IJB-A dataset as well as object detection on COCO dataset.

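Motivation and Objectives

Address hyperparameter sensitivity in knowledge distillation, in particular the difficulty of balancing the task loss against the KD loss.
Improve student performance through progressive, structured feature mimicking from teacher to student.
Remove the need for a hand-tuned loss weight by decoupling the training process.
Demonstrate generalization across diverse vision tasks, including image classification, face recognition, and object detection.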
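
The loss-weighting difficulty can be made concrete by writing both training schemes side by side. This is a notational sketch only: the symbols (backbone parameters \theta_b, head parameters \theta_h, weight \lambda, student and teacher feature extractors f_S and f_T, student head h_S) are assumptions introduced here for illustration, not the paper's own notation.

```latex
% Joint training in earlier KD methods: \lambda is a pre-defined weight
% that has to be chosen per dataset and task.
\mathcal{L}_{\text{joint}}(\theta_b, \theta_h)
  = \mathcal{L}_{\text{task}}(\theta_b, \theta_h)
  + \lambda \, \mathcal{L}_{\text{KD}}(\theta_b, \theta_h)

% Decoupled training: two separate minimizations, so no \lambda appears.
\theta_b^{*} = \arg\min_{\theta_b}
  \mathcal{L}_{\text{feat}}\bigl(f_S(x; \theta_b),\, f_T(x)\bigr)

\theta_h^{*} = \arg\min_{\theta_h}
  \mathcal{L}_{\text{task}}\bigl(h_S(f_S(x; \theta_b^{*}); \theta_h),\, y\bigr)
```

Since each minimization involves a single loss term, there is nothing left to balance between the two.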

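Proposed Method

The method decomposes knowledge distillation into two separate stages: in the first, the student's backbone is trained to mimic the teacher's features; in the second, only the task-specific head is trained.
In the first stage, a feature-level distillation loss (for example, feature-map matching or a contrastive loss) aligns the student's feature representation with the teacher's.
The second stage is standard training with the task-specific loss: the student's head parameters are updated while the feature extractor stays frozen.
Because the task loss and the KD loss are never optimized jointly end to end, no loss-weighting hyperparameter is needed.
The decomposition enables progressive feature mimicking, so the student learns the teacher's hierarchical representations step by step.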
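
The training schedule described above can be illustrated with a toy, dependency-free sketch. It is not the authors' implementation: the networks are replaced by tiny linear layers, the mimicking loss is assumed to be mean squared error, the student is given the same feature width as the teacher for simplicity, and every name (`T1`, `S1`, `fit`, `apply`, and so on) is hypothetical.

```python
import random

def apply(W, x):
    """Apply a linear layer W (a list of rows) to the vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def mse(W, inputs, targets):
    """Mean squared error of layer W over a dataset."""
    total = 0.0
    for x, t in zip(inputs, targets):
        total += sum((p - ti) ** 2 for p, ti in zip(apply(W, x), t))
    return total / len(inputs)

def fit(W, inputs, targets, lr=0.05, epochs=200):
    """Plain SGD on squared error. Only W is updated (in place)."""
    for _ in range(epochs):
        for x, t in zip(inputs, targets):
            pred = apply(W, x)
            for i, row in enumerate(W):
                grad_scale = 2.0 * (pred[i] - t[i])
                for j in range(len(row)):
                    row[j] -= lr * grad_scale * x[j]

rng = random.Random(0)
xs = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(20)]

# Hypothetical frozen teacher: two backbone stages plus a head.
T1 = [[0.5, -0.3, 0.8], [0.2, 0.7, -0.4]]
T2 = [[1.0, -0.5], [0.3, 0.9]]
T_HEAD = [[0.6, -0.2]]
t1 = [apply(T1, x) for x in xs]          # teacher stage-1 features
t2 = [apply(T2, f) for f in t1]          # teacher stage-2 features
labels = [apply(T_HEAD, f) for f in t2]  # toy labels, solvable by design

# Student: small random backbone stages, zero-initialized head.
S1 = [[rng.uniform(-0.1, 0.1) for _ in range(3)] for _ in range(2)]
S2 = [[rng.uniform(-0.1, 0.1) for _ in range(2)] for _ in range(2)]
HEAD = [[0.0, 0.0]]

# Backbone transfer, first stage: S1 mimics the teacher's stage-1 features.
stage1_before = mse(S1, xs, t1)
fit(S1, xs, t1)
stage1_after = mse(S1, xs, t1)

# Backbone transfer, second stage: S1 is frozen, S2 mimics stage-2 features.
s1_feats = [apply(S1, x) for x in xs]
stage2_before = mse(S2, s1_feats, t2)
fit(S2, s1_feats, t2)
stage2_after = mse(S2, s1_feats, t2)

# Head training: the whole backbone is frozen, only HEAD sees the task loss.
backbone_snapshot = ([row[:] for row in S1], [row[:] for row in S2])
s2_feats = [apply(S2, f) for f in s1_feats]
task_before = mse(HEAD, s2_feats, labels)
fit(HEAD, s2_feats, labels)
task_after = mse(HEAD, s2_feats, labels)
backbone_unchanged = backbone_snapshot == (S1, S2)

print("stage-1 feature MSE:", stage1_before, "->", stage1_after)
print("stage-2 feature MSE:", stage2_before, "->", stage2_after)
print("task loss:", task_before, "->", task_after)
print("backbone unchanged during head training:", backbone_unchanged)
```

Each call to `fit` optimizes exactly one loss over exactly one group of parameters, which is why no weighting coefficient appears anywhere in the sketch. In a real setting the student would be narrower than the teacher, so the mimicking step would need some way to compare features of different sizes; that detail is outside what this summary describes.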

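Experimental Results

Research Questions

RQ1: Does decoupling knowledge distillation into separate stages improve model performance without tuning a loss weight?
RQ2: How does stage-wise training affect the student's feature representation learning across different datasets?
RQ3: To what extent does the proposed method generalize to vision tasks beyond image classification?
RQ4: Compared with joint training, does progressive feature mimicking yield better alignment between student and teacher features?

Key Findings

SSKD significantly narrows the performance gap between student and teacher on CIFAR-100 and ImageNet and outperforms existing state-of-the-art methods.
On ImageNet, the method achieves superior accuracy with student models that have far fewer parameters than the teacher.
On the IJB-A face recognition benchmark, SSKD generalizes well and outperforms standard KD baselines.
On COCO object detection, SSKD achieves competitive results, confirming its effectiveness beyond classification.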

This review was generated by AI and reviewed by human editors.