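[Paper Summary] Rethinking Soft Labels for Knowledge Distillation: A Bias-Variance Tradeoff Perspective
This paper analyzes how soft labels in knowledge distillation induce a sample-wise bias-variance tradeoff, and introduces weighted soft labels to balance that tradeoff adaptively. Experiments on standard benchmark datasets validate the approach.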
Knowledge distillation is an effective approach to leverage a well-trained network, or an ensemble of them, called the teacher, to guide the training of a student network. The outputs of the teacher network are used as soft labels to supervise the training of a new network. Recent studies \citep{muller2019does,yuan2020revisiting} revealed an intriguing property of soft labels: making labels soft serves as a good regularizer for the student network. From the perspective of statistical learning, regularization aims to reduce variance; however, how bias and variance change when training with soft labels is not clear. In this paper, we investigate the bias-variance tradeoff brought by distillation with soft labels. Specifically, we observe that during training the bias-variance tradeoff varies sample-wise. Further, under the same distillation temperature setting, we observe that distillation performance is negatively associated with the number of certain samples, which we call regularization samples because they increase bias and decrease variance. Nevertheless, we empirically find that completely filtering out regularization samples also deteriorates distillation performance. These findings inspired us to propose novel weighted soft labels that help the network adaptively handle the sample-wise bias-variance tradeoff. Experiments on standard evaluation benchmarks validate the effectiveness of our method. Our code is available at https://github.com/bellymonster/Weighted-Soft-Label-Distillation.
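Research Motivation and Objectives
- Analyze soft labels in KD from a bias-variance perspective.
- Characterize how bias and variance evolve at the per-sample level during KD training.
- Identify the regularization samples that have a disproportionate effect on KD performance.
- Propose and validate weighted soft labels that adaptively manage the sample-wise bias-variance tradeoff during training.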
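Proposed Method
- Decompose the KD loss into bias and variance components using a KL-divergence-based analysis.
- Compare the bias-variance decomposition of direct training (cross-entropy) with that of the distillation (KD) loss.
- Show that regularization samples exist, for which variance reduction dominates while bias increases.
- Introduce weighted soft labels: a temperature-insensitive weighting scheme for soft labels, based on the teacher and student predictions.
- Train with L_ce combined with the weighted KD loss (L_wsl), balanced by a hyperparameter α.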
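The last two steps can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' implementation: the summary does not state the weighting formula, so the weight `1 - exp(-CE_student / CE_teacher)`, the additive combination `L_ce + α · L_wsl`, the `T^2` scaling, and all function names are assumptions made here. The weight is computed from temperature-1 cross-entropies, which is one way to make it independent of the distillation temperature.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """Cross-entropy against a hard label, computed at temperature 1."""
    return -math.log(softmax(logits)[label])

def kd_loss(student_logits, teacher_logits, temperature):
    """Soft-label term: cross-entropy between the teacher and student
    distributions at temperature T, scaled by T^2 (a common convention,
    assumed here)."""
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    return -(temperature ** 2) * sum(t * math.log(s) for t, s in zip(p_t, p_s))

def wsl_weight(student_logits, teacher_logits, label):
    """ASSUMED weight form: 1 - exp(-CE_student / CE_teacher).
    Both cross-entropies use temperature 1, so the weight does not
    depend on the distillation temperature."""
    ce_s = cross_entropy(student_logits, label)
    ce_t = cross_entropy(teacher_logits, label)
    return 1.0 - math.exp(-ce_s / max(ce_t, 1e-12))

def wsl_loss(student_logits, teacher_logits, label, alpha=1.0, temperature=4.0):
    """Per-sample objective L_ce + alpha * L_wsl (additive form assumed)."""
    ce_s = cross_entropy(student_logits, label)
    weight = wsl_weight(student_logits, teacher_logits, label)
    l_wsl = weight * kd_loss(student_logits, teacher_logits, temperature)
    return ce_s + alpha * l_wsl
```

Under this assumed form, a sample the student already fits much better than the teacher (small CE_student relative to CE_teacher) gets a weight near 0, so its soft label contributes little; a sample the student fits poorly gets a weight near 1. In an actual training loop the weight would normally be treated as a constant (no gradient flows through it), which this scalar sketch does not model.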
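Experimental Results
Research Questions
- RQ1: How do bias and variance evolve during training in knowledge distillation with soft labels?
- RQ2: At a fixed distillation temperature, what role do regularization samples play in KD performance?
- RQ3: Can a sample-wise weighting scheme mitigate the negative effect of regularization samples and improve KD performance?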
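To make RQ1 concrete, the sketch below estimates the bias and variance terms for a single sample from the predictions of several models (for example, models trained on different draws of the training set). It uses the standard bias-variance decomposition for KL divergence, in which the "mean" prediction is the normalized geometric mean of the individual predictions; whether the paper estimates the terms in exactly this way is not stated in this summary, so treat the procedure and the function names as assumptions.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mean_prediction(preds):
    """Normalized geometric mean of the predictions, i.e. the softmax of the
    averaged log-probabilities. Requires strictly positive probabilities."""
    n, k = len(preds), len(preds[0])
    log_mean = [sum(math.log(p[c]) for p in preds) / n for c in range(k)]
    m = max(log_mean)  # subtract the max for numerical stability
    unnorm = [math.exp(v - m) for v in log_mean]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def bias_variance(target, preds):
    """Return (bias, variance, expected_loss) for one sample.

    bias          = KL(target || mean prediction)
    variance      = average of KL(mean prediction || prediction)
    expected_loss = average of KL(target || prediction)
    The decomposition guarantees expected_loss == bias + variance.
    """
    y_bar = mean_prediction(preds)
    bias = kl(target, y_bar)
    variance = sum(kl(y_bar, p) for p in preds) / len(preds)
    expected_loss = sum(kl(target, p) for p in preds) / len(preds)
    return bias, variance, expected_loss

# Hypothetical predictions for one sample from three independently trained models.
target = [1.0, 0.0, 0.0]
preds = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2], [0.8, 0.1, 0.1]]
bias, variance, expected_loss = bias_variance(target, preds)
```

With a one-hot target, the KL divergence equals the cross-entropy, so the same decomposition applies to the L_ce term. Tracking `bias` and `variance` per sample across training checkpoints is one way to observe the sample-wise tradeoff the research questions refer to.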
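Key Findings
- Soft labels act as both a supervision signal and a regularizer, which leads to a sample-wise bias-variance tradeoff.
- At a fixed temperature, the number of samples in one subset (the regularization samples) is negatively associated with KD performance, because these samples increase bias while decreasing variance.
- Filtering out regularization samples entirely also degrades performance, which indicates that they carry information KD can exploit.
- A simple weighted soft-label scheme (L_wsl) mitigates the negative effect of regularization samples and improves KD performance.
- Experiments on CIFAR-100 and ImageNet, across multiple teacher-student pairs, show results competitive with or superior to state-of-the-art KD methods.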
This summary was generated by AI and reviewed by a human editor.