QUICK REVIEW

[Paper Review] When Does Label Smoothing Help?

Rafael Rios Müller, Simon Kornblith|arXiv (Cornell University)|Jun 6, 2019
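Time Series Analysis and Forecasting | References: 17 | Cited by: 884
One-Sentence Summary

This paper analyzes how label smoothing affects generalization, calibration, and knowledge distillation. It finds that label smoothing can improve calibration and generalization but may hurt distillation, because information in the logits is erased.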

ABSTRACT

The generalization and learning speed of a multi-class neural network can often be significantly improved by using soft targets that are a weighted average of the hard targets and the uniform distribution over labels. Smoothing the labels in this way prevents the network from becoming over-confident and label smoothing has been used in many state-of-the-art models, including image classification, language translation and speech recognition. Despite its widespread use, label smoothing is still poorly understood. Here we show empirically that in addition to improving generalization, label smoothing improves model calibration which can significantly improve beam-search. However, we also observe that if a teacher network is trained with label smoothing, knowledge distillation into a student network is much less effective. To explain these observations, we visualize how label smoothing changes the representations learned by the penultimate layer of the network. We show that label smoothing encourages the representations of training examples from the same class to group in tight clusters. This results in loss of information in the logits about resemblances between instances of different classes, which is necessary for distillation, but does not hurt generalization or calibration of the model's predictions.
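
The soft targets described in the abstract, a weighted average of the hard (one-hot) targets and the uniform distribution over labels, can be sketched in plain Python. This is a minimal illustration rather than the authors' code: the function names, the weight name `alpha`, and its default value of 0.1 are assumptions chosen for the example.

```python
import math


def smooth_labels(label, num_classes, alpha=0.1):
    """Soft targets: (1 - alpha) * one_hot(label) + alpha * uniform.

    `alpha` is a hypothetical name for the smoothing weight; alpha=0
    recovers the ordinary hard (one-hot) targets.
    """
    if not 0 <= label < num_classes:
        raise ValueError("label must be in [0, num_classes)")
    targets = [alpha / num_classes] * num_classes
    targets[label] += 1.0 - alpha
    return targets


def cross_entropy(logits, targets):
    """Cross-entropy between a target distribution and softmax(logits)."""
    # Log-sum-exp with the maximum subtracted for numerical stability.
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return -sum(t * (z - log_z) for t, z in zip(targets, logits))


soft = smooth_labels(2, num_classes=4, alpha=0.1)
hard = smooth_labels(2, num_classes=4, alpha=0.0)

confident = [0.0, 0.0, 10.0, 0.0]
overconfident = [0.0, 0.0, 100.0, 0.0]

# Hard targets reward an ever larger logit gap; smoothed targets penalize it.
print(cross_entropy(confident, hard), cross_entropy(overconfident, hard))
print(cross_entropy(confident, soft), cross_entropy(overconfident, soft))
```

With hard targets the loss keeps falling as the correct-class logit grows, whereas with smoothed targets an extreme logit gap increases the loss. That is one way to see why smoothing discourages over-confident predictions.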

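Motivation and Objectives

  • Investigate why and under what conditions label smoothing improves neural network performance.
  • Characterize how label smoothing changes penultimate-layer representations.
  • Evaluate the effect of label smoothing on model calibration across tasks.
  • Examine how label smoothing affects knowledge distillation and information transfer.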

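Proposed Method

  • Propose a method for visualizing penultimate-layer activations via projection.
  • Quantify calibration using expected calibration error (ECE) and reliability diagrams.
  • Evaluate calibration and accuracy on image classification and translation tasks, with and without label smoothing.
  • Analyze the effect of label smoothing on knowledge distillation using a teacher-student setup.
  • Estimate the mutual information between the input and the logits to study how much information is retained under label smoothing.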
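
The calibration metric named above, expected calibration error, can be sketched as follows. This is a minimal version of the common equal-width-bin definition, not necessarily the binning the authors used: the function name, the default of 10 bins, and the half-open bin edges are assumptions made for illustration.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """Equal-width-bin ECE.

    confidences: predicted probability of the chosen class, each in [0, 1].
    correct:     1 (or True) if the prediction was right, else 0 (or False).

    Assumption: bins are [k/num_bins, (k+1)/num_bins), with a confidence
    of exactly 1.0 placed in the last bin.
    """
    if len(confidences) != len(correct) or len(confidences) == 0:
        raise ValueError("inputs must be non-empty and of equal length")

    counts = [0] * num_bins
    conf_sums = [0.0] * num_bins
    hit_sums = [0.0] * num_bins
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * num_bins), num_bins - 1)
        counts[idx] += 1
        conf_sums[idx] += conf
        hit_sums[idx] += float(hit)

    total = len(confidences)
    ece = 0.0
    for count, conf_sum, hit_sum in zip(counts, conf_sums, hit_sums):
        if count == 0:
            continue  # empty bins contribute nothing
        accuracy = hit_sum / count
        mean_confidence = conf_sum / count
        # Weight each bin's |accuracy - confidence| gap by its share of samples.
        ece += (count / total) * abs(accuracy - mean_confidence)
    return ece


# An over-confident model: 90% confidence, but only 3 of 4 predictions correct.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0]))
```

A reliability diagram plots the same per-bin accuracy against per-bin mean confidence; ECE condenses that plot into a single number, with 0 meaning confidence matches accuracy in every bin.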

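Experimental Results

Research Questions

  • RQ1: Does label smoothing improve model calibration, and thereby affect downstream procedures such as beam search?
  • RQ2: How does label smoothing reshape penultimate-layer representations?
  • RQ3: Why does label smoothing weaken knowledge distillation even though it improves the teacher's accuracy?
  • RQ4: How are label smoothing, mutual information, and information compression in the network related?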

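Key Findings

  • Label smoothing improves calibration and can reduce overconfident predictions.
  • Label smoothing causes penultimate-layer activations to form tighter, equally spaced clusters, indicating that information about similarities between classes is erased.
  • On translation tasks, label smoothing improves BLEU and calibration, but yields worse NLL than hard targets.
  • Distillation from a teacher trained with label smoothing can be worse than distillation from a teacher trained with hard targets, because logit information is lost.
  • The mutual information between the input and the logit difference decreases under label smoothing, indicating information erasure in the representation.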
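
The distillation finding concerns what a student can learn from a teacher's softened outputs. Below is a minimal sketch of a temperature-scaled distillation loss in the style of standard knowledge distillation. This summary does not state the paper's exact loss, temperature, or weighting, so the function names and the default `temperature=2.0` are illustrative assumptions.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    log_z = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return [s - log_z for s in scaled]


def softmax(logits, temperature=1.0):
    """Softmax probabilities at the given temperature."""
    return [math.exp(lp) for lp in log_softmax(logits, temperature)]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between softened teacher and student distributions.

    Assumption: both networks are softened with the same temperature and
    no hard-label term is mixed in.
    """
    teacher_probs = softmax(teacher_logits, temperature)
    student_log_probs = log_softmax(student_logits, temperature)
    return -sum(p * lp for p, lp in zip(teacher_probs, student_log_probs))


teacher = [2.0, 1.0, 0.1]  # hypothetical teacher logits for one example
matching_student = [2.0, 1.0, 0.1]
mismatched_student = [0.1, 1.0, 2.0]

# The loss is smallest when the student reproduces the teacher's distribution.
print(distillation_loss(matching_student, teacher))
print(distillation_loss(mismatched_student, teacher))
```

The student's only training signal here is the teacher's relative probabilities across classes, including the incorrect ones. If label smoothing erases those differences in the teacher's logits, as the findings above report, the student has less to learn from.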

This review was generated by AI and reviewed by human editors.