QUICK REVIEW

[Paper Review] Towards Understanding Label Smoothing

Yi Xu, Yuanhong Xu|arXiv (Cornell University)|Jun 20, 2020

Advanced Neural Network Applications|References: 55|Cited by: 28

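One-Sentence Summary

This paper proposes a training strategy called the Two-Stage LAbel smoothing algorithm (TSLA), which applies label smoothing regularization (LSR) in the early training epochs to reduce gradient variance and speed up convergence, then switches to standard one-hot labels in the later epochs. On ResNet models, TSLA converges faster and generalizes better, reaching the highest accuracy among the compared methods on the CIFAR-100 and ImageNet benchmarks; its effectiveness is supported by theoretical analysis and extensive experiments.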

ABSTRACT

Label smoothing regularization (LSR) has a great success in training deep neural networks by stochastic algorithms such as stochastic gradient descent and its variants. However, the theoretical understanding of its power from the view of optimization is still rare. This study opens the door to a deep understanding of LSR by initiating the analysis. In this paper, we analyze the convergence behaviors of stochastic gradient descent with label smoothing regularization for solving non-convex problems and show that an appropriate LSR can help to speed up the convergence by reducing the variance. More interestingly, we proposed a simple yet effective strategy, namely Two-Stage LAbel smoothing algorithm (TSLA), that uses LSR in the early training epochs and drops it off in the later training epochs. We observe from the improved convergence result of TSLA that it benefits from LSR in the first stage and essentially converges faster in the second stage. To the best of our knowledge, this is the first work for understanding the power of LSR via establishing convergence complexity of stochastic methods with LSR in non-convex optimization. We empirically demonstrate the effectiveness of the proposed method in comparison with baselines on training ResNet models over benchmark data sets.

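Research Motivation and Objectives

To understand theoretically how label smoothing regularization (LSR) improves optimization in deep learning.
To analyze the convergence behavior of stochastic gradient descent (SGD) with LSR in the non-convex setting.
To develop a practical training strategy that keeps the benefits of LSR while avoiding the harm it may cause late in training.
To show empirically that switching from smoothed labels to one-hot labels late in training improves generalization and speeds up convergence.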

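Proposed Method

Proposes the Two-Stage LAbel smoothing algorithm (TSLA), which applies LSR in the initial training stage and drops it in the later stage.
Uses the label smoothing transform y^LS = (1−θ)y + θŷ, where y is the one-hot label, θ sets the smoothing strength, and ŷ is either the uniform distribution or the probability distribution output by a pretrained model.
Analyzes the convergence of SGD with LSR, showing that an appropriate LSR can reduce gradient variance and improve iteration complexity.
Adopts a two-stage training protocol: train with LSR for the first s epochs, then switch to standard one-hot labels for the remaining epochs.
Uses a pretrained model to generate ŷ, which improves the smoothing, lowers variance, and raises performance.
Evaluates on ImageNet and CIFAR-100 with ResNet-18 and ResNet-50 under standard training protocols, using learning rate decay and weight decay.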
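The smoothing transform and the two-stage switch described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names `smooth_labels`, `tsla_targets`, and `cross_entropy` are hypothetical, epochs are assumed to be counted from 0, and θ = 0.1 in the usage note is an assumed value that this review does not specify.

```python
import math


def smooth_labels(y_onehot, theta, y_hat=None):
    """Return y^LS = (1 - theta) * y + theta * y_hat for one example.

    If y_hat is None, the uniform distribution over classes is used.
    y_hat may also be a probability vector from a pretrained model.
    """
    num_classes = len(y_onehot)
    if y_hat is None:
        y_hat = [1.0 / num_classes] * num_classes
    return [(1.0 - theta) * y + theta * q for y, q in zip(y_onehot, y_hat)]


def tsla_targets(y_onehot, epoch, switch_epoch, theta, y_hat=None):
    """Pick the training target for a given epoch (0-indexed, an assumption).

    Stage 1 (epoch < switch_epoch): smoothed labels.
    Stage 2 (epoch >= switch_epoch): plain one-hot labels.
    """
    if epoch < switch_epoch:
        return smooth_labels(y_onehot, theta, y_hat)
    return [float(v) for v in y_onehot]


def cross_entropy(targets, logits):
    """Cross-entropy between a target distribution and softmax(logits)."""
    shift = max(logits)  # subtract the max for numerical stability
    log_z = shift + math.log(sum(math.exp(z - shift) for z in logits))
    return -sum(t * (z - log_z) for t, z in zip(targets, logits))
```

With `y = [1, 0, 0, 0]`, `theta=0.1`, and `switch_epoch=160`, `tsla_targets(y, 0, 160, 0.1)` returns approximately `[0.925, 0.025, 0.025, 0.025]`, while any call with `epoch >= 160` returns the one-hot vector unchanged. In a real training loop the chosen target would be passed to the loss at every step; a framework loss that accepts class-probability targets could replace `cross_entropy`.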

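Experimental Results

Research Questions

RQ1: How does label smoothing regularization (LSR) affect the convergence of stochastic gradient descent (SGD) in non-convex optimization?
RQ2: Can LSR reduce gradient variance and thereby speed up convergence when training deep networks?
RQ3: Is there an optimal point during training at which to disable LSR to maximize performance?
RQ4: Does switching from smoothed labels to one-hot labels late in training improve generalization and convergence speed?
RQ5: How does the choice of smoothing distribution (uniform vs. pretrained model output) affect performance?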

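Key Findings

On CIFAR-100, TSLA with 160 epochs of LSR followed by one-hot labels reached 78.55% top-1 accuracy, outperforming all baseline methods.
TSLA-pre(160) reached 78.55% top-1 accuracy and 94.83% top-5 accuracy on CIFAR-100, the best result among all methods.
On ImageNet, TSLA(50) improved top-1 accuracy by 0.5% over standard LSR and by 0.7% over the baseline.
Theoretical analysis confirms that an appropriate LSR reduces gradient variance and thereby improves convergence complexity.
Switching from LSR to one-hot labels after 120 to 180 epochs consistently speeds up convergence and improves test accuracy.
Smoothing with pretrained model outputs (TSLA-pre) significantly outperforms uniform smoothing, confirming the importance of a low-variance label distribution.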
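To see where the smoothing distribution ŷ enters the gradient, it may help to write out the softmax cross-entropy case. The block below is a standard identity under the assumption that the loss is cross-entropy on softmax outputs p = softmax(z) and that y and ŷ each sum to 1; it is not the paper's convergence proof, and the symbols z, p, and ℓ are introduced here for illustration only.

```latex
% Cross-entropy is linear in the target, so the smoothed loss splits:
\ell(y^{LS}, p) = -\sum_{k} y^{LS}_k \log p_k
               = (1-\theta)\,\ell(y, p) + \theta\,\ell(\hat{y}, p)

% Gradient with respect to the logits z, where p = \mathrm{softmax}(z):
\nabla_z\, \ell(y^{LS}, p) = p - y^{LS}
                           = (1-\theta)\,(p - y) + \theta\,(p - \hat{y})
```

The gradient is thus a convex combination of the one-hot gradient and the gradient toward ŷ. Setting θ = 0 recovers the plain one-hot gradient used in TSLA's second stage.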


This review was generated by AI and reviewed by a human editor.