QUICK REVIEW

[Paper Review] How Does Learning Rate Decay Help Modern Neural Networks?

Kaichao You, Mingsheng Long | arXiv (Cornell University) | Aug 5, 2019

Neural Networks and Applications | References: 42 | Cited by: 161

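One-Sentence Summary

The paper argues that learning rate decay helps modern neural networks in two ways: the initially large learning rate suppresses memorization of noisy data, and the subsequent decay improves the learning of complex patterns. It validates this pattern-complexity view with controlled experiments and a transferability analysis on real-world datasets.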

ABSTRACT

Learning rate decay (lrDecay) is a de facto technique for training modern neural networks. It starts with a large learning rate and then decays it multiple times. It is empirically observed to help both optimization and generalization. Common beliefs in how lrDecay works come from the optimization analysis of (Stochastic) Gradient Descent: 1) an initially large learning rate accelerates training or helps the network escape spurious local minima; 2) decaying the learning rate helps the network converge to a local minimum and avoid oscillation. Despite the popularity of these common beliefs, experiments suggest that they are insufficient in explaining the general effectiveness of lrDecay in training modern neural networks that are deep, wide, and nonconvex. We provide another novel explanation: an initially large learning rate suppresses the network from memorizing noisy data while decaying the learning rate improves the learning of complex patterns. The proposed explanation is validated on a carefully-constructed dataset with tractable pattern complexity. And its implication, that additional patterns learned in later stages of lrDecay are more complex and thus less transferable, is justified in real-world datasets. We believe that this alternative explanation will shed light into the design of better training strategies for modern neural networks.
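
To make the schedule in the abstract concrete, here is a minimal sketch of a step-decay schedule ("start large, then decay multiple times"), together with a toy gradient descent run on a one-dimensional quadratic. The toy only illustrates common belief 2 (decay removes oscillation), which the paper argues is insufficient on its own. The function names, the quadratic objective, and every numeric value (base_lr, milestones, gamma, curvature) are assumptions chosen for illustration; none of them come from the paper.

```python
from bisect import bisect_right


def step_decay_lr(step, base_lr, milestones, gamma):
    """Learning rate at `step`: base_lr multiplied by gamma once per milestone passed.

    `milestones` must be sorted in increasing order (illustrative values only).
    """
    num_decays = bisect_right(milestones, step)
    return base_lr * gamma ** num_decays


def gd_on_quadratic(x0, curvature, num_steps, base_lr, milestones, gamma):
    """Run gradient descent on f(x) = 0.5 * curvature * x**2 and return all iterates."""
    xs = [x0]
    x = x0
    for step in range(num_steps):
        lr = step_decay_lr(step, base_lr, milestones, gamma)
        grad = curvature * x  # derivative of 0.5 * curvature * x**2
        x = x - lr * grad
        xs.append(x)
    return xs


if __name__ == "__main__":
    # Hypothetical settings: lr 1.9 for steps 0-9, then lr 0.19 from step 10 on.
    iterates = gd_on_quadratic(
        x0=1.0, curvature=1.0, num_steps=20,
        base_lr=1.9, milestones=[10], gamma=0.1,
    )
    for step, x in enumerate(iterates):
        print(step, x)
```

With these assumed values, each update multiplies x by (1 - lr * curvature). Before the milestone that factor is negative, so the iterate flips sign at every step; after the decay the factor is positive and the iterate shrinks toward zero without changing sign.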

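Motivation and Objectives

Challenge the common explanations of why lrDecay works in deep networks.
Propose a pattern-complexity view of the effect of lrDecay.
Validate the proposed view with controlled experiments on a tractable dataset.
Test its implication for the transferability of learned patterns across datasets.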

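Proposed Method

Compare the GD/SGD explanations of lrDecay against empirical results from a WideResNet trained on CIFAR-10.
Construct the Pattern Separation 10 (PS10) dataset to separate simple patterns from complex patterns.
Define pattern complexity as an expected conditional entropy, and measure how simple and complex patterns are learned under lrDecay.
Use transfer learning experiments to evaluate how well patterns learned in later stages transfer to target datasets.
Analyze Hessian eigenvalues to reason about the learning dynamics and how they relate to decay.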
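
The method list describes pattern complexity as an expected conditional entropy. This summary does not state which variables the paper conditions on, so the sketch below only shows the generic quantity H(Y | X) for a discrete joint distribution. The function name and the toy distributions are assumptions for illustration and are not the paper's PS10 construction.

```python
import math


def conditional_entropy(joint):
    """Return H(Y | X) in bits for a joint distribution given as {(x, y): probability}."""
    # Marginal p(x), obtained by summing the joint over y.
    p_x = {}
    for (x, _), p in joint.items():
        p_x[x] = p_x.get(x, 0.0) + p

    # H(Y | X) = -sum over (x, y) of p(x, y) * log2(p(x, y) / p(x)).
    entropy = 0.0
    for (x, _), p in joint.items():
        if p > 0:
            entropy -= p * math.log2(p / p_x[x])
    return entropy


if __name__ == "__main__":
    # Toy case 1: y is fully determined by x, so nothing is left uncertain.
    deterministic = {(0, "a"): 0.5, (1, "b"): 0.5}
    # Toy case 2: y is a fair coin independent of x, so one full bit remains.
    independent = {(0, "a"): 0.25, (0, "b"): 0.25, (1, "a"): 0.25, (1, "b"): 0.25}
    print(conditional_entropy(deterministic))
    print(conditional_entropy(independent))
```

In this generic reading, a lower conditional entropy means the output is easier to predict from the input; whether that matches the paper's notion of a simple pattern should be checked against the paper's own definition.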

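Experimental Results

Research Questions

RQ1: Do the GD/SGD explanations fully account for the benefits of lrDecay in modern networks?
RQ2: Does lrDecay mainly help with learning complex patterns, and does the initially large learning rate suppress memorization of noisy data?
RQ3: Can the pattern-complexity framework explain the observed training and transferability phenomena?
RQ4: How does lrDecay affect the transferability to target tasks of patterns learned in different training stages?
RQ5: On real-world datasets, do patterns learned in later lrDecay stages show reduced transferability?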

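Key Findings

lrDecay improves the learning of complex patterns; its benefit is not limited to convergence or to avoiding local minima.
The initially large learning rate helps avoid memorizing noisy data, acting as a regularizer.
A tractable dataset, PS10, shows that simple patterns are learned first, while decay strengthens the learning of complex patterns.
On real-world datasets, patterns learned in later lrDecay stages transfer less well across tasks.
In the transferability analysis from ImageNet to target datasets, patterns learned later are progressively less transferable, which supports the complexity-based view.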

This review was generated by AI and reviewed by human editors.