QUICK REVIEW

[Paper Digest] Towards Explaining the Regularization Effect of Initial Large Learning Rate in Training Neural Networks

Yuanzhi Li, Colin Wei|arXiv (Cornell University)|Jul 10, 2019

Stochastic Gradient Optimization Techniques | References: 40 | Cited by: 124

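ONE-SENTENCE SUMMARY

By analyzing learning order in a two-pattern setting and validating the analysis with patch experiments on CIFAR-10, the paper gives a theoretical and empirical explanation of why a large initial learning rate followed by annealing generalizes better than a small learning rate used from the start.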

ABSTRACT

Stochastic gradient descent with a large initial learning rate is widely used for training modern neural net architectures. Although a small initial learning rate allows for faster training and better test performance initially, the large learning rate achieves better generalization soon after the learning rate is annealed. Towards explaining this phenomenon, we devise a setting in which we can prove that a two layer network trained with large initial learning rate and annealing provably generalizes better than the same network trained with a small learning rate from the start. The key insight in our analysis is that the order of learning different types of patterns is crucial: because the small learning rate model first memorizes easy-to-generalize, hard-to-fit patterns, it generalizes worse on hard-to-generalize, easier-to-fit patterns than its large learning rate counterpart. This concept translates to a larger-scale setting: we demonstrate that one can add a small patch to CIFAR-10 images that is immediately memorizable by a model with small initial learning rate, but ignored by the model with large learning rate until after annealing. Our experiments show that this causes the small learning rate model's accuracy on unmodified images to suffer, as it relies too much on the patch early on.

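RESEARCH MOTIVATION AND GOALS

Understand why a large initial learning rate (LR) followed by annealing generalizes better than a small LR used from the start.
Propose a simple two-pattern data distribution for studying learning-order effects in two-layer networks.
Show, through theoretical results and experiments consistent with observations in practice, that learning order affects generalization.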

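PROPOSED METHOD

Define a two-layer ReLU network with a specific decomposition of U, so that the two data components are handled separately (P: easy to generalize, hard to fit; Q: easy to fit, hard to generalize).
Construct a distribution with two pattern types, in which fixed fractions p and q of the samples contain each type.
Analyze the learning dynamics under SGD with spherical Gaussian noise and a two-phase learning rate schedule (large LR first, then annealing).
Derive informal theorems that compare large LR with annealing against small LR, in terms of the order in which patterns are learned and the resulting generalization.
Decompose the network output into a component g_t(x) on the Q patterns and a component r_t(x) on the P patterns to track learning progress.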
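
The training procedure in this section (a two-layer ReLU network, SGD with spherical Gaussian noise, and a two-phase learning rate) can be sketched roughly as below. This is a minimal NumPy illustration under stated assumptions, not the paper's construction: the names `lr_schedule`, `forward`, and `noisy_sgd_step`, the squared loss, the fixed second layer `a`, the choice to scale the noise by the learning rate, and all hyperparameter values are hypothetical.

```python
import numpy as np

def lr_schedule(step, anneal_step, lr_large=0.1, lr_small=0.001):
    """Two-phase schedule: large LR before `anneal_step`, small LR from then on.
    The default values are illustrative assumptions."""
    return lr_large if step < anneal_step else lr_small

def forward(W, a, x):
    """Two-layer ReLU network f(x) = a^T relu(W x); returns (output, hidden)."""
    h = np.maximum(W @ x, 0.0)
    return a @ h, h

def noisy_sgd_step(W, a, x, y, lr, noise_std, rng):
    """One SGD step on the squared loss 0.5 * (f(x) - y)^2 with respect to W,
    with isotropic (spherical) Gaussian noise added to the gradient.
    Assumption: the noise enters through the update and is scaled by lr."""
    pred, h = forward(W, a, x)
    err = pred - y
    # d f / d W[j, :] = a[j] * 1[h[j] > 0] * x
    grad_W = err * np.outer(a * (h > 0), x)
    noise = rng.normal(0.0, noise_std, size=W.shape)
    return W - lr * (grad_W + noise)
```

A training loop would call `lr_schedule(step, anneal_step)` at every step and pass the result to `noisy_sgd_step`. The small-LR baseline corresponds to setting `anneal_step=0`, so the small rate is used throughout.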

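EXPERIMENTAL RESULTS

Research Questions

RQ1: For a two-pattern data distribution, does a large initial learning rate followed by annealing generalize better than a small LR used from the start?
RQ2: How does the order in which the network learns the different pattern types affect final generalization?
RQ3: Can the learning-order phenomenon be observed in a practical setting through controlled experiments, such as a memorizable patch on CIFAR-10?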
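
To make the controlled experiment in RQ3 concrete, the sketch below overwrites a small corner patch on a random subset of images. It only illustrates the idea of a memorizable patch and is not the paper's exact protocol: the function name `add_class_patch`, the patch location, the label-dependent constant intensity, and the parameters `patch_frac` and `patch_size` are all assumptions.

```python
import numpy as np

def add_class_patch(images, labels, patch_frac, patch_size, num_classes, rng):
    """Overwrite the top-left corner of a random subset of images with a
    label-dependent constant patch.

    images: float array of shape (N, H, W, C); labels: int array of shape (N,).
    Returns (patched_copy, mask), where mask marks the patched images.
    """
    images = images.copy()  # leave the caller's array untouched
    n = images.shape[0]
    mask = rng.random(n) < patch_frac
    # Hypothetical patch: intensity is a deterministic function of the label,
    # so a model could memorize it as a shortcut to the class.
    values = (labels[mask] + 1) / num_classes
    images[mask, :patch_size, :patch_size, :] = values[:, None, None, None]
    return images, mask
```

Evaluating on the unmodified images then shows how much a trained model relies on the patch rather than on the image content.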

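Key Findings

On the constructed dataset, a two-layer network trained with a large initial LR followed by annealing first learns the hard-to-generalize, easy-to-fit patterns and then the easy-to-generalize, hard-to-fit patterns, finishing only after annealing.
A small initial LR quickly learns the easy-to-generalize, hard-to-fit patterns and overfits to them, so after training it generalizes worse on the hard-to-generalize patterns.
The final test error of the large-LR-then-anneal approach is smaller than that of the small-LR approach, by a margin that depends on p (roughly O(p) in the analysis).
The paper gives a lower bound showing that the small-LR approach can reach a better training loss yet a worse test error, because of its memorization bias on some pattern components.
A mitigation strategy inspired by the analysis, which adds noise before the activations and reduces it step by step at selected epochs, achieves guarantees comparable to large LR and improves robustness.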
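
The mitigation in the last finding, noise added before the activations and reduced at selected epochs, might look roughly like the sketch below. The names `noise_std_schedule` and `noisy_relu_layer`, the multiplicative decay rule, and the values of `std0`, `decay`, and the milestone epochs are illustrative assumptions; the summary does not specify them.

```python
import numpy as np

def noise_std_schedule(epoch, milestones, std0=1.0, decay=0.1):
    """Noise standard deviation at `epoch`: start at `std0` and multiply by
    `decay` once for every milestone epoch already reached (assumed rule)."""
    n_passed = sum(epoch >= m for m in milestones)
    return std0 * decay ** n_passed

def noisy_relu_layer(W, x, noise_std, rng):
    """Hidden layer with Gaussian noise added to the pre-activations W x,
    before the ReLU is applied."""
    pre = W @ x
    pre = pre + rng.normal(0.0, noise_std, size=pre.shape)
    return np.maximum(pre, 0.0)
```

During training, `noise_std_schedule(epoch, milestones)` would be evaluated once per epoch and passed to `noisy_relu_layer`; at test time the noise would typically be switched off by passing `noise_std=0.0`.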

This digest was generated by AI and reviewed by human editors.