QUICK REVIEW

[Paper Review] The large learning rate phase of deep learning: the catapult mechanism

Aitor Lewkowycz, Yasaman Bahri|arXiv (Cornell University)|Mar 4, 2020

Stochastic Gradient Optimization Techniques | References: 37 | Cited by: 60

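One-sentence summary

The paper identifies three learning-rate phases of gradient descent (lazy, catapult, and divergent), presents a solvable finite-width model in which catapult dynamics lead to flatter minima, and confirms these predictions empirically in practical deep networks, where the best performance is often found in the large-learning-rate catapult phase.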

ABSTRACT

The choice of initial learning rate can have a profound effect on the performance of deep networks. We present a class of neural networks with solvable training dynamics, and confirm their predictions empirically in practical deep learning settings. The networks exhibit sharply distinct behaviors at small and large learning rates. The two regimes are separated by a phase transition. In the small learning rate phase, training can be understood using the existing theory of infinitely wide neural networks. At large learning rates the model captures qualitatively distinct phenomena, including the convergence of gradient descent dynamics to flatter minima. One key prediction of our model is a narrow range of large, stable learning rates. We find good agreement between our model's predictions and training dynamics in realistic deep learning settings. Furthermore, we find that the optimal performance in such settings is often found in the large learning rate phase. We believe our results shed light on characteristics of models trained at different learning rates. In particular, they fill a gap between existing wide neural network theory, and the nonlinear, large learning rate, training dynamics relevant to practice.

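Research motivation and goals

Motivate and characterize how the initial learning rate profoundly affects the training and generalization of deep networks.
Introduce an analytically tractable finite-width model that predicts three distinct learning-rate phases.
Bridge theory and practice by verifying the phase predictions in realistic deep networks across architectures.
Show that optimal performance is often reached in the large-learning-rate (catapult) phase.
Separate the effects on flatness and generalization from SGD noise, focusing on effects driven by the learning rate itself.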

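Proposed method

Derive the exact gradient descent updates for a linear network with one hidden layer of large but finite width, trained with mean squared error loss.
Use the top eigenvalue of the neural tangent kernel (NTK) as a proxy for curvature to identify and analyze the three learning-rate regimes.
Extend the analysis to the full model with d-dimensional inputs and m training samples, deriving analogous update dynamics.
Run experiments on fully connected, convolutional, and residual networks to test the phase predictions.
Estimate the practical maximum learning rate using a constant c_act that depends on the network architecture, in particular its nonlinearity; in the experiments c_act is about 12 for ReLU.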
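
As a rough illustration of the curvature proxy and the learning-rate thresholds listed above, the sketch below computes the top eigenvalue of the empirical NTK for a one-hidden-layer linear network f(x) = v . (U x) / sqrt(width) and derives 2/lambda_0 and c_act/lambda_0 from it. The function names, the NumPy implementation, and the normalization of the kernel by the number of samples (matching a mean squared error loss averaged over samples) are assumptions made for this example, not code from the paper. Boundary cases (eta exactly equal to a threshold) are assigned arbitrarily, since the digest only states strict inequalities.

```python
import numpy as np


def top_ntk_eigenvalue(U, v, X):
    """Top eigenvalue of the empirical NTK of f(x) = v . (U x) / sqrt(width).

    U: (width, d) first-layer weights, v: (width,) second-layer weights,
    X: (n_samples, d) inputs. The kernel is divided by n_samples (assumption:
    the loss is averaged over samples).
    """
    width = U.shape[0]
    n_samples = X.shape[0]
    hidden = X @ U.T  # shape (n_samples, width)
    # Gradients w.r.t. v contribute hidden @ hidden.T;
    # gradients w.r.t. U contribute |v|^2 * X @ X.T.
    kernel = (hidden @ hidden.T + (v @ v) * (X @ X.T)) / width
    # eigvalsh returns eigenvalues in ascending order.
    return np.linalg.eigvalsh(kernel / n_samples)[-1]


def learning_rate_thresholds(lam0, c_act=4.0):
    """Return (2 / lam0, c_act / lam0): the lazy/catapult and catapult/divergent boundaries."""
    return 2.0 / lam0, c_act / lam0


def classify_phase(eta, lam0, c_act=4.0):
    """Label a learning rate as 'lazy', 'catapult', or 'divergent'."""
    eta_crit, eta_max = learning_rate_thresholds(lam0, c_act)
    if eta < eta_crit:
        return "lazy"
    if eta < eta_max:
        return "catapult"
    return "divergent"


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    U = rng.standard_normal((256, 4))
    v = rng.standard_normal(256)
    X = rng.standard_normal((8, 4))
    lam0 = top_ntk_eigenvalue(U, v, X)
    print("lambda_0:", lam0)
    print("thresholds (c_act=4):", learning_rate_thresholds(lam0))
    print("thresholds (c_act=12):", learning_rate_thresholds(lam0, c_act=12.0))
```

Passing c_act=12.0 mimics the empirical ReLU value quoted above; for real networks lambda_0 would have to be measured on the actual model rather than with this closed-form linear kernel.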

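Experimental results

Research questions

RQ1: In wide but finite networks, what are the dynamical phases of gradient descent at different initial learning rates?
RQ2: How does the learning rate affect the kernel curvature during training, in particular the top eigenvalue of the NTK?
RQ3: Can large learning rates remain stable and converge to flatter minima, and how does this affect generalization?
RQ4: Do the theoretical phase predictions hold for realistic architectures and SGD settings?
RQ5: What is the empirical relationship between architecture, nonlinearity, and the maximum stable learning rate?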

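Key findings

There are three learning-rate phases: lazy (eta < 2/lambda_0), catapult (2/lambda_0 < eta < eta_max), and divergent (eta > eta_max).
In the catapult phase, the loss initially rises while the curvature drops rapidly; training then converges to a flatter minimum than in the lazy phase.
The maximum stable learning rate is approximately eta_max = c_act/lambda_0, where c_act depends on the nonlinearity (about 4 in theory, about 12 in practice for ReLU).
Empirical results on convolutional, residual, and fully connected networks agree with the predicted phase boundaries, with peak performance in the catapult phase.
The best performance is often found in the large-learning-rate catapult phase, consistently across architectures and training budgets.
After the catapult phase, the model behaves like a linear model: the kernel stays nearly constant, indicating that linearized dynamics are restored.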
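
The three phases described above can be reproduced in a toy setting. The sketch below runs plain gradient descent on a one-hidden-layer linear network with a single training sample (input 1, target 0), so that f = v . u / sqrt(width) and the loss is f^2 / 2; the curvature is tracked as lambda = (|u|^2 + |v|^2) / width. The single-sample setup, the width of 512, the divergence cutoff of 1e12, and all names are assumptions chosen for illustration. For this model the updates work out to f <- f (1 - eta lambda + eta^2 f^2 / width) and lambda <- lambda + (eta f^2 / width)(eta lambda - 4), so the theoretical divergence threshold is eta lambda_0 = 4.

```python
import numpy as np


def run_gd(eta_times_lam0, width=512, steps=500, seed=0):
    """Gradient descent on L = f^2 / 2 with f = v . u / sqrt(width).

    eta_times_lam0 sets the learning rate relative to the initial curvature.
    Returns (lam0, final_lam, losses).
    """
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(width)
    v = rng.standard_normal(width)
    lam0 = (u @ u + v @ v) / width  # initial curvature, close to 2
    eta = eta_times_lam0 / lam0
    losses = []
    for _ in range(steps):
        f = (v @ u) / np.sqrt(width)
        losses.append(0.5 * f * f)
        if losses[-1] > 1e12:  # treat as diverged and stop early
            break
        grad_u = f * v / np.sqrt(width)
        grad_v = f * u / np.sqrt(width)
        u, v = u - eta * grad_u, v - eta * grad_v
    final_lam = (u @ u + v @ v) / width
    return lam0, final_lam, np.array(losses)


if __name__ == "__main__":
    for scale in (1.0, 3.0, 5.0):  # lazy, catapult, divergent
        lam0, lam, losses = run_gd(scale)
        print(
            f"eta*lam0={scale}: lam0={lam0:.3f}, final lam={lam:.3f}, "
            f"max loss={losses.max():.3g}, last loss={losses[-1]:.3g}"
        )
```

With eta lambda_0 = 1 the curvature should barely move; with eta lambda_0 = 3 the loss should first grow and then collapse once lambda has dropped below 2/eta; with eta lambda_0 = 5 the run should hit the divergence cutoff.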

This review was generated by AI and reviewed by a human editor.