QUICK REVIEW

[Paper Review] A Walk with SGD

Xing Chen, Devansh Arpit|arXiv (Cornell University)|Feb 24, 2018

Stochastic Gradient Optimization Techniques | References: 38 | Cited by: 49

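One-Sentence Summary

This paper shows that, in over-parameterized deep neural networks, SGD navigates the loss surface by 'bouncing' between valley walls at a height above the valley floor, a mechanism supported by large learning rates and small batch sizes. The mechanism lets SGD traverse the loss surface efficiently, avoid local barriers, and reach flatter regions that generalize better more quickly.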

ABSTRACT

We present novel empirical observations regarding how stochastic gradient descent (SGD) navigates the loss landscape of over-parametrized deep neural networks (DNNs). These observations expose the qualitatively different roles of learning rate and batch-size in DNN optimization and generalization. Specifically we study the DNN loss surface along the trajectory of SGD by interpolating the loss surface between parameters from consecutive iterations and tracking various metrics during training. We find that the loss interpolation between parameters before and after each training iteration's update is roughly convex with a minimum (valley floor) in between for most of the training. Based on this and other metrics, we deduce that for most of the training update steps, SGD moves in valley like regions of the loss surface by jumping from one valley wall to another at a height above the valley floor. This 'bouncing between walls at a height' mechanism helps SGD traverse larger distance for small batch sizes and large learning rates which we find play qualitatively different roles in the dynamics. While a large learning rate maintains a large height from the valley floor, a small batch size injects noise facilitating exploration. We find this mechanism is crucial for generalization because the valley floor has barriers and this exploration above the valley floor allows SGD to quickly travel far away from the initialization point (without being affected by barriers) and find flatter regions, corresponding to better generalization.

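Research Motivation and Goals

Understand the dynamics of how SGD navigates the loss surface of over-parameterized deep neural networks.
Study the separate roles that learning rate and batch size play in optimization and generalization.
Show how SGD avoids local minima and barriers by staying above the valley floor during training.
Explain why the bouncing mechanism reaches flatter regions of the loss surface more efficiently.
Provide empirical evidence on generalization in deep learning, revealing a mechanism that goes beyond standard optimization theory.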

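Proposed Method

Interpolate the loss surface between the model parameters of consecutive SGD training iterations to visualize the trajectory.
Track metrics such as loss, distance from the initialization point, and curvature along the interpolation path.
Analyze the shape of the interpolated loss path and identify roughly convex regions with a minimum (the valley floor) between parameter updates.
Characterize the height of SGD updates relative to the valley floor to quantify the 'bouncing' behavior.
Relate learning rate and batch size to the bounce height and to the extent of exploration above the valley floor.
Infer from the empirical observations that exploring above the valley floor enables faster traversal of the surface and effective escape from local barriers.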
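
The interpolation step described above can be illustrated with a minimal sketch. This is not the authors' code: the quadratic `toy_loss`, its `curvatures`, the `learning_rate` value, and the helper names `interpolate_loss` and `has_interior_minimum` are all assumptions made for illustration, with a two-parameter toy valley standing in for a DNN loss. The sketch evaluates the loss on the straight line between the parameters before and after one gradient step and checks whether the lowest point lies strictly between the two endpoints, which is the signature of a step that crossed the valley floor.

```python
import numpy as np

def interpolate_loss(loss_fn, theta_before, theta_after, num_points=11):
    """Evaluate loss_fn on the straight line from theta_before to theta_after."""
    alphas = np.linspace(0.0, 1.0, num_points)
    losses = np.array([
        loss_fn((1.0 - a) * theta_before + a * theta_after) for a in alphas
    ])
    return alphas, losses

def has_interior_minimum(losses):
    """True if the smallest interpolated loss lies strictly between the endpoints."""
    k = int(np.argmin(losses))
    return 0 < k < len(losses) - 1

# Hypothetical toy valley (an assumption, not a DNN): steep along axis 0,
# shallow along axis 1.
curvatures = np.array([10.0, 0.1])

def toy_loss(theta):
    return 0.5 * float(np.sum(curvatures * theta ** 2))

def toy_grad(theta):
    return curvatures * theta

theta_t = np.array([1.0, 1.0])
learning_rate = 0.15  # assumed value, large enough to overshoot the steep direction
theta_next = theta_t - learning_rate * toy_grad(theta_t)

alphas, losses = interpolate_loss(toy_loss, theta_t, theta_next)
crossed_floor = has_interior_minimum(losses)
```

With these assumed values the step lands on the opposite wall, so `crossed_floor` is true. Repeating the step with a smaller learning rate such as 0.05 keeps both endpoints on the same wall, and the interpolated loss then decreases monotonically with no interior minimum.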

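Experimental Results

Research Questions

RQ1: How does SGD navigate the loss surface when training over-parameterized DNNs?
RQ2: What functional roles do learning rate and batch size each play in shaping the SGD trajectory?
RQ3: Why does SGD generalize well rather than converging to sharp minima?
RQ4: How does the loss surface behave between consecutive SGD updates, and what does this reveal about the optimization dynamics?
RQ5: To what extent does operating above the valley floor let SGD escape local barriers and find flatter regions that generalize better?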

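Key Findings

The loss interpolated between consecutive SGD parameters is roughly convex for most of training, with a minimum (the valley floor) in between, indicating a valley-like loss surface.
SGD moves by 'bouncing' from one valley wall to another at a height above the valley floor rather than traveling along the floor.
A larger learning rate maintains a greater height above the valley floor, giving larger effective steps and faster traversal of the loss surface.
A smaller batch size injects noise that promotes exploration above the valley floor and helps SGD escape local barriers.
The bouncing mechanism lets SGD move quickly away from the initialization point without being blocked by barriers on the valley floor.
Exploring above the valley floor lets SGD quickly locate flatter regions of the loss surface, which correspond to better generalization.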
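
A one-dimensional worked example may help make the link between learning rate and wall-to-wall bouncing concrete. It is a toy model chosen for illustration, not a result from the paper: it assumes a quadratic valley with constant curvature $h$ and a noise-free gradient step with learning rate $\eta$.

```latex
\begin{aligned}
L(\theta) &= \tfrac{1}{2}\, h\, \theta^{2}, \qquad h > 0, \\
\theta_{t+1} &= \theta_t - \eta\, L'(\theta_t) = (1 - \eta h)\, \theta_t, \\
L(\theta_{t+1}) &= (1 - \eta h)^{2}\, L(\theta_t).
\end{aligned}
```

Under these assumptions the update changes the sign of $\theta$, so the step jumps to the opposite wall and the interpolated loss has a minimum in between, exactly when $\eta h > 1$. For $1 < \eta h < 2$ the height after the step is $(1 - \eta h)^2$ times the height before it, so the closer $\eta h$ is to 2, the more of the height is retained. In a real network the curvature changes along the trajectory and mini-batch noise is present, so this only sketches the intuition.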


This review was generated by AI and reviewed by human editors.