QUICK REVIEW

[Paper Review] On the Almost Sure Convergence of Stochastic Gradient Descent in Non-Convex Problems

Panayotis Mertikopoulos, Nadav Hallak|arXiv (Cornell University)|Jun 19, 2020

Stochastic Gradient Optimization Techniques | References: 38 | Cited by: 37

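One-Sentence Summary

This paper proves that SGD on non-convex objectives converges almost surely under a broad class of step-size schedules, that SGD avoids strict saddle points with probability 1, and that it converges to Hurwicz-regular local minimizers at a rate of 1/n^p; experiments support these results.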

ABSTRACT

This paper analyzes the trajectories of stochastic gradient descent (SGD) to help understand the algorithm's convergence properties in non-convex problems. We first show that the sequence of iterates generated by SGD remains bounded and converges with probability $1$ under a very broad range of step-size schedules. Subsequently, going beyond existing positive probability guarantees, we show that SGD avoids strict saddle points/manifolds with probability $1$ for the entire spectrum of step-size policies considered. Finally, we prove that the algorithm's rate of convergence to Hurwicz minimizers is $\mathcal{O}(1/n^{p})$ if the method is employed with a $\Theta(1/n^p)$ step-size schedule. This provides an important guideline for tuning the algorithm's step-size as it suggests that a cool-down phase with a vanishing step-size could lead to faster convergence; we demonstrate this heuristic using ResNet architectures on CIFAR.

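Motivation and Objectives

Establish almost sure convergence of SGD trajectories on non-convex objectives under a broad class of step-size schedules.
Prove that SGD avoids strict saddle points/manifolds with probability 1.
Characterize the rate of convergence to Hurwicz-regular local minimizers under vanishing step sizes.
Offer practical guidance on step-size tuning, including a cool-down strategy, supported by experiments.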

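Proposed Method

Model SGD as a Robbins–Monro discretization of the gradient flow and study it as an asymptotic pseudotrajectory (APT) of the gradient dynamics (GD).
Under mild smoothness assumptions and a family of step sizes γn = Θ(1/n^p), prove that SGD trajectories are bounded (precompact).
Using APT theory and Lyapunov properties, prove that SGD converges almost surely to a connected component of the critical set on which f is constant.
Under a uniformly exciting noise assumption, combine probabilistic arguments with center manifold analysis to prove that strict saddle manifolds are avoided almost surely.
Derive a local convergence rate to regular Hurwicz minimizers: when γn = Θ(1/n^p), E[||Xn − x*||^2] = O(1/n^p).
Illustrate the benefit of a cool-down strategy through numerical experiments on the Shekel risk benchmark and on ResNet18 trained on CIFAR-10.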
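
To make the setup concrete, here is a minimal sketch of SGD with a vanishing step size γn = γ/n^p, started exactly at a strict saddle point. It is an illustration only and not code from the paper: the toy objective f(x, y) = (x^2 − 1)^2 + y^2, the Gaussian gradient noise, the names `grad_f` and `sgd`, and all constants are assumptions chosen for this example.

```python
import random


def grad_f(x, y):
    # Toy objective (an assumption, not from the paper):
    #   f(x, y) = (x^2 - 1)^2 + y^2
    # Minimizers at (+1, 0) and (-1, 0); strict saddle at (0, 0),
    # where the Hessian is diag(-4, 2).
    return 4.0 * x * (x * x - 1.0), 2.0 * y


def sgd(x0, y0, n_steps=20000, gamma=0.1, p=0.75, noise=0.5, seed=0):
    """Run SGD with step size gamma / n**p and additive Gaussian gradient noise."""
    rng = random.Random(seed)
    x, y = x0, y0
    for n in range(1, n_steps + 1):
        step = gamma / n ** p  # vanishing step size, Theta(1/n^p)
        gx, gy = grad_f(x, y)
        # Noisy gradient: true gradient plus zero-mean Gaussian perturbation.
        x -= step * (gx + noise * rng.gauss(0.0, 1.0))
        y -= step * (gy + noise * rng.gauss(0.0, 1.0))
    return x, y


if __name__ == "__main__":
    # Start exactly at the strict saddle; noise pushes the iterate away from it.
    print(sgd(0.0, 0.0))
```

In this toy run the exact gradient vanishes at the starting point, so deterministic gradient descent would never move. The noisy iterate instead leaves the saddle and settles near one of the two minimizers, which is the qualitative behavior the avoidance result describes.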

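Experimental Results

Research Questions

RQ1: Under broad step-size policies, does SGD converge almost surely on non-convex objectives?
RQ2: With stochastic gradients, does SGD avoid strict saddle points/manifolds with probability 1?
RQ3: With a vanishing step size γn = Θ(1/n^p), at what rate does SGD converge to Hurwicz-regular local minimizers?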
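
For reference, the objects these questions refer to can be written out as follows. This is a notational sketch: the noise symbol $U_n$ is an assumption introduced here, and the precise conditions on the noise, on $f$, and on the exponent $p$ are those of the paper and are not reproduced.

```latex
\begin{align*}
  X_{n+1} &= X_n - \gamma_n \bigl(\nabla f(X_n) + U_n\bigr)
    && \text{(SGD with gradient noise } U_n\text{)} \\
  \dot{x}(t) &= -\nabla f\bigl(x(t)\bigr)
    && \text{(gradient flow that SGD discretizes)} \\
  \gamma_n = \Theta(1/n^{p})
    &\;\Longrightarrow\;
    \mathbb{E}\bigl[\lVert X_n - x^* \rVert^2\bigr] = \mathcal{O}(1/n^{p})
    && \text{(local rate near a Hurwicz minimizer } x^*\text{)}
\end{align*}
```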

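Key Findings

SGD trajectories converge almost surely to a connected component of the objective's critical set on which f is constant.
When γn = Θ(1/n^p), SGD converges to Hurwicz-regular local minimizers with E[||Xn − x*||^2] = O(1/n^p).
Under the stated assumptions, SGD avoids strict saddle manifolds, including non-isolated saddle points, with probability 1.
Boundedness of SGD trajectories is proved under mild assumptions, which enables the APT framework.
A practical cool-down heuristic (an initial constant step size followed by a gradually vanishing one) can improve training performance, as demonstrated with ResNet on CIFAR.
The results extend earlier saddle-avoidance and convergence guarantees by dropping strict boundedness requirements and allowing broad step-size schedules.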
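
The cool-down heuristic can be sketched as a step-size schedule that stays constant for an initial phase and then decays like 1/n^p. The functional form below is an assumption made for illustration; the summary only says the step size is constant at first and then vanishes. The name `cooldown_step` and its parameters are hypothetical.

```python
def cooldown_step(n, gamma0=0.1, n_cool=100, p=0.75):
    """Step size at iteration n (1-indexed) for a hypothetical cool-down schedule.

    Constant at gamma0 for n <= n_cool, then gamma0 * (n_cool / n)**p, which is
    continuous at n = n_cool and decays as Theta(1/n^p) afterwards.
    """
    if n < 1:
        raise ValueError("n must be a positive iteration index")
    if n <= n_cool:
        return gamma0
    return gamma0 * (n_cool / n) ** p


if __name__ == "__main__":
    for n in (1, 100, 400, 1600):
        print(n, cooldown_step(n, gamma0=0.1, n_cool=100, p=0.5))
```

Scaling by `n_cool / n` instead of restarting the counter keeps the schedule free of a jump at the switch point. Other parameterizations with the same Θ(1/n^p) tail would serve equally well for this illustration.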

This review was generated by AI and reviewed by a human editor.