[Paper Review] On the Almost Sure Convergence of Stochastic Gradient Descent in Non-Convex Problems
The paper proves almost sure convergence of SGD for non-convex objectives under broad step-size schedules, shows that SGD avoids strict saddles with probability 1, and derives an O(1/n^p) convergence rate to Hurwicz-regular minima under Θ(1/n^p) step-sizes, supported by experiments.
This paper analyzes the trajectories of stochastic gradient descent (SGD) to help understand the algorithm's convergence properties in non-convex problems. We first show that the sequence of iterates generated by SGD remains bounded and converges with probability $1$ under a very broad range of step-size schedules. Subsequently, going beyond existing positive probability guarantees, we show that SGD avoids strict saddle points/manifolds with probability $1$ for the entire spectrum of step-size policies considered. Finally, we prove that the algorithm's rate of convergence to Hurwicz minimizers is $\mathcal{O}(1/n^{p})$ if the method is employed with a $\Theta(1/n^{p})$ step-size schedule. This provides an important guideline for tuning the algorithm's step-size as it suggests that a cool-down phase with a vanishing step-size could lead to faster convergence; we demonstrate this heuristic using ResNet architectures on CIFAR.
Motivation & Objective
- Establish almost sure convergence of SGD trajectories for non-convex objectives under broad step-size schedules.
- Demonstrate that SGD avoids strict saddle points/manifolds with probability 1.
- Characterize the rate of convergence to Hurwicz-regular local minima under vanishing step-sizes.
- Provide practical insights into step-size tuning, including a cooldown strategy, supported by experiments.
Proposed method
- Model SGD as a Robbins–Monro discretization of gradient flow and study it as an asymptotic pseudotrajectory (APT) of the gradient dynamics (GD).
- Prove boundedness of SGD trajectories (precompactness) under mild regularity assumptions and a range of step-sizes γn = Θ(1/n^p).
- Show almost sure convergence to a connected component of the critical set where f is constant by leveraging APT theory and Lyapunov properties.
- Demonstrate almost-sure avoidance of strict saddle manifolds via a combination of probabilistic arguments and center-manifold analysis under a uniformly exciting noise assumption.
- Derive a local convergence rate to Hurwicz-regular minimizers: E[||Xn − x*||^2] = O(1/n^p) for γn = Θ(1/n^p).
- Support the theory with numerical experiments on a Shekel risk benchmark and on ResNet18 trained on CIFAR-10, illustrating the benefits of a cooldown phase.
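The step-size schedule and the rate claim in the list above can be illustrated with a toy simulation. The following is a minimal sketch, not the paper's experimental setup: it assumes a one-dimensional quadratic f(x) = x^2 / 2 (minimizer x* = 0), additive Gaussian gradient noise, and the schedule γn = gamma0 / n^p. The function names, the test objective, and the choice p = 0.75 are all hypothetical, chosen only for illustration.

```python
import random


def sgd_quadratic(p, n_steps, gamma0=1.0, noise_std=1.0, x0=5.0, seed=0):
    """Run SGD on f(x) = x^2 / 2 with step-size gamma0 / n^p; return the last iterate."""
    rng = random.Random(seed)
    x = x0
    for n in range(1, n_steps + 1):
        # Stochastic gradient: exact gradient f'(x) = x plus zero-mean Gaussian noise.
        noisy_grad = x + noise_std * rng.gauss(0.0, 1.0)
        x -= (gamma0 / n**p) * noisy_grad
    return x


def mean_squared_error(p, n_steps, n_runs=200):
    """Monte Carlo estimate of E[|X_n - x*|^2] with x* = 0, averaged over seeded runs."""
    return sum(sgd_quadratic(p, n_steps, seed=s) ** 2 for s in range(n_runs)) / n_runs


if __name__ == "__main__":
    # The estimated error should shrink as n grows, roughly in line with 1/n^p.
    for n_steps in (20, 200, 2000):
        print(n_steps, mean_squared_error(p=0.75, n_steps=n_steps))
```

On this toy problem the estimated squared error decreases as the horizon grows, which is consistent with (but of course does not prove) the O(1/n^p) rate stated for Hurwicz-regular minimizers.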
Experimental results
Research questions
- RQ1: Does SGD converge almost surely for non-convex objectives under broad step-size policies?
- RQ2: Does SGD avoid strict saddle points/manifolds with probability 1 under stochastic gradients?
- RQ3: What is the rate at which SGD converges to Hurwicz-regular local minima when using a vanishing step-size γn = Θ(1/n^p)?
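To make RQ2 concrete, here is a small sketch of saddle avoidance on a toy objective. It assumes the hypothetical function f(x, y) = x^2/2 + y^4/4 − y^2/2, which has a strict saddle at (0, 0) and minima at (0, ±1), together with isotropic Gaussian gradient noise (one simple instance of noise that excites every direction). None of these choices come from the paper; they are illustrative assumptions only.

```python
import random


def run_from_saddle(n_steps=5000, gamma0=0.5, p=0.75, noise_std=0.5, seed=0):
    """Start exactly at the strict saddle (0, 0) of f(x, y) = x^2/2 + y^4/4 - y^2/2.

    Returns the final iterate (x, y) after n_steps with step-size gamma0 / n^p.
    """
    rng = random.Random(seed)
    x, y = 0.0, 0.0
    for n in range(1, n_steps + 1):
        gamma = gamma0 / n**p
        # Exact gradient is (x, y^3 - y); add independent Gaussian noise per coordinate.
        gx = x + noise_std * rng.gauss(0.0, 1.0)
        gy = (y**3 - y) + noise_std * rng.gauss(0.0, 1.0)
        x -= gamma * gx
        y -= gamma * gy
    return x, y


if __name__ == "__main__":
    # Noise-free gradient descent never leaves the saddle: the gradient there is zero.
    print("noise-free:", run_from_saddle(noise_std=0.0))
    # With noise, the iterates are pushed along the unstable y-direction.
    for seed in range(5):
        print("seed", seed, run_from_saddle(seed=seed))
```

In this sketch, the noise-free run stays at the saddle, while the noisy runs drift along the unstable direction and settle near one of the two minima. This only illustrates the phenomenon in a single example; the probability-1 guarantee is the paper's theoretical result.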
Key findings
- SGD trajectories converge almost surely to a connected component of the objective’s critical set where f is constant.
- With γn = Θ(1/n^p), SGD converges to Hurwicz-regular local minima at the rate E[||Xn − x*||^2] = O(1/n^p).
- SGD avoids strict saddle manifolds with probability 1 under the stated assumptions, including non-isolated saddles.
- A boundedness certificate for SGD trajectories is established under mild assumptions, enabling the APT framework.
- A practical cooldown heuristic (initial constant steps, then vanishing step-sizes) can improve training performance, demonstrated on ResNet/CIFAR.
- The results extend prior saddle-avoidance and convergence guarantees by removing strict boundedness requirements and allowing a broad class of step-sizes.
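The cooldown heuristic from the findings above (constant steps first, then vanishing step-sizes) can be sketched as a simple schedule function. This is a hedged illustration under assumed settings, not the paper's training recipe: the switch point `n_switch`, the base step `gamma0`, the exponent `p = 0.75`, and the noisy quadratic test problem f(x) = x^2 / 2 are all hypothetical choices.

```python
import random


def cooldown_step_size(n, gamma0=0.1, n_switch=500, p=0.75):
    """Constant step gamma0 up to n_switch, then gamma0 / (n - n_switch)^p."""
    if n <= n_switch:
        return gamma0
    return gamma0 / (n - n_switch) ** p


def constant_step_size(n, gamma0=0.1):
    """Baseline schedule: the same constant step at every iteration."""
    return gamma0


def final_squared_error(step_size, n_steps=2000, noise_std=1.0, x0=5.0, seed=0):
    """Run SGD on f(x) = x^2 / 2 with noisy gradients; return |X_n - x*|^2 for x* = 0."""
    rng = random.Random(seed)
    x = x0
    for n in range(1, n_steps + 1):
        noisy_grad = x + noise_std * rng.gauss(0.0, 1.0)
        x -= step_size(n) * noisy_grad
    return x * x


def average_error(step_size, n_runs=200):
    """Average the final squared error over independently seeded runs."""
    return sum(final_squared_error(step_size, seed=s) for s in range(n_runs)) / n_runs


if __name__ == "__main__":
    print("constant:", average_error(constant_step_size))
    print("cooldown:", average_error(cooldown_step_size))
```

With a constant step the iterates keep fluctuating at a noise floor set by the step-size, whereas the vanishing phase lets that floor shrink. On this toy problem the cooldown schedule ends with a smaller average error, mirroring the qualitative behavior the review reports for ResNet on CIFAR.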
This review was created by AI and reviewed by human editors.