QUICK REVIEW

[Paper Review] On the Almost Sure Convergence of Stochastic Gradient Descent in Non-Convex Problems

Panayotis Mertikopoulos, Nadav Hallak|arXiv (Cornell University)|Jun 19, 2020

Stochastic Gradient Optimization Techniques | 38 references | 37 citations

TL;DR

The paper proves that SGD iterates converge almost surely on non-convex objectives under broad step-size schedules, shows that SGD avoids strict saddles with probability 1, and derives an O(1/n^p) convergence rate to Hurwicz-regular minima when the step-size is Θ(1/n^p). Experiments illustrate the resulting cool-down heuristic.

ABSTRACT

This paper analyzes the trajectories of stochastic gradient descent (SGD) to help understand the algorithm's convergence properties in non-convex problems. We first show that the sequence of iterates generated by SGD remains bounded and converges with probability $1$ under a very broad range of step-size schedules. Subsequently, going beyond existing positive probability guarantees, we show that SGD avoids strict saddle points/manifolds with probability $1$ for the entire spectrum of step-size policies considered. Finally, we prove that the algorithm's rate of convergence to Hurwicz minimizers is $\mathcal{O}(1/n^{p})$ if the method is employed with a $\Theta(1/n^{p})$ step-size schedule. This provides an important guideline for tuning the algorithm's step-size as it suggests that a cool-down phase with a vanishing step-size could lead to faster convergence; we demonstrate this heuristic using ResNet architectures on CIFAR.

Motivation & Objective

Establish almost sure convergence of SGD trajectories for non-convex objectives under broad step-size schedules.
Demonstrate that SGD avoids strict saddle points/manifolds with probability 1.
Characterize the rate of convergence to Hurwicz-regular local minima under vanishing step-sizes.
Provide practical insights into step-size tuning, including a cooldown strategy, supported by experiments.

Proposed method

Model SGD as a Robbins–Monro discretization of gradient flow and study it as an asymptotic pseudotrajectory (APT) of the gradient dynamics (GD).
Prove boundedness of SGD trajectories (precompactness) under mild regularity assumptions and a range of step-sizes γn = Θ(1/n^p).
Show almost sure convergence to a connected component of the critical set where f is constant by leveraging APT theory and Lyapunov properties.
Demonstrate almost-sure avoidance of strict saddle manifolds via a combination of probabilistic arguments and center-manifold analysis under a uniformly exciting noise assumption.
Derive a local convergence rate to Hurwicz-regular minimizers: E[||Xn − x*||^2] = O(1/n^p) for γn = Θ(1/n^p).
Support the theory with numerical experiments on a Shekel risk benchmark and on ResNet18 trained on CIFAR-10, illustrating the benefits of a cooldown phase.
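The Robbins–Monro recursion behind the steps above, X_{n+1} = X_n − γn (∇f(X_n) + noise), can be sketched on a toy problem. The objective f(x, y) = (x^2 − 1)^2/4 + y^2/2, the Gaussian noise model, the function names, and the parameters gamma0, p, sigma, and seed are all illustrative assumptions and are not taken from the paper. The run starts exactly at the strict saddle (0, 0), so any escape relies entirely on the gradient noise.

```python
import random


def grad_f(x, y):
    """Gradient of the toy objective f(x, y) = (x**2 - 1)**2 / 4 + y**2 / 2.

    Critical points: a strict saddle at (0, 0) and two minima at (+1, 0), (-1, 0).
    """
    return x**3 - x, y


def sgd(x0, y0, n_steps=20000, gamma0=0.2, p=0.75, sigma=0.5, seed=0):
    """Run SGD with the vanishing step-size gamma_n = gamma0 / n**p.

    All parameter values are hypothetical choices made for this sketch.
    """
    rng = random.Random(seed)
    x, y = x0, y0
    for n in range(1, n_steps + 1):
        gamma_n = gamma0 / n**p
        gx, gy = grad_f(x, y)
        # Noisy gradient: exact gradient plus zero-mean Gaussian noise.
        x -= gamma_n * (gx + sigma * rng.gauss(0.0, 1.0))
        y -= gamma_n * (gy + sigma * rng.gauss(0.0, 1.0))
    return x, y


# Start exactly on the strict saddle point.
x_final, y_final = sgd(0.0, 0.0)
print(x_final, y_final)
```

In this sketch the iterate should leave the saddle and settle near one of the two minima (±1, 0); which one it reaches depends on the seed.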

Experimental results

Research questions

RQ1: Does SGD converge almost surely for non-convex objectives under broad step-size policies?
RQ2: Does SGD avoid strict saddle points/manifolds with probability 1 under stochastic gradients?
RQ3: What is the rate at which SGD converges to Hurwicz-regular local minima when using a vanishing step-size γn = Θ(1/n^p)?
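RQ3 can be probed numerically with a small Monte Carlo experiment. The sketch below assumes that the local geometry near a Hurwicz-regular minimum can be approximated by the quadratic f(x) = x^2/2 with minimizer x* = 0; that stand-in, the Gaussian noise, and every name and parameter value are hypothetical choices, not the paper's setup. If the mean squared error decays like 1/n^p, then multiplying it by n^p should give roughly the same value at different iteration counts.

```python
import random


def mean_squared_error(checkpoints, n_trials=200, gamma0=0.5, p=0.75,
                       sigma=1.0, seed=0):
    """Estimate E[|X_n - x*|^2] at the given iteration counts.

    Runs SGD on f(x) = x**2 / 2 (so x* = 0) with gamma_n = gamma0 / n**p
    and averages the squared error over independent trials.
    """
    rng = random.Random(seed)
    n_max = max(checkpoints)
    totals = {n: 0.0 for n in checkpoints}
    for _ in range(n_trials):
        x = 1.0  # arbitrary starting point
        for n in range(1, n_max + 1):
            gamma_n = gamma0 / n**p
            noisy_grad = x + sigma * rng.gauss(0.0, 1.0)
            x -= gamma_n * noisy_grad
            if n in totals:
                totals[n] += x * x
    return {n: total / n_trials for n, total in totals.items()}


P = 0.75
mse = mean_squared_error([500, 2000], p=P)
# Rescale by n**p: an O(1/n^p) rate keeps these values of comparable size.
scaled = {n: err * n**P for n, err in mse.items()}
for n in sorted(scaled):
    print(n, mse[n], scaled[n])
```

The two rescaled values are Monte Carlo estimates, so they agree only up to sampling error; increasing n_trials tightens the comparison.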

Key findings

SGD trajectories converge almost surely to a connected component of the objective’s critical set where f is constant.
With γn = Θ(1/n^p), SGD converges to Hurwicz-regular local minima x* at the rate E[||Xn − x*||^2] = O(1/n^p).
SGD avoids strict saddle manifolds with probability 1 under the stated assumptions, including non-isolated saddles.
A boundedness certificate for SGD trajectories is established under mild assumptions, enabling the APT framework.
A practical cooldown heuristic (constant step-sizes initially, then vanishing step-sizes) can improve training performance, as demonstrated with ResNet18 on CIFAR-10.
The results extend prior saddle-avoidance and convergence guarantees by removing strict boundedness requirements and allowing a broad class of step-sizes.
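The cooldown heuristic listed above can be written as a simple step-size schedule. This is a minimal sketch: the function name cooldown_step_size, the switch point n_switch, and the values of gamma_const and p are hypothetical, and the exact schedule used in the paper's experiments is not given in this review.

```python
def cooldown_step_size(n, n_switch=1000, gamma_const=0.1, p=1.0):
    """Step-size at iteration n (0-indexed).

    Constant for the first n_switch iterations, then decaying
    proportionally to 1 / m**p, where m counts iterations since the switch.
    """
    if n < n_switch:
        return gamma_const
    # Counting from the switch keeps the schedule continuous at n == n_switch.
    return gamma_const / (n - n_switch + 1) ** p


schedule = [cooldown_step_size(n) for n in range(2000)]
print(schedule[0], schedule[1000], schedule[1003])
```

Restarting the count at the switch avoids a sudden drop in step-size when the cooldown begins; decaying in the global iteration index instead would be an equally valid design with a discontinuity at the switch.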

This review was created by AI and reviewed by human editors.