QUICK REVIEW

[Paper Review] Gradient Descent Converges to Minimizers

Jason D. Lee, Max Simchowitz | arXiv (Cornell University) | Feb 16, 2016

Stochastic Gradient Optimization Techniques | 28 references | 123 citations

TL;DR

Gradient descent with random initialization and small constant step size almost surely converges to a local minimizer, not a saddle point, for functions with the strict saddle property.

ABSTRACT

We show that gradient descent converges to a local minimizer, almost surely with random initialization. This is proved by applying the Stable Manifold Theorem from dynamical systems theory.

Motivation & Objective

Address saddle points as a central obstacle for gradient methods in non-convex optimization.
Prove that gradient descent with random initialization avoids strict saddles under mild regularity.
Show convergence to local minimizers rather than saddle points or infinity under a small step size.
Link the analysis to invariant manifold theory and proximal point inversion.

Proposed method

Model the gradient method as a discrete dynamical system with map g(x) = x - α∇f(x).
Use the Jacobian Dg(x) = I - α∇²f(x) and the Stable Manifold Theorem to characterize local dynamics near critical points.
Prove g is a diffeomorphism for α < 1/L and relate global behavior to the local stable set W^s_loc via g^{-k}.
Apply the proximal point interpretation of the inverse gradient map to construct g^{-1} and show measure-zero stable sets for strict saddles.
Derive implications for convergence by connecting local geometry with global iterates and using Łojasiewicz-type inequalities for rates.
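
The dynamical-system view above can be tried numerically. The following is a minimal sketch, not code from the paper. The test function f(x, y) = x^2/2 + y^4/4 - y^2/2, the step size 0.1, the seed, and the helper names grad_f, g, and run_gd are all illustrative assumptions. This f has a strict saddle at the origin and local minimizers at (0, 1) and (0, -1). Its gradient is not globally Lipschitz, but on the square [-1, 1]^2 the Hessian eigenvalues lie in [-1, 2], and with step size 0.1 the iterates stay inside that square, so α < 1/L holds along the trajectory with L = 2.

```python
import numpy as np

def grad_f(p):
    # Gradient of the illustrative function f(x, y) = x**2/2 + y**4/4 - y**2/2.
    x, y = p
    return np.array([x, y**3 - y])

def g(p, alpha):
    # One gradient descent step, viewed as the map g(x) = x - alpha * grad f(x).
    return p - alpha * grad_f(p)

def run_gd(p0, alpha=0.1, steps=2000):
    # Iterate the map g starting from p0 and return the final point.
    p = np.asarray(p0, dtype=float)
    for _ in range(steps):
        p = g(p, alpha)
    return p

# Random start in [-1, 1]^2: lands on the line y = 0 with probability zero.
rng = np.random.default_rng(0)
p_random = run_gd(rng.uniform(-1.0, 1.0, size=2))

# Start exactly on the stable set of the saddle (the line y = 0).
p_stable = run_gd([0.7, 0.0])

print("random start ->", p_random)
print("start on y = 0 ->", p_stable)
```

Only the hand-picked start on the line y = 0, a set of measure zero, is drawn into the saddle at the origin. The random start ends at one of the two minimizers.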

Experimental results

Research questions

RQ1: Do gradient descent iterates converge to saddle points under random initialization?
RQ2: Under the strict saddle property, do gradient methods avoid saddles and converge to local minima with constant step size?
RQ3: What role does the step size (α < 1/L) play in ensuring convergence to minimizers?
RQ4: Can the proximal point interpretation extend the result to other descent-like algorithms?
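
For the step-size question, the role of the bound α < 1/L can be seen from the eigenvalues of the Jacobian of g. The following is a short sketch, assuming f is twice continuously differentiable with an L-Lipschitz gradient; the symbols λ_i, x^*, u, and v are notation introduced here for illustration.

```latex
% Eigenvalues of the Hessian are bounded by the Lipschitz constant:
\lambda_i\bigl(\nabla^2 f(x)\bigr) \in [-L,\, L].

% Hence, for 0 < \alpha < 1/L, every eigenvalue of the Jacobian is positive:
\lambda_i\bigl(Dg(x)\bigr) = 1 - \alpha\,\lambda_i\bigl(\nabla^2 f(x)\bigr)
  \in [\,1 - \alpha L,\; 1 + \alpha L\,] \subset (0,\, 2),
\qquad\text{so}\qquad \det Dg(x) \neq 0.

% Injectivity: if g(u) = g(v), then
u - v = \alpha\bigl(\nabla f(u) - \nabla f(v)\bigr)
\;\Longrightarrow\;
\|u - v\| \le \alpha L\,\|u - v\| < \|u - v\| \ \text{ unless } u = v.

% At a strict saddle x^*, \lambda_{\min}(\nabla^2 f(x^*)) < 0, which gives an expanding direction:
1 - \alpha\,\lambda_{\min}\bigl(\nabla^2 f(x^*)\bigr) > 1.
```

The first three lines are what make g invertible with an invertible Jacobian, and the last line is what makes the saddle unstable in at least one direction.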

Key findings

Gradient descent with a random start and 0 < α < 1/L almost surely avoids strict saddle points.
The global stable set of a strict saddle has measure zero, implying convergence to local minima or divergence to infinity is almost sure under random initialization.
If the iterates are bounded, they converge to a local minimizer rather than saddle points under the given conditions.
The result extends to the proximal point algorithm because its update map is a diffeomorphism whose inverse is a gradient ascent step on f (equivalently, a gradient descent step on -f).
Corollaries show that if saddle points are countable or isolated, the probability of converging to any saddle is zero, and with limit existence, convergence to a local minimizer is almost sure.
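
The relation between the proximal point step and its inverse can be checked numerically. The following is a minimal sketch under stated assumptions, not code from the paper. The test function f(x, y) = x^2/2 - cos(y) (Hessian diag(1, cos y), so L = 1), the step size 0.5, the sample point, and the helper names prox_step and prox_inverse are illustrative choices. With αL < 1 the proximal objective is strongly convex, so the step z is characterized by z = x - α∇f(z), and that equation can be solved by a contracting fixed-point iteration.

```python
import numpy as np

def grad_f(p):
    # Gradient of the illustrative function f(x, y) = x**2/2 - cos(y).
    x, y = p
    return np.array([x, np.sin(y)])

def prox_step(x, alpha, iters=200):
    # Proximal point step: solve z = x - alpha * grad_f(z) by fixed-point
    # iteration, which contracts with factor alpha * L < 1.
    x = np.asarray(x, dtype=float)
    z = x.copy()
    for _ in range(iters):
        z = x - alpha * grad_f(z)
    return z

def prox_inverse(z, alpha):
    # Candidate inverse of the proximal map: a gradient ascent step on f.
    return z + alpha * grad_f(z)

alpha = 0.5
x = np.array([0.8, 2.5])
z = prox_step(x, alpha)
x_back = prox_inverse(z, alpha)

print("proximal step:", z)
print("recovered start:", x_back)
```

Applying the ascent step to the proximal iterate recovers the starting point up to floating-point error, which is the inverse relation the finding relies on.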

This review was created by AI and reviewed by human editors.