QUICK REVIEW
[Paper Review] Gradient Descent Converges to Minimizers
Jason D. Lee, Max Simchowitz|arXiv (Cornell University)|Feb 16, 2016
Stochastic Gradient Optimization Techniques | 28 references | 123 citations
TL;DR
Gradient descent with random initialization and small constant step size almost surely converges to a local minimizer, not a saddle point, for functions with the strict saddle property.
ABSTRACT
We show that gradient descent converges to a local minimizer, almost surely with random initialization. This is proved by applying the Stable Manifold Theorem from dynamical systems theory.
Motivation & Objective
- Address saddle points as an obstacle to non-convex optimization.
- Prove that gradient descent with random initialization avoids strict saddles under mild regularity.
- Show convergence to local minimizers rather than saddle points or infinity under a small step size.
- Link the analysis to invariant manifold theory and to the proximal point method, which is used to invert the gradient map.
Proposed method
- Model the gradient method as a discrete dynamical system with map g(x) = x - α∇f(x).
- Use the Jacobian Dg(x) = I - α∇²f(x) and the Stable Manifold Theorem to characterize local dynamics near critical points.
- Prove g is a diffeomorphism for α < 1/L and relate global behavior to the local stable set W^s_loc via g^{-k}.
- Apply the proximal point interpretation of the inverse gradient map to construct g^{-1} and show measure-zero stable sets for strict saddles.
- Derive implications for convergence by connecting local geometry with global iterates and using Łojasiewicz-type inequalities for rates.
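The dynamical-systems view above can be tried numerically. The following is a minimal sketch, not the paper's code: the toy function f(x, y) = x²/2 + y⁴/4 − y²/2, the step size `ALPHA = 0.1`, the starting box [-1, 1]² and all function names are assumptions chosen for illustration. This f has a strict saddle at (0, 0) and local minimizers at (0, ±1). Its gradient is not globally Lipschitz, but on the region |y| ≤ 1 the Hessian eigenvalues lie in [-1, 2], so L = 2 there and 0.1 < 1/L.

```python
import random

ALPHA = 0.1  # assumed step size; satisfies ALPHA < 1/L with L = 2 on |y| <= 1


def grad_f(x, y):
    """Gradient of the toy function f(x, y) = x**2/2 + y**4/4 - y**2/2."""
    return x, y**3 - y


def g(x, y, alpha=ALPHA):
    """One gradient descent step: the map g(p) = p - alpha * grad f(p)."""
    gx, gy = grad_f(x, y)
    return x - alpha * gx, y - alpha * gy


def iterate(x, y, steps=2000):
    """Apply the map g repeatedly and return the final iterate."""
    for _ in range(steps):
        x, y = g(x, y)
    return x, y


def run_random_starts(trials=100, seed=0):
    """Run gradient descent from uniformly random points in [-1, 1]^2."""
    rng = random.Random(seed)
    results = []
    for _ in range(trials):
        x0 = rng.uniform(-1.0, 1.0)
        y0 = rng.uniform(-1.0, 1.0)
        results.append(iterate(x0, y0))
    return results


# Random starts: every run should end near one of the minimizers (0, +1) or (0, -1).
limits = run_random_starts()

# A start on the line y = 0 (the stable set of the saddle for this f)
# stays on that line and converges to the saddle (0, 0) instead.
on_stable = iterate(0.7, 0.0)

if __name__ == "__main__":
    print("distinct limits:", sorted({(round(x, 6), round(y, 6)) for x, y in limits}))
    print("start on y = 0 ends at:", on_stable)
```

For this example the stable set of the saddle is the line y = 0, which has measure zero in the plane, so a random start lands on it with probability zero. A finite simulation only illustrates the statement; it does not prove it.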
Experimental results
Research questions
- RQ1: Do gradient descent iterates converge to saddle points under random initialization?
- RQ2: Under the strict saddle property, do gradient methods avoid saddles and converge to local minima with constant step size?
- RQ3: What role does the step size (α < 1/L) play in ensuring convergence to minimizers?
- RQ4: Can the proximal point interpretation extend the result to other descent-like algorithms?
Key findings
- Gradient descent with a random start and 0 < α < 1/L almost surely avoids strict saddle points.
- The global stable set of a strict saddle has measure zero, implying convergence to local minima or divergence to infinity is almost sure under random initialization.
- If the iterates are bounded, they converge to a local minimizer rather than saddle points under the given conditions.
- The result extends to the proximal point algorithm because its update map is also a diffeomorphism, with the inverse given by a gradient ascent step on f (equivalently, a gradient descent step on -f).
- Corollaries show that if saddle points are countable or isolated, the probability of converging to any saddle is zero, and with limit existence, convergence to a local minimizer is almost sure.
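The role of the step size and the link between gradient descent and the proximal point method can be checked from first-order optimality conditions. The derivation below is a sketch under the assumptions that f is twice continuously differentiable, ∇f is L-Lipschitz, and 0 < α < 1/L. The names g_GD and g_prox are notation introduced here for illustration.

```latex
\begin{align*}
&\textbf{Jacobian of the gradient map.}\\
&g_{\mathrm{GD}}(x) = x - \alpha \nabla f(x), \qquad
  Dg_{\mathrm{GD}}(x) = I - \alpha \nabla^2 f(x).\\
&\text{If } \lambda_i \in [-L, L] \text{ are the eigenvalues of } \nabla^2 f(x),
  \text{ then } Dg_{\mathrm{GD}}(x) \text{ has eigenvalues}\\
&1 - \alpha \lambda_i \in [\,1 - \alpha L,\; 1 + \alpha L\,] \subset (0, 2),
  \text{ so } \det Dg_{\mathrm{GD}}(x) \neq 0.\\
&\text{At a strict saddle } x^*,\ \lambda_{\min}\big(\nabla^2 f(x^*)\big) < 0
  \;\Longrightarrow\; 1 - \alpha \lambda_{\min} > 1
  \quad \text{(an expanding direction).}\\[6pt]
&\textbf{Inverse of the gradient map.}\\
&y = g_{\mathrm{GD}}(x)
  \iff (x - y) - \alpha \nabla f(x) = 0
  \iff x = \arg\min_{z}\ \tfrac{1}{2}\lVert z - y \rVert^2 - \alpha f(z),\\
&\text{where the objective is strongly convex because }
  I - \alpha \nabla^2 f(z) \succeq (1 - \alpha L)\, I \succ 0.\\[6pt]
&\textbf{Inverse of the proximal point map.}\\
&g_{\mathrm{prox}}(x) = \arg\min_{z}\ f(z) + \tfrac{1}{2\alpha}\lVert z - x \rVert^2,
  \qquad z = g_{\mathrm{prox}}(x)
  \iff \nabla f(z) + \tfrac{1}{\alpha}(z - x) = 0\\
&\iff x = z + \alpha \nabla f(z)
  \quad \text{(a gradient ascent step on } f \text{).}
\end{align*}
```

Read this way, the two algorithms are inverse to each other up to the sign of f: inverting a gradient descent step on f is a proximal point step on -f, and inverting a proximal point step on f is a gradient ascent step on f.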
This review was created by AI and reviewed by human editors.