QUICK REVIEW

[Paper Review] Gradient Descent Converges to Minimizers

Jason D. Lee, Max Simchowitz|arXiv (Cornell University)|Feb 16, 2016
Stochastic Gradient Optimization Techniques | 28 references | 123 citations
TL;DR

Gradient descent with random initialization and small constant step size almost surely converges to a local minimizer, not a saddle point, for functions with the strict saddle property.
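
For reference, the strict saddle condition mentioned above can be written out as follows. This is a sketch of the standard definition, assuming f is twice continuously differentiable; the symbols x^* and \lambda_{\min} are notation introduced here, not taken from the review.

```latex
% A critical point x^* is a strict saddle if the Hessian has a negative eigenvalue:
\nabla f(x^*) = 0, \qquad \lambda_{\min}\!\left(\nabla^2 f(x^*)\right) < 0 .
% f has the strict saddle property if every critical point of f
% is either a local minimizer or a strict saddle.
```

Under this definition a local maximizer with a negative Hessian eigenvalue also counts as a strict saddle, so the condition excludes only degenerate critical points whose Hessian has zero eigenvalues but no negative ones.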

ABSTRACT

We show that gradient descent converges to a local minimizer, almost surely with random initialization. This is proved by applying the Stable Manifold Theorem from dynamical systems theory.

Motivation & Objective

  • Address saddle points as an obstacle to gradient-based non-convex optimization.
  • Prove that gradient descent with random initialization avoids strict saddles under mild regularity.
  • Show that, with a small constant step size, the iterates converge to local minimizers rather than saddle points whenever their limit exists.
  • Link the analysis to invariant manifold theory and to the proximal point interpretation of the inverse gradient map.

Proposed method

  • Model the gradient method as a discrete dynamical system with map g(x) = x - α∇f(x).
  • Use the Jacobian Dg(x) = I - α∇²f(x) and the Stable Manifold Theorem to characterize local dynamics near critical points.
  • Prove that g is a diffeomorphism for α < 1/L, where L is the Lipschitz constant of ∇f, and relate global behavior to the local stable set W^s_loc through the preimages g^{-k}(W^s_loc).
  • Apply the proximal point interpretation of the inverse gradient map to construct g^{-1} and show measure-zero stable sets for strict saddles.
  • Derive implications for convergence by connecting local geometry with global iterates and using Lojasiewicz-type inequalities for rates.
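
The steps above can be illustrated numerically. The following is a minimal sketch, not the authors' code: it assumes the two-dimensional test function f(x, y) = x²/2 + y⁴/4 − y²/2, which has a strict saddle at (0, 0) and local minimizers at (0, ±1), and a hypothetical step size ALPHA = 0.1. The helper names grad_f, g, jacobian_eigs, and iterate are made up for this illustration.

```python
import random

ALPHA = 0.1  # assumed step size, small relative to the curvature near the critical points


def grad_f(x, y):
    """Gradient of the assumed test function f(x, y) = x**2/2 + y**4/4 - y**2/2."""
    return x, y**3 - y


def g(x, y, alpha=ALPHA):
    """One gradient descent step, g(z) = z - alpha * grad f(z)."""
    gx, gy = grad_f(x, y)
    return x - alpha * gx, y - alpha * gy


def jacobian_eigs(x, y, alpha=ALPHA):
    """Eigenvalues of Dg = I - alpha * Hessian; the Hessian is diag(1, 3y^2 - 1) here."""
    return 1 - alpha * 1.0, 1 - alpha * (3 * y**2 - 1)


def iterate(x, y, steps=2000):
    """Apply the map g repeatedly, starting from (x, y)."""
    for _ in range(steps):
        x, y = g(x, y)
    return x, y


# Random initialization in the square [-1, 1]^2.
random.seed(0)
limits = [iterate(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(200)]

# Fraction of runs that ended near one of the minimizers (0, +1) or (0, -1).
near_min = sum(abs(abs(y) - 1) < 1e-6 and abs(x) < 1e-6 for x, y in limits)
print(near_min / len(limits))

# A start exactly on the line y = 0 stays on it and is drawn to the saddle.
print(iterate(0.5, 0.0))
```

In this example the stable set of the saddle is the line y = 0, which has measure zero in the plane, so a randomly drawn starting point lands on it with probability zero. The Jacobian at the saddle has one eigenvalue above 1, which is the expanding direction that pushes nearby iterates away.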

Experimental results

Research questions

  • RQ1: Do gradient descent iterates converge to saddle points under random initialization?
  • RQ2: Under the strict saddle property, do gradient methods avoid saddles and converge to local minima with constant step size?
  • RQ3: What role does the step size (α < 1/L) play in ensuring convergence to minimizers?
  • RQ4: Can the proximal point interpretation extend the result to other descent-like algorithms?
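
On RQ3, the role of the step size bound can be seen from a short eigenvalue calculation. This is a sketch under the assumption that f is twice continuously differentiable with an L-Lipschitz gradient, so that every eigenvalue \lambda_i of \nabla^2 f(x) satisfies |\lambda_i| \le L; the indexing by i is notation introduced here.

```latex
% Eigenvalues of the Jacobian of the gradient map g(x) = x - \alpha \nabla f(x):
\mathrm{eig}\bigl(Dg(x)\bigr)
  = \bigl\{\, 1 - \alpha \lambda_i \,\bigr\}
  \subset [\,1 - \alpha L,\; 1 + \alpha L\,]
  \subset (0, 2)
  \quad \text{for } 0 < \alpha < 1/L ,
% hence Dg(x) is invertible everywhere:
\det Dg(x) \neq 0 .
% At a strict saddle x^*, \lambda_{\min}(\nabla^2 f(x^*)) < 0 gives
1 - \alpha \lambda_{\min} > 1 ,
% i.e. at least one expanding direction of g at x^*.
```

The first part is what keeps g locally invertible, and the second is why the local stable manifold at a strict saddle has dimension strictly smaller than that of the ambient space.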

Key findings

  • Gradient descent with a random start and 0 < α < 1/L almost surely avoids strict saddle points.
  • The global stable set of a strict saddle has measure zero, implying convergence to local minima or divergence to infinity is almost sure under random initialization.
  • If the iterates are bounded, they converge to a local minimizer rather than saddle points under the given conditions.
  • The result extends to the proximal point algorithm because its update map is a diffeomorphism whose inverse is the explicit step x ↦ x + α∇f(x), that is, gradient ascent on f (equivalently, gradient descent on -f).
  • Corollaries show that if the saddle points are countable or isolated, the probability of converging to any saddle is zero; if, in addition, the limit of the iterates exists, it is almost surely a local minimizer.
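
The relationship between the proximal point step and its inverse can be checked numerically. The sketch below is an illustration, not the authors' construction: it assumes the one-dimensional function f(z) = z⁴/4 − z²/2, a hypothetical step size ALPHA = 0.1, and a bisection solver for the implicit proximal equation z + α f'(z) = x. The names prox_step and inverse_step are made up for this example.

```python
ALPHA = 0.1  # assumed step size; makes z + ALPHA * f'(z) strictly increasing


def grad_f(z):
    """Derivative of the assumed test function f(z) = z**4/4 - z**2/2."""
    return z**3 - z


def prox_step(x, alpha=ALPHA, iters=100):
    """Proximal point step: solve z + alpha * f'(z) = x for z by bisection.

    The solution is the minimizer of f(z) + (z - x)**2 / (2 * alpha),
    which is strongly convex in z for this f and this alpha.
    """
    lo, hi = -abs(x) - 2.0, abs(x) + 2.0  # bracket containing the unique root
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mid + alpha * grad_f(mid) - x > 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


def inverse_step(z, alpha=ALPHA):
    """Explicit inverse of the proximal step: z + alpha * f'(z)."""
    return z + alpha * grad_f(z)


# Round trip: applying the explicit step after the proximal step recovers x.
for x in [-1.5, -0.3, 0.0, 0.7, 2.0]:
    z = prox_step(x)
    print(x, z, inverse_step(z))
```

The explicit inverse exists in closed form here, which is why the diffeomorphism argument carries over from gradient descent to the proximal point map once the step size keeps the implicit equation uniquely solvable.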

This review was created by AI and reviewed by human editors.