QUICK REVIEW

[Paper Review] Accelerated Gradient Descent Escapes Saddle Points Faster than Gradient Descent

Chi Jin, Praneeth Netrapalli|arXiv (Cornell University)|Nov 28, 2017
Stochastic Gradient Optimization Techniques | 123 citations
TL;DR

The paper introduces Perturbed Accelerated Gradient Descent (PAGD), a single-loop momentum-based algorithm that finds an ε-second-order stationary point in roughly Õ(1/ε^{7/4}) iterations, faster than GD’s Õ(1/ε^{2}) for nonconvex optimization without Hessians.

ABSTRACT

Nesterov's accelerated gradient descent (AGD), an instance of the general family of "momentum methods", provably achieves faster convergence rate than gradient descent (GD) in the convex setting. However, whether these methods are superior to GD in the nonconvex setting remains open. This paper studies a simple variant of AGD, and shows that it escapes saddle points and finds a second-order stationary point in $\tilde{O}(1/\epsilon^{7/4})$ iterations, faster than the $\tilde{O}(1/\epsilon^{2})$ iterations required by GD. To the best of our knowledge, this is the first Hessian-free algorithm to find a second-order stationary point faster than GD, and also the first single-loop algorithm with a faster rate than GD even in the setting of finding a first-order stationary point. Our analysis is based on two key ideas: (1) the use of a simple Hamiltonian function, inspired by a continuous-time perspective, which AGD monotonically decreases per step even for nonconvex functions, and (2) a novel framework called improve or localize, which is useful for tracking the long-term behavior of gradient-based optimization algorithms. We believe that these techniques may deepen our understanding of both acceleration algorithms and nonconvex optimization.

Motivation & Objective

  • Motivate the study of momentum methods in nonconvex optimization and their ability to escape saddle points.
  • Develop a Hessian-free, single-loop algorithm that achieves faster convergence to second-order stationary points than gradient descent.
  • Introduce a Hamiltonian-based analysis and a new improve-or-localize framework to understand acceleration in nonconvex settings.

Proposed method

  • Propose Perturbed Accelerated Gradient Descent (PAGD), a variant of AGD with perturbation and negative curvature exploitation (NCE).
  • Use a Hamiltonian function E_t = f(x_t) + ||v_t||^2/(2η), the objective plus a kinetic-energy term in the velocity v_t, to track progress despite nonmonotonic objective values.
  • Add a random perturbation when the gradient is small to escape saddles.
  • Trigger Negative Curvature Exploitation when a quadratic-like instability is detected to decrease the Hamiltonian.
  • Choose parameters η, θ, γ, s, 𝒯, and perturbation radius r to guarantee descent of the Hamiltonian.
  • Prove that PAGD achieves ε-second-order stationarity in Õ(ℓ^{1/2}ρ^{1/4}(f(x_0)-f^*)/ε^{7/4}) iterations with high probability.
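
The steps listed above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the update rules are one reading of the method as summarized here, the toy objective and all parameter values are made up for the demo rather than taken from the paper's theory, and the strict inequality plus the zero-velocity guard in the trigger test are additions that keep the sketch from dividing by zero.

```python
import numpy as np

def sample_ball(rng, dim, radius):
    """Draw a point uniformly from the ball of the given radius."""
    direction = rng.standard_normal(dim)
    direction /= np.linalg.norm(direction)
    return radius * rng.uniform() ** (1.0 / dim) * direction

def nce(f, x, v, s):
    """Negative curvature exploitation: reset velocity, possibly move by s."""
    speed = np.linalg.norm(v)
    if speed >= s:
        x_next = x  # already moving fast: just stop
    else:
        delta = s * v / speed
        x_next = min((x + delta, x - delta), key=f)  # better of the two probes
    return x_next, np.zeros_like(v)

def pagd(f, grad, x0, eta, theta, gamma, s, eps, r, T, n_iters, seed=0):
    """Sketch of perturbed AGD; returns the last iterate and Hamiltonian values."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)
    last_perturb = -(T + 1)  # allow a perturbation at the first step
    energies = []
    for t in range(n_iters):
        # Perturb when the gradient is small and no recent perturbation exists.
        if np.linalg.norm(grad(x)) <= eps and t - last_perturb > T:
            x = x + sample_ball(rng, x.size, r)
            last_perturb = t
        # Standard AGD step with momentum (1 - theta) and step size eta.
        y = x + (1.0 - theta) * v
        g_y = grad(y)
        x_next = y - eta * g_y
        v_next = x_next - x
        # Trigger NCE when f looks strongly nonconvex between x and y.
        d = x - y
        too_nonconvex = f(x) < f(y) + g_y @ d - 0.5 * gamma * (d @ d)
        if np.any(v) and too_nonconvex:  # guard: skip when v is exactly zero
            x_next, v_next = nce(f, x, v, s)
        x, v = x_next, v_next
        energies.append(f(x) + (v @ v) / (2.0 * eta))  # Hamiltonian E_t
    return x, energies

# Hypothetical toy objective: strict saddle at the origin, minima at (0, +-1).
def f(x):
    return 0.5 * x[0] ** 2 - 0.5 * x[1] ** 2 + 0.25 * x[1] ** 4

def grad(x):
    return np.array([x[0], x[1] ** 3 - x[1]])

x_final, energies = pagd(f, grad, x0=[0.0, 0.0], eta=0.1, theta=0.1,
                         gamma=0.1, s=0.1, eps=1e-3, r=1e-2, T=20, n_iters=500)
print(x_final, f(x_final))
```

Started exactly at the saddle, plain gradient descent would never move because the gradient there is zero; in this sketch the random perturbation supplies a small component along the negative-curvature direction, and the momentum and NCE steps amplify it. The recorded `energies` list makes it possible to inspect how the Hamiltonian behaves on a given problem.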

Experimental results

Research questions

  • RQ1: Can momentum-based methods yield faster convergence than GD in nonconvex settings when targeting second-order stationarity?
  • RQ2: Is there a Hessian-free, single-loop algorithm that provably finds an ε-second-order stationary point faster than GD?
  • RQ3: How can a Hamiltonian framework and perturbations help analyze and guarantee progress of acceleration methods in nonconvex optimization?
  • RQ4: What mechanisms (perturbation and negative curvature exploitation) enable efficient escape from strict saddle points?
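
The questions above hinge on what counts as an ε-second-order stationary point. Under the usual definition for a ρ-Hessian-Lipschitz function, a point x qualifies when ||∇f(x)|| ≤ ε and λ_min(∇²f(x)) ≥ -√(ρε). The check below is a small sketch of that definition on a hypothetical toy function; the function, the value of ρ, and the helper names are assumptions for illustration. The Hessian is used only to verify the condition, since the algorithm itself is Hessian-free.

```python
import numpy as np

def is_second_order_stationary(grad, hess, eps, rho):
    """True if ||grad|| <= eps and lambda_min(hess) >= -sqrt(rho * eps)."""
    grad_small = np.linalg.norm(grad) <= eps
    min_eig = np.linalg.eigvalsh(hess)[0]  # eigvalsh returns ascending order
    return bool(grad_small and min_eig >= -np.sqrt(rho * eps))

# Hypothetical toy function f(x) = 0.5*x0^2 - 0.5*x1^2 + 0.25*x1^4.
def grad_f(x):
    return np.array([x[0], x[1] ** 3 - x[1]])

def hess_f(x):
    return np.diag([1.0, 3.0 * x[1] ** 2 - 1.0])

saddle = np.array([0.0, 0.0])   # gradient is zero, but one eigenvalue is -1
minimum = np.array([0.0, 1.0])  # gradient is zero, all eigenvalues positive

# rho = 6.0 is an assumed local Hessian-Lipschitz constant for |x1| <= 1.
saddle_ok = is_second_order_stationary(grad_f(saddle), hess_f(saddle), 1e-3, 6.0)
minimum_ok = is_second_order_stationary(grad_f(minimum), hess_f(minimum), 1e-3, 6.0)
print(saddle_ok, minimum_ok)
```

The saddle passes a first-order test (zero gradient) but fails the curvature test, which is exactly the case a second-order guarantee is meant to rule out.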

Key findings

  • PAGD achieves an ε-second-order stationary point in Õ(ℓ^{1/2}ρ^{1/4}(f(x_0)-f^*)/ε^{7/4}) iterations, faster than GD.
  • PAGD is Hessian-free and single-loop, unlike prior nested-loop Hessian-based methods.
  • To the authors' knowledge, PAGD is also the first single-loop algorithm with a faster rate than GD even for finding first-order stationary points in the nonconvex setting.
  • Introduction of a computable Hamiltonian that monotonically decreases under PAGD, enabling progress tracking in nonconvex optimization.
  • Development of the improve-or-localize framework to analyze long-term behavior and acceleration effects.
  • Perturbation and Negative Curvature Exploitation steps are simple to implement and yield guaranteed Hamiltonian decrease.
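
To get a feel for the gap between the two rates, the ratio of ε^{-2} to ε^{-7/4} is ε^{-1/4}. The snippet below is a rough sketch of that arithmetic only: it ignores the logarithmic factors hidden in Õ and the problem constants ℓ, ρ, and f(x_0)-f^*, so the numbers are scalings, not iteration counts.

```python
def rate_scalings(eps):
    """Return the eps-dependence of the GD and PAGD bounds and their ratio."""
    gd = eps ** -2.0      # GD bound scales as 1/eps^2
    pagd = eps ** -1.75   # PAGD bound scales as 1/eps^(7/4)
    return gd, pagd, gd / pagd

for eps in (1e-2, 1e-4, 1e-8):
    gd, pagd, ratio = rate_scalings(eps)
    print(f"eps={eps:.0e}  gd~{gd:.1e}  pagd~{pagd:.1e}  ratio~{ratio:.1f}")
```

At ε = 10^{-4} the ratio is ε^{-1/4} = 10, so the advantage grows slowly as the target accuracy tightens.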

This review was created by AI and reviewed by human editors.