QUICK REVIEW

[Paper Review] On Nonconvex Optimization for Machine Learning: Gradients, Stochasticity, and Saddle Points

Chi Jin, Praneeth Netrapalli | arXiv (Cornell University) | Feb 13, 2019

Stochastic Gradient Optimization Techniques | 46 references | 58 citations

TL;DR

This paper analyzes perturbed gradient descent (PGD) and perturbed stochastic gradient descent (PSGD), showing that they escape saddle points efficiently in nonconvex ML and find second-order stationary points with only polylogarithmic dependence on dimension.

ABSTRACT

Gradient descent (GD) and stochastic gradient descent (SGD) are the workhorses of large-scale machine learning. While classical theory focused on analyzing the performance of these methods in convex optimization problems, the most notable successes in machine learning have involved nonconvex optimization, and a gap has arisen between theory and practice. Indeed, traditional analyses of GD and SGD show that both algorithms converge to stationary points efficiently. But these analyses do not take into account the possibility of converging to saddle points. More recent theory has shown that GD and SGD can avoid saddle points, but the dependence on dimension in these analyses is polynomial. For modern machine learning, where the dimension can be in the millions, such dependence would be catastrophic. We analyze perturbed versions of GD and SGD and show that they are truly efficient---their dimension dependence is only polylogarithmic. Indeed, these algorithms converge to second-order stationary points in essentially the same time as they take to converge to classical first-order stationary points.

Motivation & Objective

Motivate the study of nonconvex optimization in machine learning and the gap between theory and practice.
Extend convergence analysis to both deterministic and stochastic settings for nonconvex problems.
Bound iteration complexity as a function of accuracy and dimension.
Show that saddle points can be avoided efficiently using simple perturbation schemes.

Proposed method

Introduce Perturbed Gradient Descent (PGD) by adding Gaussian perturbations to GD updates.
Prove PGD finds ε-second-order stationary points in Õ(ε^{-2}) iterations with polylogarithmic dimension dependence.
Introduce Perturbed Stochastic Gradient Descent (PSGD) and Mini-batch PSGD with isotropic perturbations.
Derive the iteration complexity of PSGD for reaching ε-second-order stationarity, both with and without the assumption that the stochastic gradients are Lipschitz.
Provide parameter settings (step size η and perturbation radius r) to achieve the guarantees.
Compare with prior methods and highlight single-loop simplicity versus double-loop alternatives.
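
The perturbation idea listed above can be illustrated with a minimal sketch. It assumes the update x ← x − η(∇f(x) + ξ) with ξ drawn from an isotropic Gaussian of total radius roughly r; the toy objective, the function names (`grad_f`, `perturbed_gd`), and the values of η and r are hypothetical choices for illustration, not the parameter settings derived in the paper.

```python
import numpy as np

def grad_f(x):
    # Gradient of a toy objective (an assumption for this sketch):
    # f(x) = 0.5*x[0]**2 + 0.25*x[1]**4 - 0.5*x[1]**2,
    # which has a strict saddle at the origin and minima at (0, +1) and (0, -1).
    return np.array([x[0], x[1] ** 3 - x[1]])

def perturbed_gd(grad, x0, eta=0.1, r=1e-2, n_iters=500, seed=0):
    # Single-loop perturbed gradient descent: every step adds isotropic
    # Gaussian noise xi ~ N(0, (r**2 / d) * I) to the gradient.
    # eta and r are illustrative values, not the paper's settings.
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    d = x.size
    for _ in range(n_iters):
        xi = rng.normal(0.0, r / np.sqrt(d), size=d)
        x = x - eta * (grad(x) + xi)
    return x

# With r = 0 this is plain GD; started exactly at the saddle it never moves.
x_gd = perturbed_gd(grad_f, np.zeros(2), r=0.0)

# With a small perturbation the iterate is expected to leave the saddle
# and settle near one of the two minima.
x_pgd = perturbed_gd(grad_f, np.zeros(2), r=1e-2)
print("GD :", x_gd)
print("PGD:", x_pgd)
```

Replacing `grad` with a noisy gradient oracle would give a PSGD-style variant of the same loop; that substitution is likewise only a sketch of the idea.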

Experimental results

Research questions

RQ1: Can simple perturbations enable gradient methods to efficiently escape saddle points in high dimensions?
RQ2: What is the dimension dependence of convergence to ε-second-order stationary points for GD, SGD, and their perturbed variants?
RQ3: Under which gradient/stochastic assumptions do perturbed methods achieve polylogarithmic or linear-in-dimension iteration complexity?

Key findings

Perturbed Gradient Descent (PGD) finds ε-second-order stationary points in Õ(ε^{-2}) iterations, with only polylogarithmic dimension dependence.
Perturbed Stochastic Gradient Descent (PSGD) finds ε-second-order stationary points in Õ(ε^{-4}) iterations when the stochastic gradients are Lipschitz, matching the rate of SGD for first-order stationary points up to polylogarithmic factors.
Without the Lipschitz assumption on the stochastic gradients, PSGD incurs an extra factor of d, for a total of Õ(d ε^{-4}) iterations.
The paper argues that second-order stationarity suffices for broad classes of nonconvex ML problems in which all local minima are global and all saddle points are strict.
A simple, single-loop perturbation framework can match or improve upon multi-loop methods in escaping saddle points.
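
The review does not spell out what an ε-second-order stationary point is. Under the usual definition for a ρ-Hessian-Lipschitz function (stated here as an assumption), a point x qualifies when ‖∇f(x)‖ ≤ ε and λ_min(∇²f(x)) ≥ −√(ρε). The sketch below checks that condition numerically; the helper name `is_second_order_stationary` and the example values of ε and ρ are hypothetical.

```python
import numpy as np

def is_second_order_stationary(grad, hess, eps, rho):
    # Assumed definition: small gradient norm AND no strongly negative
    # curvature, i.e. ||grad|| <= eps and lambda_min(hess) >= -sqrt(rho * eps).
    grad_small = np.linalg.norm(grad) <= eps
    lam_min = np.linalg.eigvalsh(hess)[0]  # eigvalsh returns ascending eigenvalues
    return bool(grad_small and lam_min >= -np.sqrt(rho * eps))

# Illustrative values only: eps and rho are not taken from the paper.
eps, rho = 1e-2, 1.0

# A strict saddle: zero gradient, but one Hessian eigenvalue equals -1.
at_saddle = is_second_order_stationary(np.zeros(2), np.diag([1.0, -1.0]), eps, rho)

# A local minimum: zero gradient and a positive definite Hessian.
at_minimum = is_second_order_stationary(np.zeros(2), np.diag([1.0, 2.0]), eps, rho)

print(at_saddle, at_minimum)
```

A first-order check alone (gradient norm at most ε) would accept both points, which is why the curvature condition matters for ruling out strict saddles.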

This review was created by AI and reviewed by human editors.