QUICK REVIEW

[Paper Review] Beyond Convexity: Stochastic Quasi-Convex Optimization

Elad Hazan, Kfir Y. Levy|arXiv (Cornell University)|Jul 8, 2015

Stochastic Gradient Optimization Techniques | 11 references | 50 citations

TL;DR

This paper introduces Stochastic Normalized Gradient Descent (SNGD) for optimizing locally-quasi-convex and locally-Lipschitz functions, extending the applicability of gradient-based methods beyond convexity. It proves that SNGD converges to an $\epsilon$-optimal solution in $O(1/\epsilon^2)$ iterations and shows that, unlike vanilla SGD, SNGD provably requires a minimal minibatch size to converge.

ABSTRACT

Stochastic convex optimization is a basic and well studied primitive in machine learning. It is well known that convex and Lipschitz functions can be minimized efficiently using Stochastic Gradient Descent (SGD). The Normalized Gradient Descent (NGD) algorithm is an adaptation of Gradient Descent, which updates according to the direction of the gradients, rather than the gradients themselves. In this paper we analyze a stochastic version of NGD and prove its convergence to a global minimum for a wider class of functions: we require the functions to be quasi-convex and locally-Lipschitz. Quasi-convexity broadens the concept of unimodality to multidimensions and allows for certain types of saddle points, which are a known hurdle for first-order optimization methods such as gradient descent. Locally-Lipschitz functions are only required to be Lipschitz in a small region around the optimum. This assumption circumvents gradient explosion, which is another known hurdle for gradient descent variants. Interestingly, unlike the vanilla SGD algorithm, the stochastic normalized gradient descent algorithm provably requires a minimal minibatch size.
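
For reference, the two notions the abstract relies on can be written out as below. These are the standard textbook definitions of quasi-convexity and of the normalized gradient step; the symbols ($f$, $x_t$, $\eta$, $\alpha$) are chosen here for illustration and may not match the paper's notation exactly.

```latex
% Quasi-convexity: every sublevel set of f is convex.
\{\, x : f(x) \le \alpha \,\} \ \text{is convex for every } \alpha \in \mathbb{R}.

% For differentiable f, an equivalent first-order condition:
f(y) \le f(x) \;\Longrightarrow\; \langle \nabla f(x),\; y - x \rangle \le 0 .

% Normalized Gradient Descent step: only the direction of the gradient is used.
x_{t+1} \;=\; x_t \;-\; \eta \, \frac{\nabla f(x_t)}{\lVert \nabla f(x_t) \rVert} .
```

Because the step length is always $\eta$, the update neither stalls on a plateau where the gradient is tiny nor overshoots where the gradient is very large.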

Motivation & Objective

To extend stochastic optimization beyond convex functions to a broader class of non-convex problems.
To address the limitations of SGD in non-convex settings, particularly gradient explosion and plateaus.
To formalize a new optimization setup based on locally-quasi-convex and locally-Lipschitz functions.
To analyze the convergence of a stochastic normalized gradient descent (SNGD) algorithm under these conditions.
To establish a theoretical lower bound on the required minibatch size for SNGD convergence.

Proposed method

Proposes a stochastic version of Normalized Gradient Descent (SNGD), which updates based on gradient direction rather than magnitude.
Introduces the concept of local-quasi-convexity, generalizing unimodal functions to allow certain saddle points and plateaus.
Imposes a local-Lipschitz condition, allowing unbounded gradients far from the optimum while ensuring boundedness near the minimum.
Uses minibatch gradient estimation with a minimal batch size to stabilize updates and prevent divergence.
Analyzes convergence via a Markov chain model on a discrete lattice, proving absorption probability bounds.
Employs a constant step size $\eta = \epsilon / G$, where $G$ is a bound on gradient magnitude.
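
The update described in the list above can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names (`sngd`, `toy_grad`), the toy objective (a convex, hence quasi-convex, squared distance to noisy data points), and all hyperparameter values are assumptions made for this example. For simplicity the sketch returns the last iterate.

```python
import numpy as np

def sngd(grad_fn, x0, data, eta, num_steps, batch_size, seed=0):
    """Sketch of stochastic normalized gradient descent.

    grad_fn(x, batch) is a hypothetical callback returning the gradient
    averaged over the minibatch. Each step moves a fixed distance eta
    along the negative minibatch-gradient direction.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(num_steps):
        # Sample a minibatch without replacement.
        idx = rng.choice(len(data), size=batch_size, replace=False)
        g = grad_fn(x, data[idx])
        norm = np.linalg.norm(g)
        if norm == 0.0:
            break  # no direction to follow
        # Normalized step: the direction matters, the magnitude is discarded.
        x -= eta * g / norm
    return x

def toy_grad(x, batch):
    # Gradient of the minibatch average of 0.5 * ||x - a_i||^2.
    return x - batch.mean(axis=0)

# Toy data clustered around (1, -2); the minimizer is the data mean.
data = np.random.default_rng(1).normal(loc=[1.0, -2.0], scale=0.1, size=(1000, 2))
x_final = sngd(toy_grad, x0=[4.0, 1.0], data=data,
               eta=0.01, num_steps=600, batch_size=32)
print("distance to minimizer:", np.linalg.norm(x_final - data.mean(axis=0)))
```

With a constant step size the iterate cannot settle exactly on the minimizer; it ends up hovering within a small multiple of `eta` of it, which matches the idea of targeting an $\epsilon$-optimal point instead of an exact one.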

Experimental results

Research questions

RQ1: Can stochastic gradient methods be provably effective for non-convex problems beyond convexity?
RQ2: What conditions on the objective function allow for convergence of normalized gradient descent in stochastic settings?
RQ3: Why does standard SGD fail in the presence of gradient plateaus or explosions, and how can this be mitigated?
RQ4: What is the minimal minibatch size required for SNGD to converge, and why is it necessary?
RQ5: Can SNGD achieve the same convergence rate as SGD for convex problems in a broader class of non-convex functions?

Key findings

SNGD converges to an $\epsilon$-optimal solution in $O(1/\epsilon^2)$ iterations for locally-quasi-convex and locally-Lipschitz functions.
The algorithm provably requires a minimal minibatch size; smaller batches may cause divergence due to unstable gradient estimates.
For functions smooth in an $\Omega(\sqrt{\epsilon})$-region around the optimum, SNGD achieves a faster $O(1/\epsilon)$ convergence rate.
The probability of SNGD ever reaching an $\epsilon$-optimal solution is bounded above by $\left(\frac{1}{4}\right)^{9}$ when $\epsilon \leq 0.1$, under the given setup.
Empirical results show SNGD performs comparably to Nesterov’s accelerated method on MNIST with a single hidden layer network.
Increasing minibatch size significantly improves SNGD’s convergence performance, supporting the theoretical need for larger batches.
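
One intuition for the minibatch findings above is that normalization throws away the gradient's magnitude, so a very noisy estimate yields a nearly arbitrary unit step. The toy simulation below measures the average cosine between the normalized minibatch gradient and the true gradient direction for several batch sizes. It is an illustrative sketch only, not the paper's experiment or proof; the Gaussian noise model, `noise_std`, and the function name `mean_alignment` are assumptions made here.

```python
import numpy as np

def mean_alignment(batch_size, noise_std=5.0, trials=2000, seed=0):
    """Average cosine between the true gradient direction and the
    normalized average of `batch_size` noisy gradient samples."""
    rng = np.random.default_rng(seed)
    true_grad = np.array([1.0, 0.0])  # unit-norm "true" gradient
    noise = rng.normal(scale=noise_std, size=(trials, batch_size, 2))
    # Minibatch estimate: average of noisy per-sample gradients.
    g = (true_grad + noise).mean(axis=1)
    directions = g / np.linalg.norm(g, axis=1, keepdims=True)
    # true_grad has unit norm, so the dot product is the cosine.
    return float((directions @ true_grad).mean())

for b in (1, 16, 256):
    print(f"batch size {b:>3}: mean cosine = {mean_alignment(b):.3f}")
```

In this toy setting the alignment improves steadily as the batch grows, which is consistent with the review's point that larger minibatches help SNGD.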

This review was created by AI and reviewed by human editors.