QUICK REVIEW

[Paper Review] Global Convergence of Online Limited Memory BFGS

Aryan Mokhtari, Alejandro Ribeiro|arXiv (Cornell University)|Sep 6, 2014

Stochastic Gradient Optimization Techniques|32 references|132 citations

TL;DR

This paper establishes global convergence for an online limited memory BFGS (oL-BFGS) method in stochastic optimization, proving almost sure convergence to the optimal solution when the Hessian eigenvalues of the sample functions are bounded. The method uses differences of stochastic gradients to approximate curvature, and convergence is guaranteed when the stepsize parameters satisfy a condition involving the Hessian bounds. In experiments, oL-BFGS converges faster than SGD and needs less storage and computation than other online quasi-Newton methods.

ABSTRACT

Global convergence of an online (stochastic) limited memory version of the Broyden-Fletcher-Goldfarb-Shanno (BFGS) quasi-Newton method for solving optimization problems with stochastic objectives that arise in large scale machine learning is established. Lower and upper bounds on the Hessian eigenvalues of the sample functions are shown to suffice to guarantee that the curvature approximation matrices have bounded determinants and traces, which, in turn, permits establishing convergence to optimal arguments with probability 1. Numerical experiments on support vector machines with synthetic data showcase reductions in convergence time relative to stochastic gradient descent algorithms as well as reductions in storage and computation relative to other online quasi-Newton methods. Experimental evaluation on a search engine advertising problem corroborates that these advantages also manifest in practical applications.

Motivation & Objective

To establish global convergence of an online limited memory BFGS (oL-BFGS) method for stochastic optimization problems with large-scale machine learning objectives.
To show that lower and upper bounds on the Hessian eigenvalues of the sample functions suffice to guarantee that the curvature approximation matrices have bounded determinants and traces.
To demonstrate that oL-BFGS achieves almost sure convergence to the optimal solution under mild assumptions on the stepsize sequence and Hessian bounds.
To validate the theoretical advantages through numerical experiments on synthetic SVM data and a real-world search engine advertising problem.

Proposed method

The method extends the BFGS quasi-Newton framework to online stochastic settings by using stochastic gradients both to form descent directions and to build curvature approximations.
It employs a limited-memory structure that stores only a small number of recent curvature pairs instead of a full Hessian approximation, which reduces storage and computational cost per iteration.
The curvature approximation matrices are shown to have bounded determinants and traces under the assumption of bounded Hessian eigenvalues for sample functions.
The stepsize follows $\epsilon_t = \epsilon_0 T_0 / (T_0 + t)$, and the expected-error bound requires the parameters to satisfy $2\epsilon_0 T_0 / C > 1$, where $C$ is a constant tied to the Hessian bounds.
Theoretical analysis uses a Lyapunov function and recursive inequalities to bound the expected optimality gap $\mathbb{E}[F(\mathbf{w}_t)] - F(\mathbf{w}^*)$.
Convergence in expectation is shown via a recursive bound that decays at the sublinear rate $O(1/t)$, with constants that depend on the Hessian bounds and stepsize parameters.
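The steps listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the ridge-regularized least-squares objective, the function names (`two_loop`, `stoch_grad`, `olbfgs`), and all default parameter values are hypothetical choices made for the example. The point it tries to show is that the gradient variation `r` is computed on the same minibatch before and after the step, and that only the last `tau` curvature pairs are stored.

```python
import numpy as np


def two_loop(grad, pairs):
    """Apply the L-BFGS inverse-curvature approximation to grad.

    pairs holds (v, r) tuples, oldest first: v is a variable variation and
    r the matching stochastic gradient variation.
    """
    q = grad.copy()
    coeffs = []
    for v, r in reversed(pairs):  # newest pair to oldest
        rho = 1.0 / (r @ v)
        alpha = rho * (v @ q)
        q -= alpha * r
        coeffs.append((alpha, rho))
    if pairs:  # initial scaling taken from the newest pair
        v, r = pairs[-1]
        q *= (v @ r) / (r @ r)
    for (v, r), (alpha, rho) in zip(pairs, reversed(coeffs)):  # oldest to newest
        beta = rho * (r @ q)
        q += (alpha - beta) * v
    return q


def stoch_grad(w, X, y, idx, lam):
    """Minibatch gradient of an assumed ridge least-squares objective."""
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / len(idx) + lam * w


def olbfgs(X, y, lam=0.01, tau=5, eps0=0.3, T0=50, batch=50, iters=1000, seed=0):
    """Online L-BFGS sketch with stepsize eps0 * T0 / (T0 + t)."""
    rng = np.random.default_rng(seed)
    n_samples, dim = X.shape
    w = np.zeros(dim)
    pairs = []  # at most tau curvature pairs are kept
    for t in range(iters):
        idx = rng.choice(n_samples, size=batch, replace=False)
        g = stoch_grad(w, X, y, idx, lam)
        d = two_loop(g, pairs)
        eps = eps0 * T0 / (T0 + t)
        w_new = w - eps * d
        # Gradient variation on the SAME samples, so r reflects curvature
        # and not sampling noise between two different minibatches.
        r = stoch_grad(w_new, X, y, idx, lam) - g
        v = w_new - w
        if r @ v > 1e-12:  # keep only pairs with positive curvature
            pairs.append((v, r))
            pairs = pairs[-tau:]
        w = w_new
    return w
```

With an empty memory `two_loop` returns the stochastic gradient unchanged, so the first iteration is a plain SGD step; the quasi-Newton correction only appears once curvature pairs have been collected.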

Experimental results

Research questions

RQ1: Can global convergence be established for an online limited memory BFGS method in stochastic optimization with only bounded Hessian eigenvalues?
RQ2: Does the curvature approximation matrix remain well-conditioned under stochastic gradient updates when Hessian eigenvalues are bounded?
RQ3: Can the oL-BFGS method achieve faster convergence than stochastic gradient descent in large-scale machine learning problems?
RQ4: What conditions on the stepsize sequence ensure almost sure convergence to the optimal solution?
RQ5: Do the theoretical advantages of oL-BFGS manifest in practical applications beyond synthetic data?
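The conditioning question can be probed numerically with a small sketch. It assumes the limited-memory matrix is built from a scaled identity followed by one BFGS update per stored pair, and that each gradient variation is `r = H v` for a symmetric `H` with eigenvalues in `[m, M]`. Under those assumptions each update adds at most `M` to the trace, so the trace stays below `(dim + tau) * M`; the paper's exact constants may differ. The names `bfgs_update`, `curvature_matrix`, and `random_pairs` are hypothetical helpers.

```python
import numpy as np


def bfgs_update(B, v, r):
    """One dense BFGS update of the curvature approximation B."""
    Bv = B @ v
    return B - np.outer(Bv, Bv) / (v @ Bv) + np.outer(r, r) / (r @ v)


def curvature_matrix(pairs):
    """Dense limited-memory B: scaled identity, then one update per pair."""
    v, r = pairs[-1]
    B = (r @ r) / (v @ r) * np.eye(len(v))  # scaling from the newest pair
    for v, r in pairs:  # oldest to newest
        B = bfgs_update(B, v, r)
    return B


def random_pairs(rng, dim, tau, m, M):
    """Hypothetical test data: r = H v with eigenvalues of H in [m, M]."""
    pairs = []
    for _ in range(tau):
        Q = np.linalg.qr(rng.standard_normal((dim, dim)))[0]
        H = Q @ np.diag(rng.uniform(m, M, size=dim)) @ Q.T
        v = rng.standard_normal(dim)
        pairs.append((v, H @ v))
    return pairs
```

A typical check draws several sets of pairs, builds `curvature_matrix(pairs)`, and compares `np.trace(B)` against `(dim + tau) * M` while confirming that the smallest eigenvalue stays positive.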

Key findings

Global convergence to the optimal solution is proven with probability 1 under the assumption that the Hessian eigenvalues of the sample functions are bounded between $m > 0$ and $M < \infty$.
The curvature approximation matrices used in oL-BFGS have bounded determinants and traces, which is essential for convergence stability.
The expected optimality gap $\mathbb{E}[F(\mathbf{w}_t)] - F(\mathbf{w}^*)$ decays at a sublinear rate of order $O(1/t)$ when the stepsize condition $2\epsilon_0 T_0 / C > 1$ is satisfied.
Numerical experiments on synthetic SVM data show that oL-BFGS reduces convergence time relative to SGD and reduces storage and computation relative to other online quasi-Newton methods.
In a real-world search engine advertising task, oL-BFGS achieves faster convergence with lower storage and computational costs than competing methods.
The method demonstrates robust performance across both ill-conditioned and well-conditioned problems, outperforming SGD in convergence speed while maintaining low memory usage.
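The role of a stepsize condition of the form "constant greater than 1" can be illustrated with a standard recursion lemma for stepsizes proportional to $1/(T_0 + t)$. The sketch below uses generic constants $c$ and $b$ and a generic nonnegative sequence $u_t$ standing in for the expected optimality gap; these symbols are assumptions for illustration and are not the paper's own constants.

```latex
\begin{align*}
&\text{Suppose } u_t \ge 0 \text{ satisfies }
u_{t+1} \le \Bigl(1 - \frac{c}{T_0 + t}\Bigr) u_t + \frac{b}{(T_0 + t)^2},
\qquad c > 1,\; b > 0,\; T_0 \ge c. \\
&\text{Then } u_t \le \frac{Q}{T_0 + t} \text{ for all } t \ge 0,
\qquad Q = \max\Bigl\{ \frac{b}{c - 1},\; T_0 u_0 \Bigr\}. \\
&\text{Induction step, with } s = T_0 + t \text{ and } u_t \le Q/s: \\
&u_{t+1} \le \Bigl(1 - \frac{c}{s}\Bigr)\frac{Q}{s} + \frac{b}{s^2}
= \frac{Q}{s} - \frac{cQ - b}{s^2}
\le \frac{Q}{s} - \frac{Q}{s^2}
\le \frac{Q}{s + 1}.
\end{align*}
```

The third inequality uses $(c - 1) Q \ge b$, which is where $c > 1$ is needed, and the last one uses $(s - 1)(s + 1) \le s^2$. If $c \le 1$ the same argument does not go through, which is why a condition of this type appears alongside the $O(1/t)$ rate.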


This review was created by AI and reviewed by human editors.