[Paper Review] A Stochastic Quasi-Newton Method for Large-Scale Optimization
This paper proposes a stochastic quasi-Newton method for large-scale optimization that improves upon stochastic gradient descent by incorporating reliable curvature information through periodic sub-sampled Hessian-vector products, rather than noisy gradient differences. The method uses limited-memory BFGS updates with a stable, scalable Hessian approximation, achieving faster convergence and better performance on machine learning problems compared to existing stochastic quasi-Newton approaches.
The question of how to incorporate curvature information in stochastic approximation methods is challenging. The direct application of classical quasi-Newton updating techniques for deterministic optimization leads to noisy curvature estimates that have harmful effects on the robustness of the iteration. In this paper, we propose a stochastic quasi-Newton method that is efficient, robust and scalable. It employs the classical BFGS update formula in its limited memory form, and is based on the observation that it is beneficial to collect curvature information pointwise, and at regular intervals, through (sub-sampled) Hessian-vector products. This technique differs from the classical approach that would compute differences of gradients, and where controlling the quality of the curvature estimates can be difficult. We present numerical results on problems arising in machine learning that suggest that the proposed method shows much promise.
Motivation & Objective
- To develop a scalable, robust stochastic quasi-Newton method for large-scale machine learning problems where full-batch Hessian computation is infeasible.
- To address the instability of curvature estimates in stochastic quasi-Newton methods caused by noisy gradient differences.
- To enable efficient incorporation of second-order information in stochastic approximation settings without incurring prohibitive computational costs.
- To ensure global convergence for strongly convex functions while maintaining low per-iteration cost through amortized Hessian-vector product computation.
- To outperform existing stochastic quasi-Newton methods like oLBFGS in terms of convergence speed and robustness on large-scale learning problems.
Proposed method
- The method employs the limited-memory BFGS update formula to maintain an inverse Hessian approximation $ H_k $; applying it to a gradient costs $ O(Mn) $ operations, where $ M $ is the number of stored curvature pairs and $ n $ is the number of variables.
- Curvature information is gathered via sub-sampled Hessian-vector products $ \nabla^2 F(w) v $ once every $ L $ iterations, rather than through gradient differences at every iteration.
- The Hessian-vector products are computed using mini-batches of size $ b_H $, ensuring stable and uniform curvature estimates with controlled noise.
- The algorithm uses a diminishing step size $ \alpha^k = \beta / k $, ensuring convergence under standard convexity assumptions.
- The method avoids the instability of gradient-difference-based Hessian estimation: each curvature vector comes from a single Hessian-vector product evaluated on one sample, so its quality does not depend on the difference of two noisy gradients.
- The inverse Hessian approximation $ H_k $ is updated only every $ L $ iterations, amortizing the cost of Hessian-vector products while maintaining effective curvature information.
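The steps listed above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the test problem (L2-regularized logistic regression), the helper names (`make_data`, `two_loop`, `sqn`), the default parameter values, and the safeguard that skips a pair when $ s^T y $ is not positive are all choices made here for illustration. The curvature pair is formed from the displacement between iterates averaged over consecutive windows of $ L $ steps and a sub-sampled Hessian-vector product on $ b_H $ points.

```python
import numpy as np


def sigmoid(z):
    # tanh form avoids overflow in exp for large |z|
    return 0.5 * (1.0 + np.tanh(0.5 * z))


def loss(w, X, y, lam):
    # Mean logistic loss (labels in {0, 1}) plus an L2 term (assumed here)
    z = X @ w
    return np.mean(np.logaddexp(0.0, z) - y * z) + 0.5 * lam * (w @ w)


def grad(w, X, y, lam):
    return X.T @ (sigmoid(X @ w) - y) / len(y) + lam * w


def hess_vec(w, X, v, lam):
    # Hessian-vector product without ever forming the n x n Hessian
    p = sigmoid(X @ w)
    return X.T @ (p * (1.0 - p) * (X @ v)) / X.shape[0] + lam * v


def two_loop(g, pairs):
    # Standard L-BFGS two-loop recursion: returns H @ g using the stored
    # (s, y) pairs, ordered oldest to newest. Cost is O(M n).
    q = g.copy()
    alphas = []
    for s, y in reversed(pairs):
        a = (s @ q) / (y @ s)
        alphas.append(a)
        q -= a * y
    s, y = pairs[-1]
    q *= (s @ y) / (y @ y)  # initial scaling from the newest pair
    for (s, y), a in zip(pairs, reversed(alphas)):
        b = (y @ q) / (y @ s)
        q += (a - b) * s
    return q


def sqn(X, y, lam=1e-3, b=32, b_H=256, L=10, M=5, beta=1.0,
        iters=2000, seed=0):
    rng = np.random.default_rng(seed)
    N, n = X.shape
    w = np.zeros(n)
    pairs = []             # at most M curvature pairs
    w_sum = np.zeros(n)    # running sum of iterates in the current window
    w_bar_prev = None
    for k in range(1, iters + 1):
        S = rng.choice(N, size=b, replace=False)
        g = grad(w, X[S], y[S], lam)
        w_sum += w
        # Plain SGD step until the first curvature pair is available
        d = two_loop(g, pairs) if pairs else g
        w = w - (beta / k) * d
        if k % L == 0:
            w_bar = w_sum / L
            w_sum = np.zeros(n)
            if w_bar_prev is not None:
                S_H = rng.choice(N, size=b_H, replace=False)
                s = w_bar - w_bar_prev
                yv = hess_vec(w_bar, X[S_H], s, lam)
                if s @ yv > 1e-12:  # safeguard assumed for this sketch
                    pairs.append((s, yv))
                    pairs = pairs[-M:]
            w_bar_prev = w_bar
    return w


def make_data(N=2000, n=10, seed=0):
    # Synthetic data, purely for demonstration
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((N, n))
    w_true = rng.standard_normal(n)
    y = (rng.random(N) < sigmoid(X @ w_true)).astype(float)
    return X, y


if __name__ == "__main__":
    X, y = make_data()
    w = sqn(X, y)
    print("initial loss:", loss(np.zeros(X.shape[1]), X, y, 1e-3))
    print("final loss:  ", loss(w, X, y, 1e-3))
```

Note that the Hessian-vector product is only evaluated once per window of $ L $ iterations, while the two-loop recursion runs at every iteration once a pair exists; this is the amortization the bullets describe.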
Experimental results
Research questions
- RQ1: Can curvature information be reliably extracted in stochastic optimization without relying on noisy gradient differences?
- RQ2: How can Hessian-vector products be used effectively to build a stable, scalable quasi-Newton method in the stochastic regime?
- RQ3: Does incorporating full Hessian approximations via Hessian-vector products lead to faster convergence than diagonal or no Hessian scaling in stochastic quasi-Newton methods?
- RQ4: What is the optimal trade-off between the frequency of Hessian-vector product computation and the quality of curvature approximation?
- RQ5: Can the proposed method achieve global convergence in the stochastic setting while maintaining low per-iteration complexity?
Key findings
- The proposed method converges faster than the classical stochastic gradient descent method of Robbins and Monro in the reported experiments, suggesting that curvature information can significantly improve optimization performance.
- The method outperforms the state-of-the-art stochastic quasi-Newton method oLBFGS on large-scale machine learning problems, as shown by numerical experiments.
- Using Hessian-vector products at regular intervals provides stable curvature estimates, avoiding the noise amplification issues inherent in gradient difference methods.
- The method maintains global convergence for strongly convex functions under standard assumptions, with convergence rate improvements attributed to effective Hessian approximation.
- The computational cost is amortized by using a moderate batch size $ b_H $ for Hessian-vector products and a spacing $ L = 20 $, making the method practical for large-scale problems.
- The algorithm is effective even in non-convex settings when the condition $ s_t^T y_t > 0 $ is enforced, indicating potential for broader applicability.
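The amortization argument in the findings above can be made concrete with a rough operation count. The sketch below is a back-of-the-envelope model, assuming a stochastic gradient on $ b $ samples costs about $ b n $ operations, a Hessian-vector product on $ b_H $ samples costs about $ b_H n $, and the L-BFGS two-loop recursion costs about $ 4 M n $. The function name and the values of $ b $, $ b_H $ and $ M $ in the usage line are hypothetical; only $ L = 20 $ comes from the text.

```python
def sqn_to_sgd_cost_ratio(b, b_H, L, M):
    """Approximate per-iteration cost of the quasi-Newton step relative
    to plain SGD, in units of n (the number of variables)."""
    sgd = b                        # one mini-batch gradient
    sqn = b + b_H / L + 4 * M      # gradient + amortized HVP + two-loop
    return sqn / sgd


if __name__ == "__main__":
    # Hypothetical setting: b = 50, b_H = 1000, M = 5, with L = 20
    print(sqn_to_sgd_cost_ratio(b=50, b_H=1000, L=20, M=5))
```

Under this model the Hessian-vector term shrinks as $ L $ grows, so with a small gradient batch the fixed $ 4 M n $ cost of the recursion can become the dominant overhead.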
This review was created by AI and reviewed by human editors.