[Paper Review] Sub-Sampled Newton Methods II: Local Convergence Rates
This paper analyzes sub-sampled Newton methods for large-scale optimization, proposing variants that sub-sample the Hessian and/or the gradient to reduce computational cost while preserving local convergence. Using random matrix concentration and approximate matrix multiplication, it establishes locally Q-linear and Q-superlinear rates when only the Hessian is sub-sampled, and locally R-linear and R-superlinear rates when the gradient is sub-sampled as well. The rates themselves are problem-independent, although the required sample size depends on the condition number.
Many data-fitting applications require the solution of an optimization problem involving a sum of a large number of functions of a high-dimensional parameter. Here, we consider the problem of minimizing a sum of $n$ functions over a convex constraint set $\mathcal{X} \subseteq \mathbb{R}^{p}$ where both $n$ and $p$ are large. In such problems, sub-sampling as a way to reduce $n$ can offer a great amount of computational efficiency. Within the context of second-order methods, we first give quantitative local convergence results for variants of Newton's method where the Hessian is uniformly sub-sampled. Using random matrix concentration inequalities, one can sub-sample in a way that the curvature information is preserved. Using such a sub-sampling strategy, we establish locally Q-linear and Q-superlinear convergence rates. We also give additional convergence results for when the sub-sampled Hessian is regularized by modifying its spectrum or by Levenberg-type regularization. Finally, in addition to Hessian sub-sampling, we consider sub-sampling the gradient as a way to further reduce the computational complexity per iteration. We use approximate matrix multiplication results from randomized numerical linear algebra (RandNLA) to obtain the proper sampling strategy and we establish locally R-linear convergence rates. In such a setting, we also show that a very aggressive sample size increase results in an R-superlinearly convergent algorithm. While the sample size depends on the condition number of the problem, our convergence rates are problem-independent, i.e., they do not depend on the quantities related to the problem. Hence, our analysis here can be used to complement the results of our basic framework from the companion paper, [38], by exploring algorithmic trade-offs that are important in practice.
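To make the setting concrete, the problem and the Hessian-sub-sampled step can be written out as below. This is a sketch in assumed notation: the $1/n$ normalization, the sample set $\mathcal{S}$, the step size $\alpha_k$, and the iterate $x^{(k)}$ are symbols introduced here for illustration and may differ from the paper's.

$$\min_{x \in \mathcal{X}} F(x) = \frac{1}{n} \sum_{i=1}^{n} f_i(x), \qquad H(x) = \frac{1}{|\mathcal{S}|} \sum_{j \in \mathcal{S}} \nabla^2 f_j(x), \quad \mathcal{S} \subseteq \{1, \dots, n\} \text{ drawn uniformly at random},$$

$$x^{(k+1)} = \arg\min_{x \in \mathcal{X}} \left\{ F(x^{(k)}) + (x - x^{(k)})^{T} \nabla F(x^{(k)}) + \frac{1}{2\alpha_k} (x - x^{(k)})^{T} H(x^{(k)}) (x - x^{(k)}) \right\}.$$

When $\mathcal{X} = \mathbb{R}^{p}$ and $H(x^{(k)})$ is positive definite, setting the gradient of the quadratic model to zero gives the familiar form $x^{(k+1)} = x^{(k)} - \alpha_k H(x^{(k)})^{-1} \nabla F(x^{(k)})$.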
Motivation & Objective
- To develop efficient second-order optimization methods for large-scale problems with high-dimensional parameters and many data points.
- To analyze the local convergence behavior of sub-sampled Newton methods where the Hessian is approximated via random sub-sampling.
- To investigate the impact of regularization on sub-sampled Hessian matrices and its effect on convergence rates.
- To extend the analysis to fully stochastic variants where both gradient and Hessian are sub-sampled.
- To provide convergence rates that are problem-independent (the condition number enters only through the required sample size), enabling broader applicability to big data problems.
Proposed method
- Uses uniform sub-sampling of the Hessian to reduce computational cost while preserving curvature information via random matrix concentration inequalities.
- Applies approximate matrix multiplication techniques from randomized numerical linear algebra (RandNLA) to derive a suitable sampling strategy for gradient sub-sampling.
- Introduces Levenberg-type (ridge) regularization and spectrum modification to stabilize early iterations, with theoretical justification for their limited utility in later phases.
- Establishes error recursion with composite behavior: quadratic dominance far from optimum, transitioning to linear near the solution.
- Requires the sub-problem at each iteration to be solved exactly for the convergence guarantees to hold, which is noted as a computational bottleneck.
- Analyzes both independent and simultaneous sampling strategies for Hessian and gradient sub-sampling, showing that progressive increase in sample size enables R-superlinear convergence.
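The Hessian sub-sampling and Levenberg-type steps listed above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's algorithm: it assumes an unconstrained problem ($\mathcal{X} = \mathbb{R}^{p}$), a ridge-regularized logistic loss chosen here only as an example objective, a unit step size, and uniform sampling without replacement. The names `full_gradient`, `subsampled_newton`, `growth`, and `mu` are hypothetical.

```python
import numpy as np


def full_gradient(A, b, x, lam):
    """Gradient of F(x) = mean(log(1 + exp(-b_i * a_i.x))) + (lam/2)*||x||^2."""
    s = 1.0 / (1.0 + np.exp(b * (A @ x)))  # sigmoid(-b_i * a_i.x)
    return -(A.T @ (b * s)) / A.shape[0] + lam * x


def subsampled_newton(A, b, lam=1e-2, sample_size=500, growth=1.0,
                      mu=0.0, iters=25, seed=0):
    """Newton iterations with a full gradient and a uniformly sub-sampled Hessian.

    growth > 1 enlarges the Hessian sample geometrically across iterations;
    mu > 0 adds a Levenberg-type shift mu * I. Both knobs are illustrative.
    """
    rng = np.random.default_rng(seed)
    n, p = A.shape
    x = np.zeros(p)
    for k in range(iters):
        grad = full_gradient(A, b, x, lam)
        # Uniform sub-sample of the n terms (without replacement, by assumption).
        m = min(n, int(sample_size * growth ** k))
        idx = rng.choice(n, size=m, replace=False)
        As, bs = A[idx], b[idx]
        s = 1.0 / (1.0 + np.exp(bs * (As @ x)))
        w = s * (1.0 - s)  # per-sample curvature weights of the logistic loss
        # Sub-sampled Hessian: average of w_j * a_j a_j^T plus the ridge term.
        H = (As.T * w) @ As / m + (lam + mu) * np.eye(p)
        # The linear system is solved exactly, as the analysis assumes.
        x = x - np.linalg.solve(H, grad)
    return x
```

A call such as `subsampled_newton(A, b, sample_size=500, growth=1.5)` uses a growing Hessian sample, while `mu=0.1` switches on the Levenberg-type shift. Each iteration still costs one pass over all `n` terms for the gradient; only the Hessian cost scales with the sample size.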
Experimental results
Research questions
- RQ1: Under what conditions does Hessian sub-sampling preserve local convergence properties of Newton's method?
- RQ2: How does regularization of the sub-sampled Hessian affect convergence rates, and when is it beneficial?
- RQ3: Can both Hessian and gradient be sub-sampled simultaneously while maintaining local convergence guarantees?
- RQ4: What sampling strategy ensures locally R-linear or R-superlinear convergence in a fully stochastic Newton method?
- RQ5: How do convergence rates depend on problem-specific parameters such as condition number, and can they be made problem-independent?
Key findings
- Sub-sampled Newton methods with full gradient and uniformly sub-sampled Hessian achieve locally Q-linear convergence, with error recursion transitioning from quadratic to linear dominance as iterates approach the optimum.
- By progressively increasing the Hessian sub-sample size, the method achieves locally Q-superlinear convergence, demonstrating improved asymptotic behavior.
- Regularization of the sub-sampled Hessian (via spectrum modification or Levenberg-type) improves early-stage convergence but is suboptimal near the solution, where unregularized sub-sampling performs better.
- When both Hessian and gradient are sub-sampled, the algorithm achieves locally R-linear convergence, with a more aggressive sample size increase enabling R-superlinear convergence.
- All convergence rates are problem-independent, meaning they do not depend on condition numbers or other problem-specific quantities; the condition number affects only the sample size required to attain them.
- The analysis provides a theoretical foundation for algorithmic trade-offs in practice, balancing computational cost and convergence speed without sacrificing convergence guarantees.
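For reference, the rate notions used in these findings have standard definitions, and the composite error recursion can be unpacked in a few lines. Write $e_k = \|x^{(k)} - x^{*}\|$ for the distance to the optimum; the constants $\rho$, $\rho'$, $\xi$ and the sequence $\nu_k$ are placeholder symbols assumed here, not the paper's exact expressions, and the probabilistic qualifiers of the actual theorems are omitted.

$$\text{Q-linear: } e_{k+1} \le \rho\, e_k \text{ for some } \rho \in (0,1); \qquad \text{Q-superlinear: } \lim_{k \to \infty} \frac{e_{k+1}}{e_k} = 0; \qquad \text{R-linear: } e_k \le \nu_k \text{ with } \nu_k \to 0 \text{ Q-linearly}.$$

$$e_{k+1} \le \rho\, e_k + \xi\, e_k^{2} = (\rho + \xi\, e_k)\, e_k .$$

The quadratic term dominates while $e_k > \rho/\xi$ and the linear term once $e_k < \rho/\xi$. If $e_0 \le (\rho' - \rho)/\xi$ for some $\rho < \rho' < 1$, then $\rho + \xi e_0 \le \rho'$, so $e_1 \le \rho' e_0 < e_0$, and by induction $e_{k+1} \le \rho' e_k$ for all $k$, which is Q-linear convergence with rate $\rho'$. If the sample size grows so that the linear coefficient becomes a sequence $\rho_k \to 0$, the same bound gives $e_{k+1}/e_k \le \rho_k + \xi e_k \to 0$, which matches the Q-superlinear behavior described for progressively increasing Hessian samples.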
This review was created by AI and reviewed by human editors.