QUICK REVIEW

[Paper Review] Nonlinear Acceleration of Stochastic Algorithms

Damien Scieur, Alexandre d’Aspremont|arXiv (Cornell University)|Jun 22, 2017

Stochastic Gradient Optimization Techniques | 17 citations

TL;DR

This paper extends nonlinear acceleration to stochastic optimization. It extrapolates the iterates of a stochastic gradient method with a weighted combination of past iterates, where the weights are computed from the iterates themselves (which is what makes the scheme nonlinear), and achieves faster convergence without requiring knowledge of the strong convexity parameter. The paper derives convergence bounds under noisy, possibly biased perturbations and reports significant performance gains when the method is applied to SGD, SAGA, SVRG, and Katyusha across multiple datasets.

ABSTRACT

Extrapolation methods use the last few iterates of an optimization algorithm to produce a better estimate of the optimum. They were shown to achieve optimal convergence rates in a deterministic setting using simple gradient iterates. Here, we study extrapolation methods in a stochastic setting, where the iterates are produced by either a simple or an accelerated stochastic gradient algorithm. We first derive convergence bounds for arbitrary, potentially biased perturbations, then produce asymptotic bounds using the ratio between the variance of the noise and the accuracy of the current point. Finally, we apply this acceleration technique to stochastic algorithms such as SGD, SAGA, SVRG and Katyusha in different settings, and show significant performance gains.

Motivation & Objective

To extend nonlinear extrapolation techniques—previously effective in deterministic settings—to stochastic optimization with noisy gradients.
To analyze convergence bounds under arbitrary, potentially biased perturbations, including stochastic noise in gradient estimates.
To derive asymptotic convergence rates based on the ratio between noise variance and current iterate accuracy.
To empirically validate the acceleration method on stochastic algorithms like SGD, SAGA, SVRG, and Katyusha across diverse datasets and settings.

Proposed method

The method applies nonlinear extrapolation to iterates generated by stochastic first-order oracle updates, using a linear combination of past iterates to produce a more accurate estimate of the optimal solution.
It generalizes the deterministic nonlinear acceleration framework of Scieur et al. (2016) to handle stochastic perturbations by modeling the iterates as a perturbed version of a linearized system around the optimum.
Convergence bounds are derived by tracking the difference between the true gradient flow and the perturbed iterates, using tools from control theory and polynomial extrapolation.
The coefficients for the linear combination are computed using a data-driven approach based on minimizing the residual error in the linearized model.
Theoretical analysis includes both finite-sample bounds and asymptotic convergence rates dependent on the noise-to-accuracy ratio.
The approach is applied to multiple stochastic algorithms, including SGD, SAGA, SVRG, and Katyusha, with empirical evaluation on image classification and tabular datasets.
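The extrapolation step described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: it assumes the regularized formulation from the earlier deterministic work (choose weights c that sum to one and minimize the norm of the combined residuals plus a penalty lam on the weights), and it uses a toy quadratic objective. The names rna_extrapolate, run_gradient_steps, lam, and noise_std are hypothetical and chosen only for this example.

```python
import numpy as np


def run_gradient_steps(A, b, x0, step, n_steps, noise_std=0.0, seed=0):
    """Gradient steps on f(x) = 0.5 x^T A x - b^T x, with optional gradient noise."""
    rng = np.random.default_rng(seed)
    xs = [np.asarray(x0, dtype=float)]
    for _ in range(n_steps):
        grad = A @ xs[-1] - b + noise_std * rng.standard_normal(len(b))
        xs.append(xs[-1] - step * grad)
    return np.array(xs)  # shape (n_steps + 1, d)


def rna_extrapolate(iterates, lam=1e-10):
    """Combine past iterates with weights fitted to their residuals.

    Assumed formulation (a sketch, not taken verbatim from the paper):
        minimize ||sum_i c_i r_i||^2 + lam * ||c||^2  subject to sum_i c_i = 1,
    where r_i = x_{i+1} - x_i. Requires at least two distinct iterates.
    """
    X = np.asarray(iterates, dtype=float)
    R = X[1:] - X[:-1]                  # residuals, shape (k, d)
    gram = R @ R.T                      # (k, k) Gram matrix of residuals
    gram = gram / np.linalg.norm(gram, 2)  # normalise so lam is scale-free
    k = gram.shape[0]
    z = np.linalg.solve(gram + lam * np.eye(k), np.ones(k))
    c = z / z.sum()                     # enforce sum(c) == 1
    return c @ X[:-1], c                # weighted combination of x_0..x_{k-1}


# Toy problem: a 3-dimensional quadratic whose minimiser is the all-ones vector.
A = np.diag([1.0, 2.0, 4.0])
x_star = np.ones(3)
b = A @ x_star

xs = run_gradient_steps(A, b, np.zeros(3), step=0.2, n_steps=4)
x_rna, c = rna_extrapolate(xs, lam=1e-10)

err_last = np.linalg.norm(xs[-1] - x_star)
err_rna = np.linalg.norm(x_rna - x_star)
print("last iterate error:", err_last)
print("extrapolated error:", err_rna)
```

With noise_std > 0 the residuals are perturbed, and a larger lam keeps the weights from growing large and amplifying that noise; the right value depends on the problem and is left here as a tuning knob.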

Experimental results

Research questions

RQ1: Can nonlinear extrapolation techniques, effective in deterministic optimization, be successfully extended to stochastic first-order methods with noisy gradients?
RQ2: What are the convergence bounds for nonlinear acceleration under arbitrary, potentially biased perturbations in stochastic settings?
RQ3: How does the asymptotic convergence rate of the extrapolated iterates depend on the ratio between noise variance and the current iterate's accuracy?
RQ4: To what extent does nonlinear acceleration improve the practical performance of stochastic algorithms like SGD, SAGA, SVRG, and Katyusha?
RQ5: Can the extrapolation method be used to accelerate learning rate decay strategies in deep learning without sacrificing convergence?

Key findings

The nonlinear acceleration method achieves asymptotic convergence rates comparable to accelerated deterministic methods, even without prior knowledge of the strong convexity parameter.
Empirical results show significant improvements in training loss and test accuracy when applied to ResNet architectures on CIFAR-10 and CIFAR-100 datasets.
The extrapolated iterates allow for earlier learning rate decay, improving generalization and reducing training time when used as a restart strategy.
On tabular datasets such as Sonar, Madelon, Random, and Sido0, the method consistently outperforms baseline stochastic algorithms across different conditioning levels.
The acceleration is effective across multiple stochastic algorithms, including SAGA, SVRG, and Katyusha, demonstrating broad applicability.
Theoretical analysis confirms that the convergence rate depends on the ratio of noise variance to the current iterate's distance to the optimum, validating the asymptotic behavior.

This review was created by AI and reviewed by human editors.