QUICK REVIEW

[Paper Review] A Stochastic Gradient Method with an Exponential Convergence Rate for Finite Training Sets

Nicolas Le Roux, Mark Schmidt|arXiv (Cornell University)|Feb 28, 2012

Stochastic Gradient Optimization Techniques|32 references|538 citations

TL;DR

This paper proposes the Stochastic Average Gradient (SAG) method, a novel stochastic optimization algorithm that achieves linear (exponential) convergence for finite-sum problems by maintaining a memory of past gradients. Unlike standard stochastic gradient methods with sublinear convergence, SAG combines low per-iteration cost with fast convergence, outperforming both standard SG and full gradient methods in practice.

ABSTRACT

We propose a new stochastic gradient method for optimizing the sum of a finite set of smooth functions, where the sum is strongly convex. While standard stochastic gradient methods converge at sublinear rates for this problem, the proposed method incorporates a memory of previous gradient values in order to achieve a linear convergence rate. In a machine learning context, numerical experiments indicate that the new algorithm can dramatically outperform standard algorithms, both in terms of optimizing the training error and reducing the test error quickly.

Motivation & Objective

To address the limitation of standard stochastic gradient methods, which achieve only sublinear convergence for finite-sum problems.
To develop an algorithm that maintains the low iteration cost of stochastic methods while achieving the linear convergence rate of full gradient methods.
To enable faster training and test error reduction in machine learning applications by exploiting finite dataset structure.
To provide a theoretically grounded method that achieves exponential convergence while evaluating only one example's gradient per iteration and reusing a memory of past gradients.

Proposed method

The SAG method uses a memory of the most recently computed gradients for each training example, storing them in a buffer.
At each iteration, a random training example is selected, and only its gradient is recomputed; others are retrieved from memory.
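The update steps along the average of all stored gradients, scaled by the step size. Because most stored gradients were evaluated at earlier iterates, this average is a biased approximation of the current full gradient rather than an unbiased estimate.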
The method maintains a running average of gradients, ensuring convergence without recomputing all gradients at each step.
It uses a constant step size and achieves linear convergence under strong convexity and smoothness assumptions.
The algorithm is a randomized variant of the incremental aggregated gradient (IAG) method, designed for finite training sets.
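The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the ridge-regression objective, the function name `sag_least_squares`, the variable names `grad_memory` and `grad_sum`, and the conservative constant step size `1 / (16 * L)` are all assumptions made for this example.

```python
import numpy as np


def sag_least_squares(A, b, lam=0.1, n_passes=50, seed=0):
    """Illustrative SAG sketch for l2-regularized least squares.

    Minimizes g(x) = (1/n) * sum_i f_i(x), with the assumed per-example loss
    f_i(x) = 0.5 * (a_i @ x - b_i)**2 + 0.5 * lam * ||x||^2.
    """
    rng = np.random.default_rng(seed)
    n, p = A.shape

    # Each f_i has a gradient with Lipschitz constant ||a_i||^2 + lam; use the max.
    L = np.max(np.sum(A ** 2, axis=1)) + lam
    step = 1.0 / (16.0 * L)  # constant step size (an assumption of this sketch)

    x = np.zeros(p)
    grad_memory = np.zeros((n, p))  # most recent gradient computed for each example
    grad_sum = np.zeros(p)          # running sum of the stored gradients

    for _ in range(n_passes * n):
        i = rng.integers(n)                         # select one example uniformly at random
        g_new = (A[i] @ x - b[i]) * A[i] + lam * x  # recompute only this example's gradient
        grad_sum += g_new - grad_memory[i]          # replace its old entry in the running sum
        grad_memory[i] = g_new
        x -= step * grad_sum / n                    # step along the average of stored gradients
    return x
```

Keeping `grad_sum` up to date means each iteration touches one example and costs O(p) work, instead of re-summing all n stored gradients. The price is the O(np) memory used by `grad_memory`.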

Experimental results

Research questions

RQ1: Can a stochastic optimization method achieve linear convergence for finite-sum problems while preserving low per-iteration cost?
RQ2: How does maintaining a memory of past gradients affect convergence speed compared to standard stochastic gradient methods?
RQ3: What is the theoretical convergence rate of a method that combines stochastic updates with gradient memory in finite-sum optimization?
RQ4: Does the proposed method outperform standard stochastic and full gradient methods in terms of training and test error reduction?
RQ5: Under what conditions does the SAG method achieve faster convergence than coordinate descent or accelerated gradient methods?

Key findings

The SAG method achieves a linear (exponential) convergence rate, unlike standard stochastic gradient methods that converge sublinearly.
SAG converges faster than standard stochastic gradient methods, whose sublinear rates are optimal when an algorithm can access the objective only through unbiased gradient estimates; SAG gets past this limit by exploiting the finite-sum structure.
Numerical experiments show SAG dramatically outperforms standard algorithms in reducing both training and test error.
For problems with $ n \gg p $ (many more training examples than variables), SAG can converge faster than coordinate descent methods, especially when $ m_{\sigma} \gg m'_{\sigma} $.
The method achieves a convergence rate of $ \exp(-1/64) $ per $ n $ iterations under favorable conditions, outperforming coordinate descent when $ n $ is large.
SAG achieves faster convergence than full gradient methods in terms of effective passes through the data, due to its low-cost iterations and fast convergence.
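To make the "per $ n $ iterations" figure concrete, suppose, purely as an illustrative assumption (the exact constants are not given in this review), that each SAG iteration contracts the expected error by a factor $ 1 - c/n $ for some constant $ c $. One effective pass of $ n $ iterations then gives:

$$
\left(1 - \frac{c}{n}\right)^{n} \le \exp(-c),
\qquad\text{so } c = \tfrac{1}{64} \text{ yields a per-pass factor of at most } \exp(-1/64) \approx 0.984 .
$$

A single full gradient step also costs $ n $ gradient evaluations, which is why effective passes through the data are the natural unit for comparing the two kinds of method.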

This review was created by AI and reviewed by human editors.