QUICK REVIEW

[Paper Review] A Stochastic Gradient Method with an Exponential Convergence Rate for Finite Training Sets

Nicolas Le Roux, Mark Schmidt | arXiv (Cornell University) | Feb 28, 2012
Stochastic Gradient Optimization Techniques | 32 references | 538 citations
TL;DR

This paper proposes the Stochastic Average Gradient (SAG) method, a novel stochastic optimization algorithm that achieves linear (exponential) convergence for finite-sum problems by maintaining a memory of past gradients. Unlike standard stochastic gradient methods with sublinear convergence, SAG combines low per-iteration cost with fast convergence, outperforming both standard SG and full gradient methods in practice.

ABSTRACT

We propose a new stochastic gradient method for optimizing the sum of a finite set of smooth functions, where the sum is strongly convex. While standard stochastic gradient methods converge at sublinear rates for this problem, the proposed method incorporates a memory of previous gradient values in order to achieve a linear convergence rate. In a machine learning context, numerical experiments indicate that the new algorithm can dramatically outperform standard algorithms, both in terms of optimizing the training error and reducing the test error quickly.

Motivation & Objective

  • To address the limitation of standard stochastic gradient methods, which achieve only sublinear convergence for finite-sum problems.
  • To develop an algorithm that maintains the low iteration cost of stochastic methods while achieving the linear convergence rate of full gradient methods.
  • To enable faster training and test error reduction in machine learning applications by exploiting finite dataset structure.
  • To provide a theoretically grounded method that achieves exponential convergence by combining one freshly computed stochastic gradient per iteration with a memory of past gradients.

Proposed method

  • The SAG method uses a memory of the most recently computed gradients for each training example, storing them in a buffer.
  • At each iteration, a random training example is selected, and only its gradient is recomputed; others are retrieved from memory.
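  • The update steps along the average of all stored gradients, scaled by a constant step size. Because most stored gradients were computed at earlier iterates, this average is a biased approximation of the full gradient at the current point.
  • The average is maintained incrementally: only the selected example's contribution is swapped out, so each iteration costs one gradient evaluation regardless of $ n $, at the price of storing one gradient per training example.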
  • It uses a constant step size and achieves linear convergence under strong convexity and smoothness assumptions.
  • The algorithm is a randomized variant of the incremental aggregated gradient (IAG) method, designed for finite training sets.
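
The steps above can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the ridge-regression objective, the synthetic data, the function name `sag_ridge`, and the conservative step size `1 / (16 * L)` are all assumptions made for this example.

```python
import numpy as np

def sag_ridge(X, y, lam, step, n_passes, seed=0):
    """Minimize (1/n) * sum_i [0.5*(x_i @ w - y_i)**2 + 0.5*lam*||w||^2] with SAG."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    w = np.zeros(p)
    grad_memory = np.zeros((n, p))  # most recent gradient computed for each example
    grad_sum = np.zeros(p)          # sum of all rows of grad_memory
    for _ in range(n_passes * n):
        i = rng.integers(n)                         # pick one example uniformly at random
        g_new = (X[i] @ w - y[i]) * X[i] + lam * w  # gradient of f_i at the current w
        grad_sum += g_new - grad_memory[i]          # swap the old stored gradient for the new one
        grad_memory[i] = g_new
        w -= (step / n) * grad_sum                  # step along the average stored gradient
    return w

# Hypothetical synthetic problem, used only to exercise the sketch.
rng = np.random.default_rng(1)
n, p, lam = 100, 5, 0.1
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + 0.1 * rng.standard_normal(n)

# Smoothness bound valid for every f_i; the factor 16 is an assumed, conservative choice.
L = np.max(np.sum(X**2, axis=1)) + lam
w_sag = sag_ridge(X, y, lam, step=1.0 / (16.0 * L), n_passes=200)

# Closed-form minimizer of the same objective, for comparison.
w_star = np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ y / n)
print(np.linalg.norm(w_sag - w_star))
```

Each iteration touches one example and does O(p) work, while the memory holds one gradient per example (an n-by-p array in this sketch).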

Experimental results

Research questions

  • RQ1: Can a stochastic optimization method achieve linear convergence for finite-sum problems while preserving low per-iteration cost?
  • RQ2: How does maintaining a memory of past gradients affect convergence speed compared to standard stochastic gradient methods?
  • RQ3: What is the theoretical convergence rate of a method that combines stochastic updates with gradient memory in finite-sum optimization?
  • RQ4: Does the proposed method outperform standard stochastic and full gradient methods in terms of training and test error reduction?
  • RQ5: Under what conditions does the SAG method achieve faster convergence than coordinate descent or accelerated gradient methods?

Key findings

  • The SAG method achieves a linear (exponential) convergence rate, unlike standard stochastic gradient methods that converge sublinearly.
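  • SAG converges faster than standard stochastic gradient methods even though their sublinear rates are optimal when an algorithm can only query unbiased gradient estimates. SAG sidesteps that lower bound by exploiting the finite-sum structure of the training objective.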
  • Numerical experiments show SAG dramatically outperforms standard algorithms in reducing both training and test error.
  • For problems with $ n \gg p $, SAG can converge faster than coordinate descent methods, especially when $ m_{\sigma} \gg m'_{\sigma} $.
  • The method achieves a convergence rate of $ \exp(-1/64) $ per $ n $ iterations under favorable conditions, outperforming coordinate descent when $ n $ is large.
  • SAG achieves faster convergence than full gradient methods in terms of effective passes through the data, due to its low-cost iterations and fast convergence.
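
To make the contrast between sublinear and linear convergence concrete, the following sketch runs SAG and a plain stochastic gradient (SG) method for the same number of passes on a small synthetic ridge problem and reports the final optimality gap of each. The problem, the helper name `final_gap`, the SAG step `1 / (16 * L)`, and the decaying SG step `1 / (L + lam * k)` are assumptions chosen for illustration. They are not the paper's experimental setup, and the outcome on one toy problem says nothing about other datasets.

```python
import numpy as np

def final_gap(method, X, y, lam, n_passes, seed=0):
    """Run 'sag' or 'sgd' on a ridge objective and return f(w) - f(w_star)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    L = np.max(np.sum(X**2, axis=1)) + lam  # smoothness bound for every f_i

    def f(w):
        return 0.5 * np.mean((X @ w - y) ** 2) + 0.5 * lam * (w @ w)

    # Closed-form minimizer, used only to measure the optimality gap.
    w_star = np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ y / n)

    w = np.zeros(p)
    memory = np.zeros((n, p))  # stored gradients (used by SAG only)
    total = np.zeros(p)        # sum of stored gradients (used by SAG only)
    for k in range(n_passes * n):
        i = rng.integers(n)
        g = (X[i] @ w - y[i]) * X[i] + lam * w
        if method == "sag":
            total += g - memory[i]
            memory[i] = g
            w -= total / (16.0 * L * n)  # constant step on the averaged memory
        else:
            w -= g / (L + lam * k)       # decaying step on the single fresh gradient
    return f(w) - f(w_star)

# Hypothetical synthetic data with noisy labels.
rng = np.random.default_rng(1)
n, p, lam = 100, 5, 0.1
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + 0.5 * rng.standard_normal(n)

gap_sag = final_gap("sag", X, y, lam, n_passes=200)
gap_sgd = final_gap("sgd", X, y, lam, n_passes=200)
print(f"SAG gap: {gap_sag:.2e}  SG gap: {gap_sgd:.2e}")
```

Both methods evaluate exactly one example gradient per iteration, so the comparison is at equal gradient-evaluation cost.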

This review was created by AI and reviewed by human editors.