QUICK REVIEW

[Paper Review] Big Batch SGD: Automated Inference using Adaptive Batch Sizes

Soham De, Abhay Kumar Yadav | arXiv (Cornell University) | Oct 18, 2016

Stochastic Gradient Optimization Techniques | 23 references | 39 citations

TL;DR

This paper proposes Big Batch SGD, an adaptive optimization method that dynamically increases batch size over time to maintain a constant signal-to-noise ratio in gradient estimates. By stabilizing gradient quality, the method enables constant or automatically adjusted step sizes, eliminating the need for manual learning rate scheduling and achieving performance comparable to tuned SGD with minimal hyperparameter tuning.

ABSTRACT

Classical stochastic gradient methods for optimization rely on noisy gradient approximations that become progressively less accurate as iterates approach a solution. The large noise and small signal in the resulting gradients makes it difficult to use them for adaptive stepsize selection and automatic stopping. We propose alternative "big batch" SGD schemes that adaptively grow the batch size over time to maintain a nearly constant signal-to-noise ratio in the gradient approximation. The resulting methods have similar convergence rates to classical SGD, and do not require convexity of the objective. The high fidelity gradients enable automated learning rate selection and do not require stepsize decay. Big batch methods are thus easily automated and can run with little or no oversight.

Motivation & Objective

To address the challenge of noisy gradient estimates in classical stochastic gradient descent (SGD), especially as iterates approach convergence.
To eliminate the need for manual learning rate decay schedules in SGD by maintaining a stable signal-to-noise ratio through adaptive batch sizing.
To enable fully automated optimization with minimal user oversight by leveraging high-fidelity gradients from growing batches.
To improve convergence and generalization in non-convex problems such as deep neural networks without requiring expert-tuned hyperparameters.

Proposed method

Adaptively increases the batch size over time to maintain a nearly constant signal-to-noise ratio in stochastic gradient estimates.
Uses a constant stepsize or automated backtracking line search, avoiding the need for vanishing stepsize schedules.
Employs a Barzilai-Borwein curvature-based adaptive stepsize method that leverages low-variance gradients for faster convergence.
Maintains convergence guarantees without requiring convexity of the objective function.
Enables automated stopping criteria in problems satisfying the Polyak-Łojasiewicz inequality, because the approximate gradients vanish near the solution.
Amortizes computational overhead of higher-order methods (e.g., L-BFGS) by using more accurate, large-batch gradients.
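
The batch-growth and backtracking steps listed above can be sketched on a toy least-squares problem. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the doubling rule for growing the batch, the threshold `theta=0.5`, and the Armijo constant `0.1` are all choices made here for illustration. The sketch tests whether the estimated variance of the batch-mean gradient is small relative to the squared gradient norm, and grows the batch when it is not.

```python
import numpy as np

def batch_stats(A, b, x, idx):
    """Mean gradient on batch idx plus an estimate of its variance."""
    r = A[idx] @ x - b[idx]            # residuals, shape (B,)
    G = r[:, None] * A[idx]            # per-sample gradients, shape (B, d)
    g = G.mean(axis=0)
    # Unbiased per-sample variance (summed over coordinates),
    # divided by B to estimate the variance of the batch mean.
    noise = ((G - g) ** 2).sum() / (len(idx) - 1) / len(idx)
    return g, noise

def batch_loss(A, b, x, idx):
    r = A[idx] @ x - b[idx]
    return 0.5 * np.mean(r ** 2)

def big_batch_sgd(A, b, theta=0.5, batch0=8, iters=200, seed=0):
    """Toy adaptive-batch SGD; all defaults are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    B = batch0
    sizes = []
    for _ in range(iters):
        idx = rng.choice(n, size=B, replace=False)
        g, noise = batch_stats(A, b, x, idx)
        # Grow the batch until noise <= theta^2 * signal (or data runs out).
        while noise > theta ** 2 * (g @ g) and B < n:
            B = min(2 * B, n)          # assumed growth rule: doubling
            idx = rng.choice(n, size=B, replace=False)
            g, noise = batch_stats(A, b, x, idx)
        # Backtracking (Armijo) line search on the same batch.
        step, f0 = 1.0, batch_loss(A, b, x, idx)
        while (batch_loss(A, b, x - step * g, idx) > f0 - 0.1 * step * (g @ g)
               and step > 1e-10):
            step *= 0.5
        x = x - step * g
        sizes.append(B)
    return x, sizes
```

Calling `big_batch_sgd(A, b)` on noisy linear data returns the final iterate together with the batch size used at each iteration, which makes the growth of the batch easy to inspect. No stepsize schedule is supplied; the line search picks the step on each batch.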

Experimental results

Research questions

RQ1: Can adaptive batch size growth stabilize gradient estimates and enable constant or automatically adjusted step sizes in SGD?
RQ2: Does maintaining a constant signal-to-noise ratio in gradients lead to faster convergence and better generalization in non-convex optimization?
RQ3: Can big batch SGD eliminate the need for manual learning rate tuning while matching or exceeding the performance of tuned SGD?
RQ4: How does big batch SGD compare to adaptive methods like Adadelta and L-BFGS in deep learning benchmarks?
RQ5: Can high-fidelity gradients from large batches support automated stopping criteria in optimization?

Key findings

Big Batch SGD with backtracking line search outperforms both fixed-stepsize SGD and Adadelta on CIFAR-10, SVHN, and MNIST, achieving comparable or better test accuracy without hyperparameter tuning.
The method achieves performance on par with finely tuned SGD, eliminating the need for extensive grid searches over learning rate schedules.
Big Batch Adadelta outperforms standard Adadelta on the larger datasets (CIFAR-10 and SVHN), while the two are indistinguishable on MNIST.
The Barzilai-Borwein adaptive stepsize method based on big batches converges faster than backtracking line search on convex problems.
Big batch methods enable automated stopping in Polyak-Łojasiewicz problems due to vanishing gradient approximations near convergence.
The approach is highly efficient in distributed settings due to higher computation-to-communication ratios from larger batches.
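
For readers unfamiliar with the Barzilai-Borwein stepsize mentioned in the findings above, the classic deterministic formula can be sketched as follows. This is an assumption-laden illustration, not the paper's big-batch variant: it uses exact gradients, the function name `bb_gradient_descent` and its defaults are hypothetical, and the fallback to `step0` is a safeguard chosen here. The stepsize is the ratio of the squared change in the iterate to the inner product of that change with the change in the gradient, which acts as an inverse curvature estimate.

```python
import numpy as np

def bb_gradient_descent(grad, x0, step0=1e-3, iters=500, tol=1e-10):
    """Gradient descent with the classic Barzilai-Borwein stepsize."""
    x_prev = np.asarray(x0, dtype=float)
    g_prev = grad(x_prev)
    x = x_prev - step0 * g_prev        # first step uses a small fixed stepsize
    for _ in range(iters):
        g = grad(x)
        if np.linalg.norm(g) < tol:    # stop once the gradient is negligible
            break
        dx, dg = x - x_prev, g - g_prev
        curv = dx @ dg                 # curvature along the last step
        # BB stepsize <dx, dx> / <dx, dg>; fall back if curvature is not positive.
        step = (dx @ dx) / curv if curv > 0 else step0
        x_prev, g_prev = x, g
        x = x - step * g
    return x
```

On a strictly convex quadratic the curvature term is always positive, so the fallback is never triggered; it only matters for non-convex or noisy gradients, where a method would need further safeguards.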

This review was created by AI and reviewed by human editors.