QUICK REVIEW

[Paper Review] Scalable Second Order Optimization for Deep Learning

Rohan Anil, Vineet Gupta|arXiv (Cornell University)|Feb 20, 2020

Stochastic Gradient Optimization Techniques|49 references|29 citations

TL;DR

This paper presents a scalable, hardware-optimized implementation of a second-order adaptive method (a variant of full-matrix Adagrad) that uses factored preconditioners to accelerate deep learning training. By leveraging CPU-accelerator pipelining, efficient matrix root computation, and architectural extensions, it achieves up to 47% wall-clock time reduction on large models like Transformers and BERT, significantly outperforming first-order methods like Adam in both convergence speed and training efficiency.

ABSTRACT

Optimization in machine learning, both theoretical and applied, is presently dominated by first-order gradient methods such as stochastic gradient descent. Second-order optimization methods, that involve second derivatives and/or second order statistics of the data, are far less prevalent despite strong theoretical properties, due to their prohibitive computation, memory and communication costs. In an attempt to bridge this gap between theoretical and practical optimization, we present a scalable implementation of a second-order preconditioned method (concretely, a variant of full-matrix Adagrad), that along with several critical algorithmic and numerical improvements, provides significant convergence and wall-clock time improvements compared to conventional first-order methods on state-of-the-art deep models. Our novel design effectively utilizes the prevalent heterogeneous hardware architecture for training deep models, consisting of a multicore CPU coupled with multiple accelerator units. We demonstrate superior performance compared to state-of-the-art on very large learning tasks such as machine translation with Transformers, language modeling with BERT, click-through rate prediction on Criteo, and image classification on ImageNet with ResNet-50.

Motivation & Objective

Address the impracticality of second-order methods in large-scale deep learning due to high memory, computation, and communication costs.
Bridge the gap between theoretical advantages of second-order optimization and practical deployment at scale.
Enable efficient, scalable second-order training on modern heterogeneous hardware (CPU + accelerators) for state-of-the-art models.
Overcome numerical and infrastructural challenges in implementing full-matrix preconditioning, particularly for large layers like embeddings.
Demonstrate significant convergence and wall-clock time improvements over first-order baselines like Adam and LAMB on diverse large-scale tasks.

Proposed method

Design a pipelined optimization scheme that offloads preconditioner computation to the CPU while accelerators handle forward/backward passes.
Replace expensive spectral decompositions (e.g., SVD) with an efficient, numerically stable iterative method for computing matrix roots (e.g., $L^{-1/4}$, $R^{-1/4}$).
Extend the Shampoo algorithm to support large, high-rank layers such as embedding matrices by modifying preconditioner structure and update rules.
Scale each layer's update with a per-layer step size derived from AdaGrad, which stabilizes training and permits larger learning rates than AdaGrad alone.
Implement cross-replica sharding of preconditioning statistics to reduce redundant computation across TPU cores.
Integrate preconditioned gradient computation into the training loop with minimal latency, leveraging asynchronous CPU offloading.
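
The matrix-root and preconditioning steps listed above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `inverse_pth_root`, `new_state`, `shampoo_step`, and `ridge` are hypothetical, the Frobenius norm is used here as a cheap upper bound on the spectral norm when scaling the iteration, and the AdaGrad-derived per-layer step size is assumed to mean rescaling the Shampoo direction to the norm of the AdaGrad direction. CPU offloading and cross-replica sharding are not shown.

```python
import numpy as np

def inverse_pth_root(A, p, tol=1e-6, max_iters=100):
    """Approximate A^(-1/p) for a symmetric positive-definite A with a
    coupled Newton iteration, avoiding any spectral decomposition."""
    I = np.eye(A.shape[0])
    alpha = -1.0 / p
    # Assumption: the Frobenius norm upper-bounds the spectral norm, so every
    # eigenvalue of M starts in (0, (p + 1) / 2], where the iteration converges.
    z = (1.0 + p) / (2.0 * np.linalg.norm(A))
    M = z * A                      # invariant: M == X^p @ A
    X = (z ** (1.0 / p)) * I
    for _ in range(max_iters):
        if np.max(np.abs(M - I)) < tol:
            break
        T = (1.0 - alpha) * I + alpha * M
        X = X @ T
        M = np.linalg.matrix_power(T, p) @ M
    return X

def new_state(m, n):
    """Per-layer statistics for an m x n weight matrix (hypothetical layout)."""
    return {"L": np.zeros((m, m)), "R": np.zeros((n, n)),
            "adagrad": np.zeros((m, n))}

def shampoo_step(W, G, state, lr=0.1, ridge=1e-6):
    """One Shampoo-style update for a single m x n layer."""
    m, n = G.shape
    # Accumulate the left/right second-moment statistics and AdaGrad's sums.
    state["L"] += G @ G.T
    state["R"] += G.T @ G
    state["adagrad"] += G * G
    # A small ridge keeps the statistics positive definite before the roots.
    L_root = inverse_pth_root(state["L"] + ridge * np.eye(m), 4)
    R_root = inverse_pth_root(state["R"] + ridge * np.eye(n), 4)
    shampoo_dir = L_root @ G @ R_root
    # Assumed form of the AdaGrad-derived step size: match the update norms.
    adagrad_dir = G / (np.sqrt(state["adagrad"]) + 1e-12)
    scale = np.linalg.norm(adagrad_dir) / (np.linalg.norm(shampoo_dir) + 1e-12)
    return W - lr * scale * shampoo_dir

# Usage: one step on a toy 3 x 3 layer with a random gradient.
rng = np.random.default_rng(0)
W = rng.standard_normal((3, 3))
G = rng.standard_normal((3, 3))
state = new_state(3, 3)
W = shampoo_step(W, G, state)
```

In a pipelined setting, the two `inverse_pth_root` calls are the part that would run off the accelerator and be refreshed only every few steps, with the training loop reusing the most recent roots in between.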

Experimental results

Research questions

RQ1: Can second-order adaptive methods be made practical at scale for modern deep learning workloads?
RQ2: How can the computational and memory overhead of full-matrix preconditioning be reduced without sacrificing convergence benefits?
RQ3: What architectural and algorithmic optimizations are required to make second-order methods efficient on CPU-accelerator systems?
RQ4: To what extent can second-order methods reduce training time and steps compared to first-order baselines like Adam and LAMB?
RQ5: How do numerical stability and scalability trade-offs affect the design of large-scale second-order optimizers?

Key findings

On the WMT’14 English-to-French translation task, Shampoo reduced training time by 45% (12.0 hrs to 6.7 hrs) for the standard Transformer and 37% (47.0 hrs to 29.5 hrs) for the larger Transformer-Big model.
For BERT-Large language modeling at 32K batch size, Shampoo achieved higher masked-LM accuracy in 16% fewer steps and reduced wall-clock time by 4% (3.8 to 3.65 hours), despite no hyperparameter tuning.
On CIFAR-10 with ResNet-50, Shampoo reached 93.45% accuracy in 143 epochs versus 300 for the baseline, reducing total training time by 42% (1428s to 827s).
On the Criteo 1TB click-through rate prediction task, Shampoo reduced training time from 13 minutes to 8.2 minutes, demonstrating strong efficiency on large-scale sparse models.
The method achieved significant speedups not only in step count but also in actual wall-clock time, even on models where preconditioner computation was not yet fully optimized.
The current implementation shows a 14% increase in step time for BERT-Large, indicating that further optimization (e.g., via cross-replica sharding) could yield even greater gains.
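
The wall-clock percentages in the findings above are relative reductions, (baseline - improved) / baseline. As a small sanity check, the sketch below recomputes them from the times quoted in this review; the helper name `percent_reduction` and the dictionary labels are illustrative only.

```python
def percent_reduction(baseline, improved):
    """Relative reduction of `improved` versus `baseline`, in percent."""
    return 100.0 * (baseline - improved) / baseline

# (baseline, Shampoo) training times exactly as quoted in the findings above.
reported = {
    "Transformer (hours)": (12.0, 6.7),
    "Transformer-Big (hours)": (47.0, 29.5),
    "BERT-Large (hours)": (3.8, 3.65),
    "ResNet-50 (seconds)": (1428.0, 827.0),
    "Criteo (minutes)": (13.0, 8.2),
}

for task, (baseline, improved) in reported.items():
    print(f"{task}: {percent_reduction(baseline, improved):.1f}% less time")
```

From the quoted times this gives roughly 44%, 37%, 4%, 42%, and 37%. The first value lands slightly under the quoted 45%, presumably because the hours shown are rounded.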

This review was created by AI and reviewed by human editors.