QUICK REVIEW

[Paper Review] Lookahead Optimizer: k steps forward, 1 step back

Michael R. Zhang, James Lucas | arXiv (Cornell University) | Jul 19, 2019

Stochastic Gradient Optimization Techniques | References: 44 | Citations: 381

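One-line summary

The Lookahead optimizer wraps any standard inner optimizer: after every k fast-weight updates it moves the slow weights toward the fast weights, which reduces variance and improves convergence with minimal overhead.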

ABSTRACT

The vast majority of successful deep neural networks are trained using variants of stochastic gradient descent (SGD) algorithms. Recent attempts to improve SGD can be broadly categorized into two approaches: (1) adaptive learning rate schemes, such as AdaGrad and Adam, and (2) accelerated schemes, such as heavy-ball and Nesterov momentum. In this paper, we propose a new optimization algorithm, Lookahead, that is orthogonal to these previous approaches and iteratively updates two sets of weights. Intuitively, the algorithm chooses a search direction by looking ahead at the sequence of fast weights generated by another optimizer. We show that Lookahead improves the learning stability and lowers the variance of its inner optimizer with negligible computation and memory cost. We empirically demonstrate Lookahead can significantly improve the performance of SGD and Adam, even with their default hyperparameter settings on ImageNet, CIFAR-10/100, neural machine translation, and Penn Treebank.

Motivation and objectives

Introduce Lookahead, a two-tier optimization method that integrates with existing optimizers.
Show that Lookahead reduces variance and improves stability in neural network training.
Demonstrate empirical gains across diverse tasks with minimal hyperparameter tuning.
Analyze convergence properties and provide guidelines for selecting the slow weights step size α.

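Proposed method

Maintain slow weights φ and fast weights θ, and synchronize them every k inner updates.
Update the fast weights θ on mini-batches using any standard optimizer A.
After k inner updates, move the slow weights φ toward θ via φ ← φ + α(θ − φ), then reset θ to φ.
Provide either an adaptively chosen α or a fixed α, with a justification based on a quadratic approximation.
Show that the slow weights follow an exponential moving average (EMA) of the final fast weights of each inner loop.
Discuss the cost relative to the inner optimizer: the number of operations is O((k+1)/k) times that of the inner optimizer, and one additional copy of the parameters is required.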
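The update rule above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: plain gradient descent stands in for the inner optimizer A, and the names `sgd_step`, `lookahead`, `quad_grad`, the toy quadratic objective, and the default values of `k`, `alpha`, and `lr` are all hypothetical choices made for this example.

```python
from typing import Callable, List

Vector = List[float]


def sgd_step(theta: Vector, grad: Vector, lr: float) -> Vector:
    """One plain gradient-descent step; stands in for the inner optimizer A."""
    return [t - lr * g for t, g in zip(theta, grad)]


def lookahead(
    grad_fn: Callable[[Vector], Vector],
    phi0: Vector,
    k: int = 5,
    alpha: float = 0.5,
    lr: float = 0.1,
    outer_steps: int = 50,
) -> Vector:
    """Return the slow weights phi after `outer_steps` Lookahead iterations."""
    phi = list(phi0)
    for _ in range(outer_steps):
        # Start the inner loop from the slow weights (equivalent to
        # resetting theta to phi after the previous slow update).
        theta = list(phi)
        for _ in range(k):
            # k fast-weight updates with the inner optimizer.
            theta = sgd_step(theta, grad_fn(theta), lr)
        # Slow-weight update: phi <- phi + alpha * (theta - phi).
        phi = [p + alpha * (t - p) for p, t in zip(phi, theta)]
    return phi


def quad_grad(x: Vector) -> Vector:
    """Gradient of the toy objective f(x) = 0.5 * (x0^2 + 10 * x1^2)."""
    return [x[0], 10.0 * x[1]]


final_phi = lookahead(quad_grad, phi0=[1.0, 1.0])
print(final_phi)  # both coordinates end up close to the minimum at the origin
```

Only `phi` is stored in addition to the inner optimizer's own parameters, which corresponds to the single extra parameter copy mentioned above. The slow update runs once per k inner steps, so for k = 5 the factor (k+1)/k is 1.2.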

Experimental results

Research questions

RQ1: Does Lookahead improve convergence speed and stability when wrapped around standard optimizers like SGD or Adam?
RQ2: How does the Lookahead slow weights step size α influence convergence and stability, both in theory and practice?
RQ3: Is Lookahead robust to hyperparameter choices such as k and α across different architectures and tasks?
RQ4: What are the convergence properties and variance reductions Lookahead provides in noisy and deterministic quadratic models?

Key findings

Optimizer	CIFAR-10	CIFAR-100
SGD	95.23 ± 0.19	78.24 ± 0.18
Polyak	95.26 ± 0.04	77.99 ± 0.42
Adam	94.84 ± 0.16	76.88 ± 0.39
Lookahead	95.27 ± 0.06	78.34 ± 0.05

Lookahead yields faster convergence and often better generalization when combined with SGD or Adam across CIFAR, ImageNet, language models, and machine translation.
The slow weights update acts as an EMA of final fast weights, reducing variance and improving stability.
Lookahead is robust to inner optimizer choices and hyperparameters, with fixed α performing well across tasks.
In a noisy quadratic model, Lookahead’s steady-state variance is strictly lower than SGD’s for the same learning rate, given appropriate settings.
Deterministic quadratic analysis shows Lookahead can improve convergence rates in under-damped regimes.
Empirical results show Lookahead achieving competitive or superior final accuracies with minimal hyperparameter tuning.
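The variance claim for the noisy quadratic model can be checked with a small simulation. The sketch below assumes a one-dimensional version of the model, L(x) = 0.5 (x − c)² with c drawn from N(0, 1) at every step, and compares the empirical variance of plain SGD iterates with that of the Lookahead slow weights at the same learning rate. The function names, the seed, and the settings (learning rate 0.1, k = 5, α = 0.5) are illustrative assumptions, not values taken from the paper.

```python
import random


def simulate_sgd(lr: float, steps: int, rng: random.Random) -> float:
    """Empirical variance of SGD iterates on L(x) = 0.5 * (x - c)^2, c ~ N(0, 1)."""
    x = 0.0
    total = 0.0
    for _ in range(steps):
        c = rng.gauss(0.0, 1.0)   # noisy minimum seen at this step
        x -= lr * (x - c)         # the gradient of 0.5 * (x - c)^2 is (x - c)
        total += x * x            # the iterates have mean 0, so E[x^2] is the variance
    return total / steps


def simulate_lookahead(
    lr: float, k: int, alpha: float, steps: int, rng: random.Random
) -> float:
    """Empirical variance of the Lookahead slow weights on the same problem."""
    phi = 0.0
    total = 0.0
    outer_steps = steps // k      # same total number of gradient evaluations
    for _ in range(outer_steps):
        theta = phi               # fast weights start from the slow weights
        for _ in range(k):
            c = rng.gauss(0.0, 1.0)
            theta -= lr * (theta - c)
        phi += alpha * (theta - phi)
        total += phi * phi
    return total / outer_steps


rng = random.Random(0)
var_sgd = simulate_sgd(lr=0.1, steps=200_000, rng=rng)
var_la = simulate_lookahead(lr=0.1, k=5, alpha=0.5, steps=200_000, rng=rng)
print(f"SGD variance:       {var_sgd:.4f}")
print(f"Lookahead variance: {var_la:.4f}")
```

For this 1D setup, working the linear recursions by hand gives a steady-state variance of about 0.053 for SGD and about 0.023 for the slow weights, so the simulated values should land near those figures. The exact numbers depend on the assumed settings and say nothing about other learning rates or higher-dimensional problems.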


This review was generated by AI and checked by a human editor.