QUICK REVIEW

[Paper Review] Lookahead Optimizer: k steps forward, 1 step back

Michael R. Zhang, James Lucas|arXiv (Cornell University)|Jul 19, 2019

Stochastic Gradient Optimization Techniques | References: 44 | Cited by: 381

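One-Sentence Summary

The Lookahead optimizer updates the fast weights for k steps, then moves the slow weights once toward the fast weights, which reduces variance and improves convergence at almost no extra cost.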

ABSTRACT

The vast majority of successful deep neural networks are trained using variants of stochastic gradient descent (SGD) algorithms. Recent attempts to improve SGD can be broadly categorized into two approaches: (1) adaptive learning rate schemes, such as AdaGrad and Adam, and (2) accelerated schemes, such as heavy-ball and Nesterov momentum. In this paper, we propose a new optimization algorithm, Lookahead, that is orthogonal to these previous approaches and iteratively updates two sets of weights. Intuitively, the algorithm chooses a search direction by looking ahead at the sequence of fast weights generated by another optimizer. We show that Lookahead improves the learning stability and lowers the variance of its inner optimizer with negligible computation and memory cost. We empirically demonstrate Lookahead can significantly improve the performance of SGD and Adam, even with their default hyperparameter settings on ImageNet, CIFAR-10/100, neural machine translation, and Penn Treebank.

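Motivation and Objectives

Introduce Lookahead, a two-level optimization method that can wrap existing optimizers.
Show that Lookahead reduces variance and improves the stability of neural network training.
Demonstrate empirical gains across diverse tasks without extensive hyperparameter tuning.
Analyze convergence properties and give guidance on choosing the slow-weight step size α.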

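Proposed Method

Maintain slow weights φ and fast weights θ, synchronizing them after every k inner updates.
Update the fast weights θ on mini-batches with any standard optimizer A.
After k inner updates, move the slow weights toward θ via φ ← φ + α(θ − φ), then reset θ to φ.
Offer either a principled adaptive choice of α or a fixed α, justified by a quadratic approximation.
Show that the slow weights are an exponential moving average (EMA) of the final fast weights from each inner loop.
Note that the operation count is O((k+1)/k) times that of the inner optimizer, and that one extra copy of the parameters is required.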
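
The steps listed above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: plain gradient descent stands in for the inner optimizer A, and the names `lookahead`, `sgd_step`, `quadratic_grad`, the toy quadratic objective, and the default values of `k`, `alpha`, and `lr` are all assumptions chosen for the example.

```python
from typing import Callable, List

Vector = List[float]


def sgd_step(theta: Vector, grad: Vector, lr: float) -> Vector:
    """One plain gradient step; a stand-in for the inner optimizer A (assumption)."""
    return [t - lr * g for t, g in zip(theta, grad)]


def lookahead(
    grad_fn: Callable[[Vector], Vector],
    phi0: Vector,
    k: int = 5,          # inner steps per synchronization (illustrative default)
    alpha: float = 0.5,  # slow-weight step size (illustrative default)
    lr: float = 0.1,     # inner-optimizer learning rate (illustrative default)
    outer_steps: int = 100,
) -> Vector:
    """Return the slow weights phi after `outer_steps` synchronizations."""
    phi = list(phi0)
    for _ in range(outer_steps):
        theta = list(phi)  # reset fast weights to the current slow weights
        for _ in range(k):
            # k fast-weight updates with the inner optimizer
            theta = sgd_step(theta, grad_fn(theta), lr)
        # slow-weight update: phi <- phi + alpha * (theta - phi)
        phi = [p + alpha * (t - p) for p, t in zip(phi, theta)]
    return phi


def quadratic_grad(x: Vector) -> Vector:
    """Gradient of the toy objective 0.5 * (x0**2 + 10 * x1**2) (hypothetical)."""
    return [x[0], 10.0 * x[1]]


result = lookahead(quadratic_grad, [1.0, 1.0])
```

Here `result` holds the slow weights after 100 synchronizations on the toy problem. Setting `alpha=1.0` makes the slow weights simply copy the fast weights, so the loop collapses to the inner optimizer run for `k * outer_steps` steps, which is a convenient sanity check.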

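Experimental Results

Research Questions

RQ1: Does Lookahead improve convergence speed and stability when wrapped around a standard optimizer such as SGD or Adam?
RQ2: How does the slow-weight step size α affect convergence and stability, in theory and in practice?
RQ3: Is Lookahead robust to hyperparameter choices (such as k and α) across architectures and tasks?
RQ4: What convergence behavior and how much variance reduction does Lookahead provide in noisy and deterministic quadratic models?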

Key Findings

Optimizer	CIFAR-10	CIFAR-100
SGD	95.23 ± 0.19	78.24 ± 0.18
Polyak	95.26 ± 0.04	77.99 ± 0.42
Adam	94.84 ± 0.16	76.88 ± 0.39
Lookahead	95.27 ± 0.06	78.34 ± 0.05

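Across CIFAR, ImageNet, language modeling, and machine translation, Lookahead combined with SGD or Adam converges faster and usually generalizes better.
The slow-weight update acts as an EMA of the final fast weights, reducing variance and improving stability.
Lookahead is robust to the choice of inner optimizer and hyperparameters; a fixed α works well across tasks.
In the noisy quadratic model, under suitable settings and at the same learning rate, Lookahead's steady-state variance is strictly lower than that of SGD.
The deterministic quadratic analysis shows that Lookahead can improve the convergence rate in the under-damped regime.
Empirically, Lookahead reaches competitive or better final accuracy with minimal hyperparameter tuning.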
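
The EMA behavior mentioned in the findings follows directly from the slow-weight rule φ ← φ + α(θ − φ). The short derivation below unrolls that rule. The indexing is an assumption made for this sketch: φ_t denotes the slow weights before outer iteration t, and θ_{t,k} the fast weights after the k inner steps of that iteration.

```latex
\begin{aligned}
\phi_{t+1} &= \phi_t + \alpha\,(\theta_{t,k} - \phi_t)
            = (1-\alpha)\,\phi_t + \alpha\,\theta_{t,k} \\
           &= \alpha \sum_{i=0}^{t} (1-\alpha)^{i}\,\theta_{t-i,k}
              \;+\; (1-\alpha)^{t+1}\,\phi_0
\end{aligned}
```

Each earlier endpoint θ_{t−i,k} is weighted by (1 − α)^i, so for 0 < α < 1 recent fast-weight endpoints dominate while older ones decay geometrically. This averaging over endpoints is consistent with the variance reduction reported above.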


This review was generated by AI and checked by a human editor.