QUICK REVIEW

[Paper Review] Adafactor: Adaptive Learning Rates with Sublinear Memory Cost

Noam Shazeer, Mitchell Stern|arXiv (Cornell University)|Apr 11, 2018

Stochastic Gradient Optimization Techniques | References: 10 | Cited by: 163

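One-Sentence Summary

Adafactor is a memory-efficient adaptive optimizer that factors the second-moment estimate of each matrix parameter into per-row and per-column sums, giving sublinear memory use while matching Adam's performance in Transformer training. It also proposes update clipping and a decay-rate schedule to stabilize training, and introduces relative step sizes for scale-aware updates.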

ABSTRACT

In several recently proposed stochastic optimization methods (e.g. RMSProp, Adam, Adadelta), parameter updates are scaled by the inverse square roots of exponential moving averages of squared past gradients. Maintaining these per-parameter second-moment estimators requires memory equal to the number of parameters. For the case of neural network weight matrices, we propose maintaining only the per-row and per-column sums of these moving averages, and estimating the per-parameter second moments based on these sums. We demonstrate empirically that this method produces similar results to the baseline. Secondly, we show that adaptive methods can produce larger-than-desired updates when the decay rate of the second moment accumulator is too slow. We propose update clipping and a gradually increasing decay rate scheme as remedies. Combining these methods and dropping momentum, we achieve comparable results to the published Adam regime in training the Transformer model on the WMT 2014 English-German machine translation task, while using very little auxiliary storage in the optimizer. Finally, we propose scaling the parameter updates based on the scale of the parameters themselves.

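Research Motivation and Objectives

Motivate the memory constraints that adaptive gradient methods face as model size grows.
Propose a factored second-moment estimator that reduces the auxiliary memory for an n×m matrix parameter from O(nm) to O(n+m).
Identify the stability problems that arise when the second-moment decay is too slow, and give remedies (update clipping and a gradually increasing decay rate).
Show that Adafactor reaches results comparable to Adam in Transformer training with significantly less memory.
Extend the optimizer with relative step sizes so that the update magnitude scales with the parameter magnitude.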
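The O(nm) versus O(n+m) claim can be made concrete with a small count. The sketch below is illustrative only: the helper name `second_moment_floats` and the 4096×1024 matrix shape are assumptions chosen for the example, not values from the paper.

```python
def second_moment_floats(n, m, factored):
    """Number of accumulator values kept for one n x m weight matrix."""
    # Full (Adam-style) accumulator: one value per parameter.
    # Factored accumulator: one value per row plus one per column.
    return n + m if factored else n * m


# Hypothetical layer shape, chosen only for illustration.
n, m = 4096, 1024
full = second_moment_floats(n, m, factored=False)
factored = second_moment_floats(n, m, factored=True)
print(f"full: {full}, factored: {factored}, ratio: {full / factored:.1f}x")
```

The gap widens as both dimensions grow, which is why the saving matters most for the large weight matrices of a Transformer.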

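Proposed Method

Introduces a factored second-moment representation V ≈ RS for matrix parameters, where R ∈ R^{n×k}, S ∈ R^{k×m}, and k ≪ n, m.
Derives an analytic solution for the rank-1 (k = 1) case, V_hat = (V 1_m)(1_n^T V) / (1_n^T V 1_m), which is compatible with exponential smoothing of the row and column sums.
Implements Adam with a factored second moment, using per-row and per-column accumulators (R_t and C_t) that are normalized to form V_hat_t = (R_t C_t) / (1_n^T R_t).
Proposes update clipping, which scales down the unscaled update U_t whenever RMS(U_t) exceeds a threshold d.
Proposes an increasing decay schedule for the second moment (following Reddi et al.), along with alternative schedules that stabilize training.
Defines Adafactor as a relative-step-size optimizer: the absolute step size alpha_t is computed from the RMS of the parameter and a relative step size rho_t, the per-parameter update is U_t = G_t / sqrt(V_hat_t), and clipping is optionally applied.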
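The steps listed above can be put together into a single update for one matrix parameter. This is a minimal NumPy sketch, not the authors' implementation. The function name `adafactor_step`, the helper `rms`, the small constants `eps1` and `eps2`, the relative step size `min(1e-2, 1/sqrt(t))`, and the decay exponent `decay_c` are all assumptions made for illustration, because this summary does not state them.

```python
import numpy as np


def rms(x):
    """Root-mean-square of all entries of an array."""
    return float(np.sqrt(np.mean(np.square(x))))


def adafactor_step(x, grad, r, c, t, d=1.0, eps1=1e-30, eps2=1e-3, decay_c=0.8):
    """One momentum-free step for an n x m parameter x (assumed defaults).

    r has shape (n,) and c has shape (m,); they are the only optimizer
    state. t is the 1-based step count.
    """
    beta2 = 1.0 - t ** (-decay_c)            # increasing decay rate, 0 at t = 1
    rho = min(1e-2, 1.0 / np.sqrt(t))        # relative step size (assumed schedule)
    alpha = max(eps2, rms(x)) * rho          # absolute step scales with RMS of x

    g2 = np.square(grad) + eps1              # squared gradient, kept strictly positive
    r = beta2 * r + (1.0 - beta2) * g2.sum(axis=1)   # per-row sums
    c = beta2 * c + (1.0 - beta2) * g2.sum(axis=0)   # per-column sums

    v_hat = np.outer(r, c) / r.sum()         # V_hat_t = R_t C_t / (1_n^T R_t)
    u = grad / np.sqrt(v_hat)                # unscaled update U_t
    u_hat = u / max(1.0, rms(u) / d)         # update clipping at threshold d
    return x - alpha * u_hat, r, c


# Usage: the state for a 2 x 3 matrix is 2 + 3 values instead of 6.
x = np.ones((2, 3))
grad = np.sqrt(np.outer([1.0, 2.0], [3.0, 4.0, 5.0]))
x_new, r, c = adafactor_step(x, grad, np.zeros(2), np.zeros(3), t=1)
```

In the usage example the squared gradient is exactly rank-1, so the factored estimate reproduces it and every entry of the update has the same magnitude. For a general gradient the factored estimate is only an approximation of the full second moment.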

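Experimental Results

Research Questions

RQ1: Can factored (per-row and per-column) second-moment estimates match the performance of full second-moment accumulation in an adaptive optimizer?
RQ2: Does reducing memory through factorization affect convergence and model quality on large-scale tasks such as Transformer training?
RQ3: Which stability problems appear when momentum is dropped and adaptive learning rates are used, and how do update clipping and decay-rate scheduling mitigate them?
RQ4: Do relative step sizes that scale with parameter magnitude improve robustness across differently scaled parameter initializations?
RQ5: How do the proposed remedies (update clipping, increasing decay, relative step sizes) interact in practice on a modern neural machine translation task?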

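Key Findings

Factored second-moment estimation reduces the memory for a matrix parameter from O(nm) to O(n+m) while matching full-accumulator Adam on Transformer BLEU scores.
Removing momentum can destabilize training, but update clipping and a suitable decay schedule restore stability.
In the no-warmup setting, update clipping with threshold d improves stability: d = 1 markedly reduces instability, while d = 2 showed no improvement.
Relative step sizes combined with Adafactor remain competitive and are robust across different embedding-parameter scales.
An increasing decay schedule for the second moment (for example, 1 - t^{-c}) gives stable, convergent results for some values of c, notably c = 0.5 (and related variants) when combined with clipping.
With rank-1 factored representations and the proposed scaling, Transformer models train with little auxiliary storage and reach BLEU scores close to the Adam-based baseline.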
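To see what an increasing decay schedule of the form 1 - t^{-c} looks like, the short sketch below evaluates it at a few step counts. The function name `increasing_decay` and the sampled steps are assumptions for illustration; c = 0.5 is the value highlighted in the findings above.

```python
def increasing_decay(t, c=0.5):
    """Second-moment decay rate at 1-based step t, following 1 - t^{-c}."""
    return 1.0 - t ** (-c)


# The rate starts at 0 (the first step uses only the current squared
# gradient) and approaches 1, so later steps average over longer histories.
for t in (1, 4, 100, 10000):
    print(t, round(increasing_decay(t), 4))
```

Because the rate is 0 at t = 1, the accumulator needs no separate bias correction under this schedule: the first estimate is simply the first squared gradient.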


This review was generated by AI and reviewed by a human editor.