QUICK REVIEW

[Paper Review] AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights

Byeongho Heo, Sanghyuk Chun|arXiv (Cornell University)|Jun 15, 2020

Advanced Neural Network Applications | References: 66 | Cited by: 81

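One-sentence summary

AdamP introduces a projection-based update that removes the radial component from momentum-optimizer steps, preserving the effective step size of scale-invariant weights and improving performance across a range of tasks.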

ABSTRACT

Normalization techniques are a boon for modern deep learning. They let weights converge more quickly with often better generalization performances. It has been argued that the normalization-induced scale invariance among the weights provides an advantageous ground for gradient descent (GD) optimizers: the effective step sizes are automatically reduced over time, stabilizing the overall training procedure. It is often overlooked, however, that the additional introduction of momentum in GD optimizers results in a far more rapid reduction in effective step sizes for scale-invariant weights, a phenomenon that has not yet been studied and may have caused unwanted side effects in the current practice. This is a crucial issue because arguably the vast majority of modern deep neural networks consist of (1) momentum-based GD (e.g. SGD or Adam) and (2) scale-invariant parameters. In this paper, we verify that the widely-adopted combination of the two ingredients lead to the premature decay of effective step sizes and sub-optimal model performances. We propose a simple and effective remedy, SGDP and AdamP: get rid of the radial component, or the norm-increasing direction, at each optimizer step. Because of the scale invariance, this modification only alters the effective step sizes without changing the effective update directions, thus enjoying the original convergence properties of GD optimizers. Given the ubiquity of momentum GD and scale invariance in machine learning, we have evaluated our methods against the baselines on 13 benchmarks. They range from vision tasks like classification (e.g. ImageNet), retrieval (e.g. CUB and SOP), and detection (e.g. COCO) to language modelling (e.g. WikiText) and audio classification (e.g. DCASE) tasks. We verify that our solution brings about uniform gains in those benchmarks. Source code is available at https://github.com/clovaai/AdamP.

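Research motivation and goals

Motivating problem: normalization layers make weights scale-invariant, and under momentum-based optimizers this causes the effective step size to shrink prematurely.
Study how momentum accelerates norm growth on scale-invariant weights and reduces training efficiency.
Propose a simple projection-based remedy (SGDP/AdamP) that stabilizes the effective step size while preserving the update direction.
Demonstrate the method's effectiveness across multiple benchmarks and architectures.
Provide practical guidance and code for applying the method in real training pipelines.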
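
The link between weight norm and effective step size mentioned above can be made concrete with a short derivation for plain gradient descent (no momentum). This is a sketch under the assumption that the loss f is exactly scale-invariant in w; the symbols \eta (learning rate) and \hat{w} (unit-norm weights) are notation introduced here for illustration.

```latex
% Scale invariance: f(c w) = f(w) for every c > 0.
% Differentiating in c at c = 1 gives orthogonality of weight and gradient;
% differentiating in w gives the inverse scaling of the gradient.
\begin{align}
  w^\top \nabla f(w) &= 0, &
  \nabla f(c\,w) &= \tfrac{1}{c}\,\nabla f(w). \\
\intertext{For a GD step $w_{t+1} = w_t - \eta \nabla f(w_t)$, orthogonality implies the norm can only grow:}
  \lVert w_{t+1} \rVert^2 &= \lVert w_t \rVert^2 + \eta^2 \lVert \nabla f(w_t) \rVert^2 . \\
\intertext{Writing $\hat{w}_t = w_t / \lVert w_t \rVert$ and using $\nabla f(w_t) = \nabla f(\hat{w}_t) / \lVert w_t \rVert$:}
  w_{t+1} &= \lVert w_t \rVert \Bigl( \hat{w}_t - \frac{\eta}{\lVert w_t \rVert^2}\, \nabla f(\hat{w}_t) \Bigr).
\end{align}
```

On the unit sphere the step therefore behaves like one with effective step size \eta / \lVert w_t \rVert^2, so any extra norm growth (such as the growth the paper attributes to momentum) shrinks the effective step further.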

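Proposed method

Model how scale invariance affects the effective step size in SGD and Adam with momentum.
Show that, under momentum, weight-norm growth accelerates, which speeds up the decay of the effective step size on the normalized-weight sphere.
Introduce a projection operator onto the tangent space of the weights that removes the radial (norm-increasing) component from the update.
Define SGDP and AdamP as momentum-based optimizers that apply the projection conditionally, using cosine similarity with the weights to detect scale-invariant weights.
Argue that the projected update preserves the effective direction on the normalized-weight sphere, so convergence properties are maintained.
Provide practical algorithms (SGDP and AdamP) with channel-wise and layer-wise variants.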
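
The projection and the conditional rule described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the helper names (`project_out_radial`, `abs_cosine`, `sgdp_step`), the threshold form `delta / sqrt(dim)`, the default `delta=0.1`, and the choice to compare the weights against the raw gradient are all assumptions made here for the example.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def project_out_radial(w, update, eps=1e-12):
    """Return `update` minus its component along `w` (tangent-space projection)."""
    scale = dot(w, update) / (dot(w, w) + eps)
    return [u - scale * wi for u, wi in zip(update, w)]

def abs_cosine(a, b, eps=1e-12):
    """Absolute cosine similarity between two vectors."""
    return abs(dot(a, b)) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)) + eps)

def sgdp_step(w, grad, buf, lr=0.1, beta=0.9, delta=0.1):
    """One momentum-SGD step that projects when `w` looks scale-invariant."""
    # Standard heavy-ball momentum accumulation.
    buf = [beta * b + g for b, g in zip(buf, grad)]
    update = buf
    # Assumed detection rule: the gradient is nearly orthogonal to the weights.
    if abs_cosine(w, grad) < delta / math.sqrt(len(w)):
        update = project_out_radial(w, buf)
    new_w = [wi - lr * u for wi, u in zip(w, update)]
    return new_w, buf

# Hypothetical example: the momentum buffer carries a purely radial component.
w = [1.0, 0.0]
grad = [0.0, 1.0]
buf = [1.0, 0.0]
new_w, new_buf = sgdp_step(w, grad, buf)
print(new_w)  # close to [1.0, -0.1]: the radial part of the buffer is dropped
```

In this example the unprojected step would have moved the first coordinate to 0.91; the projected step leaves it at roughly 1.0 and only moves the weights tangentially.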

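Experimental results

Research questions

RQ1: How does momentum interact with scale-invariant weights to affect the effective learning rate during training?
RQ2: Can removing the radial component of the update via projection recover or preserve the benefits of momentum in the effective weight space?
RQ3: Do SGDP and AdamP improve performance over standard SGD/AdamW/Adam across diverse tasks and architectures?
RQ4: Is the proposed projection computationally efficient enough for large-scale training?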

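Key findings

Momentum on scale-invariant weights makes the weight norm grow quickly, so the effective step size decays rapidly.
Simply projecting the momentum update onto the tangent space of the weight sphere curbs the momentum-induced norm accumulation while keeping the update direction.
SGDP and AdamP show consistent gains on 13 benchmarks spanning ImageNet, retrieval, detection, robustness, audio, and language-modeling tasks.
AdamP outperforms the baselines on several tasks, including image classification, object detection, robustness benchmarks, and audio classification, at moderate overhead.
For Transformer-based language modeling, AdamP with weight normalization improves perplexity on WikiText-103.
On retrieval benchmarks that use ℓ2-normalized embeddings, AdamP improves over AdamW on multiple datasets.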
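
The first two findings can be reproduced qualitatively on a toy problem. The sketch below is not one of the paper's experiments: the loss f(w) = -(w/||w||) . u, the starting point, and the hyperparameters are hypothetical choices made here so that the effect is visible after only two steps.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def toy_grad(w, u):
    """Gradient of the toy scale-invariant loss f(w) = -(w/||w||) . u."""
    n = math.sqrt(dot(w, w))
    w_hat = [wi / n for wi in w]
    c = dot(w_hat, u)
    return [-(ui - c * hi) / n for ui, hi in zip(u, w_hat)]

def run(steps, project, lr=0.1, beta=0.9):
    """Return the squared weight norms along a momentum trajectory."""
    w, u, buf = [1.0, 0.0], [0.0, 1.0], [0.0, 0.0]
    sq_norms = [dot(w, w)]
    for _ in range(steps):
        g = toy_grad(w, u)
        buf = [beta * b + gi for b, gi in zip(buf, g)]
        update = buf
        if project:
            # Drop the radial part of the momentum update.
            scale = dot(w, buf) / dot(w, w)
            update = [b - scale * wi for b, wi in zip(buf, w)]
        w = [wi - lr * ui for wi, ui in zip(w, update)]
        sq_norms.append(dot(w, w))
    return sq_norms

plain = run(2, project=False)
proj = run(2, project=True)
for name, sq in (("momentum", plain[-1]), ("projected", proj[-1])):
    # Effective step size on the unit sphere is lr / ||w||^2.
    print(name, round(sq, 4), round(0.1 / sq, 4))
```

After two steps the squared norm is roughly 1.064 without projection and 1.046 with it, so the projected run keeps a slightly larger effective step size. The norm still grows under projection, only without the extra radial contribution accumulated by momentum.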

This review was generated by AI and reviewed by human editors.