QUICK REVIEW

[Paper Digest] Understanding AdamW through Proximal Methods and Scale-Freeness

Zhenxun Zhuang, Mingrui Liu|arXiv (Cornell University)|Jan 31, 2022

Neural Networks and Applications | Cited by 36

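One-Sentence Summary

The paper shows that AdamW approximates a proximal update and is scale-free, which gives it an optimization advantage over Adam-L2, especially in deep networks without batch normalization; it also links scale-freeness to a reduced dependence on the condition number.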

ABSTRACT

Adam has been widely adopted for training deep neural networks due to less hyperparameter tuning and remarkable performance. To improve generalization, Adam is typically used in tandem with a squared $\ell_2$ regularizer (referred to as Adam-$\ell_2$). However, even better performance can be obtained with AdamW, which decouples the gradient of the regularizer from the update rule of Adam-$\ell_2$. Yet, we are still lacking a complete explanation of the advantages of AdamW. In this paper, we tackle this question from both an optimization and an empirical point of view. First, we show how to re-interpret AdamW as an approximation of a proximal gradient method, which takes advantage of the closed-form proximal mapping of the regularizer instead of only utilizing its gradient information as in Adam-$\ell_2$. Next, we consider the property of "scale-freeness" enjoyed by AdamW and by its proximal counterpart: their updates are invariant to component-wise rescaling of the gradients. We provide empirical evidence across a wide range of deep learning experiments showing a correlation between the problems in which AdamW exhibits an advantage over Adam-$\ell_2$ and the degree to which we expect the gradients of the network to exhibit multiple scales, thus motivating the hypothesis that the advantage of AdamW could be due to the scale-free updates.

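Research Motivation and Objectives

Explain why decoupled weight decay (AdamW) outperforms Adam with L2 regularization (Adam-L2) in both generalization and optimization.
Connect AdamW to proximal updates from the viewpoint of proximal optimization, and use scale-freeness to explain its empirical advantages.
Empirically identify the practical training settings, in particular very deep networks without batch normalization, where AdamW has a marked advantage over Adam-L2.
Examine how robust AdamW's scale-freeness is under the nonzero ε used in practice, and relate it to update behavior in deep networks.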

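Proposed Method

Derive AdamW as an approximation of a proximal update on the regularized objective F(x) = (λ/2)||x||^2 + f(x).
Show that AdamW is the first-order Taylor approximation of the proximal update with M_t = η_t I_d and p_t = α m̂_t/(√v̂_t + ε).
Show that AdamW and the proximal update are both scale-free when ε = 0, whereas Adam-L2 loses scale-freeness whenever λ > 0.
Give a theoretical argument that scale-freeness acts as automatic preconditioning and improves the dependence on the condition number for a certain class of functions.
Verify scale-freeness empirically by rescaling the loss and observing update stability in networks without Batch Normalization.
Compare AdamW, AdamProx, and Adam-L2 on CIFAR-10/100 with ResNet and DenseNet architectures, with and without Batch Normalization.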
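
The proximal reading of AdamW can be sketched as a short derivation. This is a reconstruction from the standard definition of a proximal step rather than the paper's exact notation: the arg-min form below and the treatment of p_t as a fixed direction are assumptions made for illustration.

```latex
% Proximal step on F(x) = (\lambda/2)\|x\|^2 + f(x), with M_t = \eta_t I_d and
% p_t = \alpha \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon) treated as a fixed direction.
\begin{align*}
x_t &= \arg\min_{x}\; \frac{\lambda}{2}\|x\|^2 + \langle p_t, x \rangle
       + \frac{1}{2\eta_t}\|x - x_{t-1}\|^2 \\
    &= (1 + \lambda\eta_t)^{-1}\,(x_{t-1} - \eta_t p_t)
    && \text{(set the gradient in } x \text{ to zero)} \\
    &\approx (1 - \lambda\eta_t)\,x_{t-1} - \eta_t p_t
    && \text{(first-order Taylor expansion in } \eta_t\text{)}
\end{align*}
```

The last line is the decoupled weight-decay step of AdamW. The dropped terms are of order η_t^2, so the two updates should differ little whenever λη_t is small.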

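Experimental Results

Research Questions

RQ1: Can AdamW be viewed as a proximal update on the regularized objective? If so, under which approximations?
RQ2: How does scale-freeness affect the optimization behavior and convergence of AdamW relative to Adam-L2?
RQ3: In which training settings (for example, very deep networks without Batch Normalization) does AdamW outperform Adam-L2, and why?
RQ4: With nonzero ε, is AdamW approximately scale-free in practice, and how robust is this property?
RQ5: Do AdamW and AdamProx produce similar optimization dynamics under common learning rate schedules?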

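Key Findings

AdamW approximates a proximal update, which uses the full regularizer through its closed-form proximal mapping rather than only its gradient.
AdamW and the proximal update are scale-free (when ε ≈ 0), whereas Adam-L2 loses scale-freeness when λ > 0.
Scale-freeness provides automatic preconditioning, reducing sensitivity to the condition number for certain functions.
Without Batch Normalization, AdamW markedly outperforms Adam-L2 in both training and test performance on very deep networks.
As network depth grows, the update magnitudes of Adam-L2 spread over a wider range of scales than those of AdamW, which correlates with AdamW's larger accuracy gains.
Under typical learning rate schedules, AdamW is approximately equivalent to AdamProx, supporting the proximal interpretation.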
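
The scale-freeness claim can be checked on a toy problem. The NumPy sketch below is not the paper's code: the function `run_adam`, the quadratic loss, and every hyperparameter value are hypothetical choices for illustration. It rescales each gradient coordinate by a different factor, sets ε = 0, and compares the final iterates of the AdamW-style, proximal, and Adam-L2-style updates.

```python
import numpy as np

def run_adam(mode, scale, steps=50, lam=0.01, alpha=0.01, eta=1.0,
             beta1=0.9, beta2=0.999, eps=0.0):
    """Toy Adam variant; mode is "adamw", "prox" or "l2" (hypothetical names)."""
    b = np.array([2.0, 2.0, 2.0])    # minimizer of the toy quadratic loss
    x = np.array([1.0, -2.0, 3.0])   # starting point
    m = np.zeros_like(x)
    v = np.zeros_like(x)
    for t in range(1, steps + 1):
        g = scale * (x - b)          # loss gradient, rescaled per coordinate
        if mode == "l2":
            g = g + lam * x          # Adam-L2: regularizer gradient passes through Adam
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)             # bias-corrected first moment
        v_hat = v / (1 - beta2 ** t)             # bias-corrected second moment
        p = alpha * m_hat / (np.sqrt(v_hat) + eps)
        if mode == "adamw":
            x = x - eta * lam * x - eta * p      # decoupled weight decay
        elif mode == "prox":
            x = (x - eta * p) / (1 + lam * eta)  # closed-form proximal step
        else:
            x = x - eta * p                      # plain Adam step on the L2-augmented gradient
    return x

ones = np.ones(3)
mixed = np.array([1e-3, 1.0, 1e3])   # component-wise gradient rescaling

for mode in ("adamw", "prox", "l2"):
    gap = np.max(np.abs(run_adam(mode, ones) - run_adam(mode, mixed)))
    print(f"{mode:6s} max gap after rescaling: {gap:.2e}")
```

In this sketch the AdamW and proximal iterates should agree up to floating-point error under rescaling, while the Adam-L2 iterates move visibly, because rescaling changes the balance between the loss gradient and the λx term before normalization.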

This digest was generated by AI and reviewed by human editors.