QUICK REVIEW

[Paper Review] Slowing Down the Weight Norm Increase in Momentum-based Optimizers

Byeongho Heo, Sanghyuk Chun|arXiv (Cornell University)|Jun 15, 2020

Advanced Neural Network Applications | References: 19 | Cited by: 22

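One-Sentence Summary

The paper shows that, when combined with batch normalization, momentum-based optimizers (such as SGD and Adam) aggravate the uncontrolled growth of weight norms because of scale invariance, which suppresses the effective learning rate and leads to sub-optimal performance. To address this, it proposes SGDP and AdamP, modified optimizers that remove the radial component of each weight update to prevent unnecessary norm growth, improving training stability and performance across a range of deep learning tasks.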

ABSTRACT

Normalization techniques, such as batch normalization (BN), have led to significant improvements in deep neural network performances. Prior studies have analyzed the benefits of the resulting scale invariance of the weights for the gradient descent (GD) optimizers: it leads to a stabilized training due to the auto-tuning of step sizes. However, we show that, combined with the momentum-based algorithms, the scale invariance tends to induce an excessive growth of the weight norms. This in turn overly suppresses the effective step sizes during training, potentially leading to sub-optimal performances in deep neural networks. We analyze this phenomenon both theoretically and empirically. We propose a simple and effective solution: at each iteration of momentum-based GD optimizers (e.g. SGD or Adam) applied on scale-invariant weights (e.g. Conv weights preceding a BN layer), we remove the radial component (i.e. parallel to the weight vector) from the update vector. Intuitively, this operation prevents the unnecessary update along the radial direction that only increases the weight norm without contributing to the loss minimization. We verify that the modified optimizers SGDP and AdamP successfully regularize the norm growth and improve the performance of a broad set of models. Our experiments cover tasks including image classification and retrieval, object detection, robustness benchmarks, and audio classification. Source code is available at this https URL.

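Research Motivation and Objectives

Investigate the negative effect of momentum-based optimizers on weight norm growth when they are combined with batch normalization.
Analyze how the scale invariance induced by batch normalization layers causes weight norms to grow excessively during training.
Address the resulting suppression of the effective step size, which hinders convergence and model performance.
Propose a simple and effective way to regularize weight norm growth without modifying the network architecture.
Validate the proposed method empirically on a variety of deep learning tasks and models.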
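
The link between norm growth and step-size suppression can be made concrete with a short, standard argument for plain gradient descent. This is a sketch under stated assumptions: the symbols f (loss), w (a scale-invariant weight vector), c (a positive scalar), and η (learning rate) are notation introduced here for illustration, not taken from the review.

```latex
% Assumption: f is scale-invariant in w, i.e. f(c w) = f(w) for every c > 0.
\begin{align}
  f(c\,w) = f(w)
    \;&\Rightarrow\; w^\top \nabla f(w) = 0,
    \qquad \nabla f(c\,w) = \tfrac{1}{c}\,\nabla f(w), \\
  w_{t+1} = w_t - \eta\,\nabla f(w_t)
    \;&\Rightarrow\; \|w_{t+1}\|^2 = \|w_t\|^2 + \eta^2\,\|\nabla f(w_t)\|^2, \\
  \frac{w_{t+1} - w_t}{\|w_t\|}
    \;&=\; -\frac{\eta}{\|w_t\|^2}\,\nabla f\!\left(\frac{w_t}{\|w_t\|}\right).
\end{align}
```

Because the gradient is orthogonal to w, every step can only increase the norm, and the last line shows that the step measured on the unit sphere scales as η/‖w_t‖². A larger norm therefore means a smaller effective step, which is the suppression the paper attributes to momentum-accelerated norm growth.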

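Proposed Method

At each optimization step, the method projects out the radial component of the update vector (the component parallel to the current weight vector) before the update is applied.
This is done by subtracting from the update vector its projection onto the weight vector, which removes the part of the update that only increases the norm without reducing the loss.
The method is applied to standard momentum-based optimizers (such as SGD and Adam), yielding SGDP and AdamP respectively.
The modification is lightweight and compatible with existing training pipelines, and requires no hyperparameter tuning beyond the standard optimizer settings.
The method preserves the beneficial scale invariance of batch normalization while preventing unstable norm growth.
Removing the radial component is mathematically equivalent to constraining the update vector to the tangent space of the sphere whose radius is the current weight norm.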
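
The projection described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' released implementation: the function name `remove_radial_component`, the `eps` guard, and the example vectors are all assumptions made here, and any further conditions the real SGDP and AdamP optimizers apply are not reproduced.

```python
def remove_radial_component(update, weight, eps=1e-12):
    """Return `update` minus its projection onto `weight`.

    Hypothetical helper for illustration; the result lies in the
    tangent space of the sphere through `weight`.
    """
    w_sq = sum(w * w for w in weight)
    if w_sq < eps:
        # Degenerate weight vector: no radial direction to remove.
        return list(update)
    # Scalar coefficient of the projection of `update` onto `weight`.
    coeff = sum(u * w for u, w in zip(update, weight)) / w_sq
    return [u - coeff * w for u, w in zip(update, weight)]


weight = [3.0, 4.0]   # example weight vector, norm 5
update = [1.0, 2.0]   # example update with a radial part
tangential = remove_radial_component(update, weight)

# The remaining update is orthogonal to the weight vector.
radial_leftover = sum(t * w for t, w in zip(tangential, weight))
```

For these example vectors the projection coefficient is 11/25, so `tangential` comes out as approximately `[-0.32, 0.24]` and its dot product with `weight` is zero up to rounding.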

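Experimental Results

Research Questions

RQ1: How does batch normalization, combined with momentum-based optimizers, affect weight norm dynamics during training?
RQ2: Why does excessive weight norm growth degrade the performance of deep networks despite scale invariance?
RQ3: Does removing radial updates from the optimization process stabilize training and improve generalization?
RQ4: How do the proposed SGDP and AdamP optimizers compare with standard SGD and Adam across architectures and tasks?
RQ5: Does the method maintain or improve performance on robustness benchmarks and downstream tasks?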

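Key Findings

The proposed SGDP and AdamP optimizers successfully regularize weight norm growth in batch-normalized networks, preventing excessive increase during training.
The modified optimizers achieve better generalization and faster convergence on image classification, object detection, and audio classification tasks.
SGDP and AdamP outperform standard SGD and Adam on multiple benchmarks, including ImageNet and CIFAR-100, with consistent gains in Top-1 accuracy.
The method improves robustness to distribution shift and adversarial examples, indicating stronger generalization under perturbation.
Removing the radial update adds very little computational overhead and is effective across a range of models, including ResNets, Vision Transformers, and EfficientNet.
The empirical results confirm that the effective step size suppression caused by norm growth is mitigated, giving more stable and efficient training.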
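
The norm-growth effect behind these findings can be seen on a toy problem. The sketch below is not one of the paper's experiments: it assumes a two-dimensional scale-invariant loss f(w) = -cos(w, target), a heavy-ball momentum update, and hypothetical names (`cosine_loss_grad`, `run_momentum_sgd`) and hyperparameters chosen only for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine_loss_grad(w, target):
    """Gradient of f(w) = -cos(w, target), which is scale-invariant in w."""
    w_norm = math.sqrt(dot(w, w))
    t_norm = math.sqrt(dot(target, target))
    cos = dot(w, target) / (w_norm * t_norm)
    # grad f = -(target/|target| - cos * w/|w|) / |w|, orthogonal to w.
    return [-(t / t_norm - cos * x / w_norm) / w_norm
            for x, t in zip(w, target)]


def run_momentum_sgd(steps, project, lr=0.1, momentum=0.9):
    """Run toy momentum SGD and return the weight norm after each step."""
    w = [1.0, 0.0]
    target = [0.0, 1.0]
    buf = [0.0, 0.0]  # momentum buffer
    norms = [math.sqrt(dot(w, w))]
    for _ in range(steps):
        grad = cosine_loss_grad(w, target)
        buf = [momentum * b + g for b, g in zip(buf, grad)]
        step = buf
        if project:
            # Remove the component of the update parallel to w.
            coeff = dot(buf, w) / dot(w, w)
            step = [b - coeff * x for b, x in zip(buf, w)]
        w = [x - lr * s for x, s in zip(w, step)]
        norms.append(math.sqrt(dot(w, w)))
    return norms


plain = run_momentum_sgd(steps=2, project=False)
projected = run_momentum_sgd(steps=2, project=True)
```

The first step is identical in both runs, because the momentum buffer equals the gradient and the gradient is already orthogonal to the weights. From the second step on, the buffer carries a component along the weight vector, so the unprojected run ends with a strictly larger norm than the projected one. Longer runs can be inspected by raising `steps`; whatever they show on this toy loss is not evidence about the benchmarks above.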


This review was generated by AI and checked by a human editor.