QUICK REVIEW

[Paper Review] Three Mechanisms of Weight Decay Regularization

Guodong Zhang, Chaoqi Wang | arXiv (Cornell University) | Oct 29, 2018

Neural Networks and Applications | References: 20 | Cited by: 55

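One-Sentence Summary

The paper identifies three distinct mechanisms by which weight decay improves generalization across optimizers and architectures: a higher effective learning rate, approximate regularization of the input-output Jacobian norm, and a lower effective damping coefficient for second-order methods.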

ABSTRACT

Weight decay is one of the standard tricks in the neural network toolbox, but the reasons for its regularization effect are poorly understood, and recent results have cast doubt on the traditional interpretation in terms of $L_2$ regularization. Literal weight decay has been shown to outperform $L_2$ regularization for optimizers for which they differ. We empirically investigate weight decay for three optimization algorithms (SGD, Adam, and K-FAC) and a variety of network architectures. We identify three distinct mechanisms by which weight decay exerts a regularization effect, depending on the particular optimization algorithm and architecture: (1) increasing the effective learning rate, (2) approximately regularizing the input-output Jacobian norm, and (3) reducing the effective damping coefficient for second-order optimization. Our results provide insight into how to improve the regularization of neural networks.

Motivation and Objectives

Investigate why weight decay improves generalization beyond traditional L2 interpretation.
Compare weight decay with L2 regularization across SGD, Adam, and K-FAC on CNN architectures.
Elucidate how weight decay interacts with Batch Normalization and different optimizers to affect training dynamics.
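
The difference between literal weight decay and L2 regularization can be illustrated with a minimal sketch. It assumes a single parameter vector and hand-written update rules. The function names, the decay coefficient `beta`, and the simplified first Adam step are illustrative choices, not code from the paper. For plain SGD the two updates coincide. For an adaptive method they differ, because the L2 penalty gradient passes through the preconditioner while literal decay does not.

```python
import numpy as np

def sgd_l2(theta, grad, lr, beta):
    # L2 regularization: the penalty gradient beta * theta joins the loss gradient.
    return theta - lr * (grad + beta * theta)

def sgd_decay(theta, grad, lr, beta):
    # Literal weight decay: shrink the weights, then take a plain gradient step.
    return (1.0 - lr * beta) * theta - lr * grad

def adam_direction(grad, eps=1e-8):
    # First Adam step from zero moments with bias correction:
    # m_hat = grad and v_hat = grad**2, so the direction is grad / (|grad| + eps).
    return grad / (np.sqrt(grad ** 2) + eps)

def adam_l2(theta, grad, lr, beta):
    # The penalty gradient is rescaled by the adaptive preconditioner.
    return theta - lr * adam_direction(grad + beta * theta)

def adam_decay(theta, grad, lr, beta):
    # The decay term bypasses the preconditioner entirely.
    return (1.0 - lr * beta) * theta - lr * adam_direction(grad)

theta = np.array([1.0, -2.0, 0.5])
grad = np.array([0.1, 0.3, -0.2])
lr, beta = 0.1, 0.01

same_for_sgd = np.allclose(sgd_l2(theta, grad, lr, beta),
                           sgd_decay(theta, grad, lr, beta))
same_for_adam = np.allclose(adam_l2(theta, grad, lr, beta),
                            adam_decay(theta, grad, lr, beta))
print(same_for_sgd, same_for_adam)
```

In this toy setting the SGD updates agree to floating-point precision, while the Adam-style updates do not.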

Proposed Method

Analyze the effect of weight decay versus L2 regularization on SGD, Adam, and K-FAC (GN and Fisher variants).
Examine BN-influenced networks to uncouple representation constraints from weight scales.
Derive and test interpretations: effective learning rate, Gauss-Newton / Jacobian norms, and damping in second-order updates.
Empirically measure effective learning rate, Jacobian norms, and damping terms during training on CIFAR-10/100 with VGG and ResNet architectures.
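
Two of the quantities named above can be sketched on a toy model. This summary does not give the paper's exact measurement procedure, so the following is an assumption-laden illustration. `effective_lr` uses the common definition for a scale-invariant layer, the learning rate divided by the squared weight norm. `jacobian_sq_norm` computes the squared Frobenius norm of the input-output Jacobian of a hypothetical two-layer ReLU network without biases. Both function names are invented for this example.

```python
import numpy as np

def effective_lr(lr, w):
    # Assumed definition for a scale-invariant layer: lr / ||w||^2.
    return lr / np.sum(w ** 2)

def jacobian_sq_norm(W1, W2, x):
    # Network: f(x) = W2 @ relu(W1 @ x), no biases.
    # Its Jacobian with respect to x is W2 @ diag(relu'(W1 @ x)) @ W1.
    pre = W1 @ x
    mask = (pre > 0).astype(W1.dtype)   # ReLU derivative at the pre-activations
    J = W2 @ (mask[:, None] * W1)       # same as W2 @ diag(mask) @ W1
    return np.sum(J ** 2)               # squared Frobenius norm

W1 = np.array([[1.0, 0.0], [0.0, -1.0]])
W2 = np.array([[1.0, 1.0]])
x = np.array([1.0, 1.0])

print(effective_lr(0.1, np.array([2.0, 0.0])))
print(jacobian_sq_norm(W1, W2, x))
```

Logging such quantities once per epoch, for runs with and without weight decay, is one plausible way to produce the kind of training curves this step describes.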

Experimental Results

Research Questions

RQ1: What mechanisms explain weight decay's regularization effect across different optimizers and BN-enabled architectures?
RQ2: How does weight decay compare to L2 regularization in SGD, Adam, and K-FAC in terms of generalization performance?
RQ3: Can the three identified mechanisms (effective learning rate, Jacobian norm regularization, damping control) account for observed generalization gaps?
RQ4: What role does BN play in mediating weight decay's impact on training dynamics?

Key Findings

Weight decay consistently improves generalization and often outperforms L2 regularization when they differ.
Weight decay reduces generalization gaps between first- and second-order optimizers and between small and large batches.
Weight decay improves performance even for BN-enabled networks, where it does not constrain capacity in the usual sense.
Weight decay yields a strong boost for K-FAC, especially when BN is disabled, by enhancing second-order behavior.
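Mechanism I: In BN networks trained with SGD/Adam, weight decay increases the effective learning rate by shrinking the weight scale, which amplifies gradient-noise regularization.
Mechanism II: For K-FAC, weight decay approximately regularizes the input-output Jacobian through the Gauss-Newton norm, and the Jacobian norm correlates with generalization.
Mechanism III: In BN networks trained with K-FAC, weight decay lowers the effective damping, which helps preserve second-order properties and improves generalization.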
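
Mechanism I rests on scale invariance, which a small numerical sketch can make concrete. The toy loss below depends only on the direction of `w`, standing in for a weight vector that feeds a normalization layer. The loss, the data, and the helper names are assumptions for illustration and are not taken from the paper. Under this assumption the gradient shrinks as 1/||w||, so a step with learning rate `lr` moves the direction as if the learning rate were `lr / ||w||^2`. Shrinking the norm through weight decay therefore raises the effective learning rate.

```python
import numpy as np

def loss(w, x, y):
    # Scale-invariant toy loss: only the direction w / ||w|| matters.
    u = w / np.linalg.norm(w)
    return 0.5 * np.sum((x @ u - y) ** 2)

def grad(w, x, y):
    # Chain rule through the normalization u = w / ||w||.
    n = np.linalg.norm(w)
    u = w / n
    g_u = x.T @ (x @ u - y)            # gradient with respect to the direction u
    return (g_u - u * (u @ g_u)) / n   # remove the radial part, divide by ||w||

def direction_after_step(w, lr, x, y):
    # One gradient step, then report the resulting unit direction.
    w_new = w - lr * grad(w, x, y)
    return w_new / np.linalg.norm(w_new)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 3))
y = rng.normal(size=8)
w = rng.normal(size=3)
c, lr = 3.0, 0.05

# Scaling the weights by c leaves the loss unchanged and divides the gradient by c.
print(np.isclose(loss(c * w, x, y), loss(w, x, y)))
print(np.allclose(grad(c * w, x, y), grad(w, x, y) / c))

# A step on c * w with learning rate lr * c**2 moves the direction exactly as
# a step on w with learning rate lr does, so the effective rate is lr / ||w||^2.
print(np.allclose(direction_after_step(c * w, lr * c ** 2, x, y),
                  direction_after_step(w, lr, x, y)))
```

The same scaling argument hints at Mechanism III. If curvature in such a layer scales as 1/||w||^2, a fixed damping term grows relative to it as the weight norm grows. That last step is a heuristic reading and is not verified by the code above.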


This review was generated by AI and reviewed by a human editor.