QUICK REVIEW

[Paper Review] AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights

Byeongho Heo, Sanghyuk Chun | arXiv (Cornell University) | Jun 15, 2020

Advanced Neural Network Applications | 66 references | 81 citations

TL;DR

AdamP adds a projection step to momentum optimizers that removes the radial (norm-increasing) component of each update, preserving effective step sizes for scale-invariant weights and yielding performance gains across diverse tasks.

ABSTRACT

Normalization techniques are a boon for modern deep learning. They let weights converge more quickly with often better generalization performances. It has been argued that the normalization-induced scale invariance among the weights provides an advantageous ground for gradient descent (GD) optimizers: the effective step sizes are automatically reduced over time, stabilizing the overall training procedure. It is often overlooked, however, that the additional introduction of momentum in GD optimizers results in a far more rapid reduction in effective step sizes for scale-invariant weights, a phenomenon that has not yet been studied and may have caused unwanted side effects in the current practice. This is a crucial issue because arguably the vast majority of modern deep neural networks consist of (1) momentum-based GD (e.g. SGD or Adam) and (2) scale-invariant parameters. In this paper, we verify that the widely-adopted combination of the two ingredients leads to the premature decay of effective step sizes and sub-optimal model performances. We propose a simple and effective remedy, SGDP and AdamP: get rid of the radial component, or the norm-increasing direction, at each optimizer step. Because of the scale invariance, this modification only alters the effective step sizes without changing the effective update directions, thus enjoying the original convergence properties of GD optimizers. Given the ubiquity of momentum GD and scale invariance in machine learning, we have evaluated our methods against the baselines on 13 benchmarks. They range from vision tasks like classification (e.g. ImageNet), retrieval (e.g. CUB and SOP), and detection (e.g. COCO) to language modelling (e.g. WikiText) and audio classification (e.g. DCASE) tasks. We verify that our solution brings about uniform gains in those benchmarks. Source code is available at https://github.com/clovaai/AdamP.

Motivation & Objective

Motivate the issue: normalization layers make the weights that feed into them scale-invariant, and under momentum-based optimizers this leads to a premature reduction of effective step sizes.
Investigate how momentum accelerates the norm growth of scale-invariant weights and thereby degrades training efficiency.
Propose a simple projection-based remedy (SGDP/AdamP) that preserves update directions while stabilizing effective step sizes.
Demonstrate the method's effectiveness across multiple benchmarks and architectures.
Provide practical guidance and code for applying the approach in real-world training pipelines.
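
The argument behind the first two points can be written out as a short derivation. The notation below (loss f, learning rate \eta, momentum coefficient \beta, momentum buffer p_t with p_{-1} = 0) is assumed here for illustration and may differ from the paper's. The momentum identity is obtained by expanding the heavy-ball update under the assumption that every gradient is exactly orthogonal to its weight.

```latex
% Scale invariance and its two consequences
f(c\,w) = f(w) \quad \forall c > 0
\quad\Longrightarrow\quad
w^\top \nabla f(w) = 0,
\qquad
\nabla f(c\,w) = \tfrac{1}{c}\,\nabla f(w).

% Plain gradient descent: the norm grows by the squared step length
w_{t+1} = w_t - \eta\,\nabla f(w_t)
\quad\Longrightarrow\quad
\|w_{t+1}\|^2 = \|w_t\|^2 + \eta^2\,\|\nabla f(w_t)\|^2.

% Heavy-ball momentum: past updates add an extra, always non-negative term
p_t = \beta\,p_{t-1} + \nabla f(w_t),
\qquad
w_{t+1} = w_t - \eta\,p_t
\quad\Longrightarrow\quad
\|w_{t+1}\|^2 = \|w_t\|^2 + \eta^2\,\|p_t\|^2
              + 2\eta^2 \sum_{k=0}^{t-1} \beta^{\,t-k}\,\|p_k\|^2.
```

Because the gradient scales as 1/\|w\|, a step of size \eta on w moves the normalized weight w/\|w\| roughly like a step of size \eta/\|w\|^2. The extra sum in the momentum case therefore makes this effective step size shrink faster than under plain gradient descent.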

Proposed method

Model the effect of scale invariance on effective step sizes in SGD/Adam with momentum.
Derive that momentum accelerates the growth of weight norms, which in turn speeds up the decay of effective step sizes on the sphere of normalized weights.
Introduce a projection operator onto the tangent space at the current weight to remove the radial (norm-increasing) component from each update.
Define SGDP and AdamP as momentum-based optimizers that apply the projection conditionally: a weight is treated as scale-invariant when the cosine similarity between the weight and its gradient is close to zero.
Argue that projected updates preserve effective directions on the normalized weight sphere, maintaining convergence properties.
Provide practical algorithms (SGDP and AdamP) with channel-wise and layer-wise variants.
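
The steps listed above can be sketched in a few lines of plain Python. This is a minimal illustration and not the authors' implementation: the helper names (`project_tangent`, `sgdp_step`), the list-based vectors, the default hyperparameters, and the exact form of the cosine threshold (`delta / sqrt(dim)`) are assumptions made for this example.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def project_tangent(w, x):
    """Remove from x its component along w (the radial direction)."""
    scale = dot(w, x) / dot(w, w)
    return [xi - scale * wi for xi, wi in zip(x, w)]


def sgdp_step(w, grad, buf, lr=0.1, beta=0.9, delta=0.1, eps=1e-8):
    """One momentum step with a conditional radial projection (illustrative)."""
    # Heavy-ball momentum buffer.
    buf = [beta * b + g for b, g in zip(buf, grad)]
    # Absolute cosine similarity between weight and gradient; a value near
    # zero is taken as a sign that the weight is scale-invariant.
    cos = abs(dot(w, grad)) / (norm(w) * norm(grad) + eps)
    if cos < delta / math.sqrt(len(w)):  # assumed threshold form
        update = project_tangent(w, buf)
    else:
        update = buf
    w = [wi - lr * ui for wi, ui in zip(w, update)]
    return w, buf


# The gradient is orthogonal to w, and the old buffer points along w, so the
# projection strips the radial part that momentum would otherwise carry over.
w, buf = sgdp_step([3.0, 4.0], [-4.0, 3.0], [0.3, 0.4])
print(w, buf)
```

Detecting scale invariance from the cosine similarity means the caller does not have to mark which parameters sit in front of a normalization layer. Weights whose gradients are not orthogonal to them fall through to the ordinary momentum update.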

Experimental results

Research questions

RQ1: How does momentum interact with scale-invariant weights to affect the effective learning rate during training?
RQ2: Can projecting out the radial component of updates restore or preserve the momentum benefits on the effective weight space?
RQ3: Do SGDP and AdamP improve performance over standard SGD/AdamW/Adam across diverse tasks and architectures?
RQ4: Is the proposed projection approach computationally efficient enough for large-scale training?

Key findings

Momentum with scale-invariant weights leads to accelerated growth of weight norms, causing rapid decay of effective step sizes.
A simple projection of the momentum update onto the tangent space of the weight sphere prevents norm accumulation while preserving update directions.
SGDP and AdamP show consistent performance gains across 13 benchmarks including ImageNet, retrieval, detection, robustness, audio, and language modeling tasks.
AdamP outperforms baselines on several tasks, e.g., image classification, object detection, robustness benchmarks, and audio classification, with modest overhead.
In Transformer-based language modeling, AdamP improves perplexity on WikiText-103 when the model is trained with weight normalization.
Retrieval benchmarks with ℓ2-normalized embeddings show AdamP yielding gains over AdamW across multiple datasets.
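
The first two findings can be reproduced qualitatively on a toy problem. The sketch below is not taken from the paper: it assumes a two-dimensional loss f(w) = -cos(w, target), which depends only on the direction of w, and compares heavy-ball momentum with and without the radial projection. The function names and hyperparameters are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def grad(w, target):
    """Gradient of f(w) = -cos(w, target), a scale-invariant toy loss."""
    r = math.sqrt(dot(w, w))
    u = [wi / r for wi in w]
    c = dot(u, target)
    # The gradient is tangent to the sphere at w and shrinks as 1 / ||w||.
    return [-(ti - c * ui) / r for ti, ui in zip(target, u)]


def run(steps, project, lr=0.1, beta=0.9):
    """Run momentum descent and return the final weight norm."""
    w, target = [1.0, 0.0], [0.0, 1.0]
    buf = [0.0, 0.0]
    for _ in range(steps):
        g = grad(w, target)
        buf = [beta * b + gi for b, gi in zip(buf, g)]
        step = buf
        if project:
            # Drop the component of the momentum buffer along w.
            s = dot(w, buf) / dot(w, w)
            step = [bi - s * wi for bi, wi in zip(buf, w)]
        w = [wi - lr * si for wi, si in zip(w, step)]
    return math.sqrt(dot(w, w))


for steps in (2, 10, 100):
    print(steps, run(steps, project=False), run(steps, project=True))
```

In both runs the norm can only grow, since each step either is orthogonal to w or has an outward radial part. After the first two steps the unprojected run already has the larger norm, which is the accumulation effect that the projection is meant to suppress.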

This review was created by AI and reviewed by human editors.