QUICK REVIEW

[Paper Review] AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights

Byeongho Heo, Sanghyuk Chun|arXiv (Cornell University)|2020. 06. 15.

Advanced Neural Network Applications | References: 66 | Citations: 81

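One-line summary

AdamP adds a projection-based update to momentum optimizers that removes the radial component of each step, preserving the effective step size for scale-invariant weights and improving performance across a variety of tasks.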

ABSTRACT

Normalization techniques are a boon for modern deep learning. They let weights converge more quickly with often better generalization performances. It has been argued that the normalization-induced scale invariance among the weights provides an advantageous ground for gradient descent (GD) optimizers: the effective step sizes are automatically reduced over time, stabilizing the overall training procedure. It is often overlooked, however, that the additional introduction of momentum in GD optimizers results in a far more rapid reduction in effective step sizes for scale-invariant weights, a phenomenon that has not yet been studied and may have caused unwanted side effects in the current practice. This is a crucial issue because arguably the vast majority of modern deep neural networks consist of (1) momentum-based GD (e.g. SGD or Adam) and (2) scale-invariant parameters. In this paper, we verify that the widely-adopted combination of the two ingredients lead to the premature decay of effective step sizes and sub-optimal model performances. We propose a simple and effective remedy, SGDP and AdamP: get rid of the radial component, or the norm-increasing direction, at each optimizer step. Because of the scale invariance, this modification only alters the effective step sizes without changing the effective update directions, thus enjoying the original convergence properties of GD optimizers. Given the ubiquity of momentum GD and scale invariance in machine learning, we have evaluated our methods against the baselines on 13 benchmarks. They range from vision tasks like classification (e.g. ImageNet), retrieval (e.g. CUB and SOP), and detection (e.g. COCO) to language modelling (e.g. WikiText) and audio classification (e.g. DCASE) tasks. We verify that our solution brings about uniform gains in those benchmarks. Source code is available at https://github.com/clovaai/AdamP.

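Motivation and objectives

Motivate the problem: normalization layers make the weights scale-invariant, and under momentum-based optimization this causes the effective step size to shrink prematurely.
Investigate how momentum accelerates norm growth for scale-invariant weights and thereby degrades training efficiency.
Propose a simple projection-based remedy (SGDP/AdamP) that stabilizes the effective step size while preserving the update direction.
Demonstrate the method's effectiveness across a range of benchmarks and network architectures.
Provide practical guidance and code for applying the approach in real training pipelines.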

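Proposed method

Model how scale invariance affects the effective step size in SGD and Adam with momentum.
Derive that, under momentum, growth of the weight norm accelerates the decay of the effective step on the sphere of normalized weights.
Introduce a projection operator onto the tangent space of the weights to remove the radial (norm-increasing) component of the update.
Define SGDP and AdamP, momentum-based optimizers that apply the projection conditionally, using cosine similarity to detect scale-invariant weights.
Argue that the projected update preserves the effective direction on the normalized weight sphere, so the convergence properties are retained.
Provide practical algorithms (SGDP and AdamP) with channel-wise and layer-wise variants.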
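
The projection and the conditional rule described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it treats each weight as one flat vector (ignoring the channel-wise and layer-wise variants), and the function names, the threshold `delta`, its `delta / sqrt(dim)` scaling, and the learning-rate and momentum values are all assumptions chosen for illustration. The reference implementation is in the repository linked in the abstract.

```python
import numpy as np

def project_tangent(w, update):
    """Remove the component of `update` parallel to `w` (the radial part)."""
    w_hat = w / (np.linalg.norm(w) + 1e-12)
    return update - np.dot(w_hat, update) * w_hat

def looks_scale_invariant(w, grad, delta=0.1):
    """Heuristic check: for a scale-invariant weight the gradient is
    (nearly) orthogonal to w, so their cosine similarity is close to zero.
    `delta` is an assumed threshold, not a value taken from this review."""
    cos = abs(np.dot(w, grad)) / (np.linalg.norm(w) * np.linalg.norm(grad) + 1e-12)
    return cos < delta / np.sqrt(w.size)

def sgdp_step(w, grad, momentum_buf, lr=0.1, beta=0.9, delta=0.1):
    """One momentum-SGD step with the conditional projection applied."""
    momentum_buf = beta * momentum_buf + grad
    update = momentum_buf
    if looks_scale_invariant(w, grad, delta):
        # Drop the norm-increasing direction; keep the tangential direction.
        update = project_tangent(w, update)
    return w - lr * update, momentum_buf

w = np.array([1.0, 0.0])
grad = np.array([0.0, 1.0])   # orthogonal to w, as for a scale-invariant weight
buf = np.array([1.0, 0.0])    # momentum that has drifted into the radial direction
w_next, buf = sgdp_step(w, grad, buf)
print(np.dot(w, w_next - w))  # radial part of the applied step (close to zero)
```

An Adam-style variant would apply the same projection to the Adam update instead of the raw momentum buffer. Note that the momentum buffer itself is left unprojected in this sketch; only the step applied to the weights is changed.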

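Experimental results

Research questions

RQ1 How does momentum interact with scale-invariant weights to affect the effective learning rate during training?
RQ2 Can projecting out the radial component of the update recover or preserve the benefits of momentum in the effective weight space?
RQ3 Do SGDP and AdamP improve on standard SGD/AdamW/Adam across a range of tasks and architectures?
RQ4 Is the proposed projection approach computationally efficient enough for large-scale training?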

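Key results

Momentum on scale-invariant weights accelerates growth of the weight norm, causing a rapid decay of the effective step size.
Simply projecting the momentum update onto the tangent space of the weight sphere suppresses norm accumulation while preserving the update direction.
SGDP and AdamP show consistent gains across 13 benchmarks, including ImageNet, retrieval, detection, robustness, audio, and language modeling tasks.
AdamP outperforms the baselines on many tasks, such as image classification, object detection, robustness benchmarks, and audio classification, with only a small overhead.
In Transformer-based language modeling, applying AdamP together with weight normalization improves perplexity on WikiText-103.
On retrieval benchmarks that use ℓ2-normalized embeddings, AdamP yields gains over AdamW on many datasets.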
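
The link between weight norm and effective step size in the first result can be checked numerically on a toy problem. The sketch below is an illustration under assumptions, not an experiment from the paper: the loss, the `target` vector, and all function names are hypothetical. It uses a loss that depends only on the direction of `w`, so its gradient is orthogonal to `w` and shrinks as `1 / ||w||`; a gradient step with a fixed learning rate therefore rotates `w` by an angle that shrinks roughly as `1 / ||w||^2`.

```python
import numpy as np

def loss(w, target):
    """Toy scale-invariant loss: depends only on the direction w / ||w||."""
    return -np.dot(w / np.linalg.norm(w), target)

def grad(w, target):
    """Analytic gradient of `loss`; orthogonal to w and scaling as 1 / ||w||."""
    norm = np.linalg.norm(w)
    w_hat = w / norm
    return -(target - np.dot(w_hat, target) * w_hat) / norm

def angular_step(w, target, lr):
    """Angle in radians between w and the weights after one plain GD step."""
    w_new = w - lr * grad(w, target)
    cos = np.dot(w, w_new) / (np.linalg.norm(w) * np.linalg.norm(w_new))
    return np.arccos(np.clip(cos, -1.0, 1.0))

target = np.array([0.0, 1.0])
w = np.array([1.0, 0.0])
for scale in (1.0, 2.0, 4.0):
    # Same direction and same learning rate, but a larger weight norm.
    print(scale, angular_step(scale * w, target, lr=0.01))
```

Each doubling of the norm cuts the rotation by about a factor of four, which is why faster norm growth under momentum translates into a faster decay of the effective step size.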


This review was generated by AI and reviewed by a human editor.