QUICK REVIEW

[Paper Review] Understanding AdamW through Proximal Methods and Scale-Freeness

Zhenxun Zhuang, Mingrui Liu | arXiv (Cornell University) | 2022-01-31

Neural Networks and Applications | Citations: 36

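One-Line Summary

The paper shows that AdamW approximates a proximal update and is scale-free, and that it offers an optimization advantage over Adam-$\ell_2$, most visibly in very deep networks trained without batch normalization; it also links scale-freeness to a reduced dependence on the condition number.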

ABSTRACT

Adam has been widely adopted for training deep neural networks due to less hyperparameter tuning and remarkable performance. To improve generalization, Adam is typically used in tandem with a squared $\ell_2$ regularizer (referred to as Adam-$\ell_2$). However, even better performance can be obtained with AdamW, which decouples the gradient of the regularizer from the update rule of Adam-$\ell_2$. Yet, we are still lacking a complete explanation of the advantages of AdamW. In this paper, we tackle this question from both an optimization and an empirical point of view. First, we show how to re-interpret AdamW as an approximation of a proximal gradient method, which takes advantage of the closed-form proximal mapping of the regularizer instead of only utilizing its gradient information as in Adam-$\ell_2$. Next, we consider the property of "scale-freeness" enjoyed by AdamW and by its proximal counterpart: their updates are invariant to component-wise rescaling of the gradients. We provide empirical evidence across a wide range of deep learning experiments showing a correlation between the problems in which AdamW exhibits an advantage over Adam-$\ell_2$ and the degree to which we expect the gradients of the network to exhibit multiple scales, thus motivating the hypothesis that the advantage of AdamW could be due to the scale-free updates.

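Motivation and Objectives

Explain how decoupled weight decay (AdamW) improves generalization and optimization over Adam with L2 regularization (Adam-$\ell_2$).
Present a proximal-optimization view that connects AdamW to proximal updates, and use scale-freeness to explain its empirical advantages.
Examine training scenarios, notably very deep networks without batch normalization, in which AdamW clearly outperforms Adam-$\ell_2$.
Assess how robust AdamW's scale-freeness is in practice, where $\epsilon$ is nonzero, and relate it to the behavior of updates in deep networks.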

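Proposed Method

Derives that AdamW approximates a proximal update on the regularized objective $F(x) = \frac{\lambda}{2}\|x\|_2^2 + f(x)$.
Shows that AdamW is the first-order Taylor approximation of the proximal update with $M_t = \eta_t I_d$ and $p_t = \alpha \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$.
Proves that AdamW and the proximal update are scale-free when $\epsilon = 0$, whereas Adam-$\ell_2$ loses scale-freeness whenever $\lambda > 0$.
Argues theoretically that scale-freeness acts as automatic preconditioning and improves the dependence on the condition number for certain classes of functions.
Verifies scale-freeness empirically by rescaling the loss in networks without batch normalization and observing how stable the updates remain.
Compares AdamW, AdamProx, and Adam-$\ell_2$ on CIFAR-10/100 using ResNet and DenseNet architectures, with and without batch normalization (BN).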
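
The first two steps above can be written out explicitly. The block below is a sketch reconstructed from the quantities named in this section, not a quotation of the paper's equations; the shorthand $z_t$ and the variable $y$ are introduced here only for readability.

```latex
% z_t: the step taken on f alone (shorthand introduced for this sketch).
\begin{align*}
z_t &= x_{t-1} - \eta_t p_t,
  \qquad p_t = \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} \\
% Proximal update with M_t = \eta_t I_d applied to the regularizer (\lambda/2)\|y\|_2^2.
x_t &= \arg\min_{y} \; \frac{\lambda}{2}\|y\|_2^2 + \frac{1}{2\eta_t}\|y - z_t\|_2^2 \\
% Setting the gradient with respect to y to zero gives the closed form.
0 &= \lambda x_t + \frac{1}{\eta_t}(x_t - z_t)
  \;\Longrightarrow\;
  x_t = \frac{x_{t-1} - \eta_t p_t}{1 + \lambda \eta_t} \\
% First-order Taylor expansion in \eta_t, dropping O(\eta_t^2) terms, recovers AdamW.
\frac{1}{1 + \lambda \eta_t} &= 1 - \lambda \eta_t + O(\eta_t^2)
  \;\Longrightarrow\;
  x_t \approx (1 - \lambda \eta_t)\, x_{t-1} - \eta_t p_t
\end{align*}
```

Read this way, Adam-$\ell_2$ differs in that $\lambda x_{t-1}$ is added to the gradient before $\hat{m}_t$ and $\hat{v}_t$ are computed, so the regularizer passes through the adaptive normalization instead of acting on $x_{t-1}$ directly.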

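Experimental Results

Research Questions

RQ1: Does AdamW act as a proximal update on the regularized objective, and if so, under what approximation?
RQ2: How does scale-freeness affect the optimization behavior and convergence of AdamW compared with Adam-$\ell_2$?
RQ3: In which training settings (for example, very deep networks without BN) does AdamW outperform Adam-$\ell_2$, and why?
RQ4: In practice, where $\epsilon$ is nonzero, is AdamW still nearly scale-free, and how robust is this property?
RQ5: Do AdamW and AdamProx show similar optimization dynamics under common learning-rate schedules?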

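Key Findings

AdamW approximates a proximal update that uses the regularizer in full, through its closed-form proximal mapping, rather than only its gradient.
AdamW and the proximal update are scale-free when $\epsilon = 0$ (and approximately so for small $\epsilon$), whereas Adam-$\ell_2$ loses scale-freeness whenever $\lambda > 0$.
Scale-freeness acts as automatic preconditioning, reducing sensitivity to the condition number for certain functions.
In very deep networks without batch normalization, AdamW clearly outperforms Adam-$\ell_2$ in both training and test performance.
As network depth grows, the update scales of Adam-$\ell_2$ become more varied than those of AdamW, and this correlates with a larger accuracy gain for AdamW.
Under common learning-rate schedules, AdamW behaves almost identically to AdamProx, which supports the proximal interpretation.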
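
The scale-freeness and AdamW-versus-AdamProx findings above can be checked on a toy problem. The following is a minimal sketch in plain Python, not the paper's experimental code: the objective, the start point, the hyperparameter values, the variant labels, and the helper names `run` and `max_gap` are all assumptions made for illustration. The per-coordinate `scales` multiply only the gradient of the loss $f$, which mimics gradients living on different scales.

```python
import math

VARIANTS = ("adam_l2", "adamw", "adamprox")  # labels chosen for this sketch


def run(variant, scales, steps=200, alpha=0.05, lam=0.01, eta=1.0,
        beta1=0.9, beta2=0.999, eps=0.0):
    """Run one Adam variant on f(x) = 0.5 * sum((x_i - 1)^2) from a fixed start.

    scales: positive per-coordinate factors applied to the gradient of f only.
    eta:    schedule multiplier eta_t, held constant here for simplicity.
    eps=0 needs a nonzero first gradient, which holds for this start point.
    """
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant: {variant}")
    x = [3.0, -2.0]
    m = [0.0] * len(x)
    v = [0.0] * len(x)
    for t in range(1, steps + 1):
        for i in range(len(x)):
            g = scales[i] * (x[i] - 1.0)
            if variant == "adam_l2":
                # L2 regularization: lam * x enters the moment estimates.
                g += lam * x[i]
            m[i] = beta1 * m[i] + (1 - beta1) * g
            v[i] = beta2 * v[i] + (1 - beta2) * g * g
            m_hat = m[i] / (1 - beta1 ** t)
            v_hat = v[i] / (1 - beta2 ** t)
            p = alpha * m_hat / (math.sqrt(v_hat) + eps)
            if variant == "adam_l2":
                x[i] = x[i] - eta * p
            elif variant == "adamw":
                # Decoupled decay: first-order form (1 - lam * eta).
                x[i] = (1 - lam * eta) * x[i] - eta * p
            else:
                # Proximal form: exact shrinkage by 1 / (1 + lam * eta).
                x[i] = (x[i] - eta * p) / (1 + lam * eta)
    return x


def max_gap(a, b):
    """Largest coordinate-wise absolute difference between two iterates."""
    return max(abs(ai - bi) for ai, bi in zip(a, b))


if __name__ == "__main__":
    same = [1.0, 1.0]     # gradients left as they are
    mixed = [1e3, 1e-3]   # gradients rescaled per coordinate
    for name in VARIANTS:
        gap = max_gap(run(name, same), run(name, mixed))
        print(f"{name}: gap under gradient rescaling = {gap:.3e}")
    gap = max_gap(run("adamw", same), run("adamprox", same))
    print(f"adamw vs adamprox gap = {gap:.3e}")
```

With `eps=0.0`, the two decoupled variants should give the same iterates up to floating-point rounding regardless of `scales`, while the Adam-$\ell_2$ iterates shift because the unscaled term `lam * x[i]` changes its weight relative to the rescaled loss gradient. Passing a small positive `eps` is one way to probe how close to scale-free the updates remain in that setting.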

This review was generated by AI and checked by a human editor.