QUICK REVIEW

[Paper Review] AdaBelief Optimizer: Adapting Stepsizes by the Belief in Observed Gradients

Juntang Zhuang, Tommy Tang|arXiv (Cornell University)|2020. 10. 15.

Generative Adversarial Networks and Image Synthesis | References: 60 | Citations: 121

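One-Line Summary

AdaBelief adapts the step size according to its belief in the observed gradient, achieving the fast convergence of adaptive methods, the good generalization of SGD, and stable GAN training, all without introducing extra hyperparameters.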

ABSTRACT

Most popular optimizers for deep learning can be broadly categorized as adaptive methods (e.g. Adam) and accelerated schemes (e.g. stochastic gradient descent (SGD) with momentum). For many models such as convolutional neural networks (CNNs), adaptive methods typically converge faster but generalize worse compared to SGD; for complex settings such as generative adversarial networks (GANs), adaptive methods are typically the default because of their stability. We propose AdaBelief to simultaneously achieve three goals: fast convergence as in adaptive methods, good generalization as in SGD, and training stability. The intuition for AdaBelief is to adapt the stepsize according to the "belief" in the current gradient direction. Viewing the exponential moving average (EMA) of the noisy gradient as the prediction of the gradient at the next time step, if the observed gradient greatly deviates from the prediction, we distrust the current observation and take a small step; if the observed gradient is close to the prediction, we trust it and take a large step. We validate AdaBelief in extensive experiments, showing that it outperforms other methods with fast convergence and high accuracy on image classification and language modeling. Specifically, on ImageNet, AdaBelief achieves comparable accuracy to SGD. Furthermore, in the training of a GAN on Cifar10, AdaBelief demonstrates high stability and improves the quality of generated samples compared to a well-tuned Adam optimizer. Code is available at https://github.com/juntang-zhuang/Adabelief-Optimizer

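Motivation and Objectives

Motivates combining the fast convergence of adaptive methods with SGD-like generalization and with training stability on difficult models such as GANs.
Introduces AdaBelief as a simple modification of Adam that adjusts the step size using the belief in the observed gradient relative to its prediction.
Provides theoretical convergence guarantees for both convex and non-convex settings.
Empirically validates AdaBelief across image classification, language modeling, and GANs, and reports performance gains.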

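Proposed Method

Defines AdaBelief by replacing Adam's denominator sqrt(v_t) with sqrt(s_t), where s_t is the EMA of (g_t - m_t)^2 and m_t is the EMA of the gradient g_t.
Interprets 1/sqrt(s_t) as the 'belief' in the current gradient observation g_t relative to the prediction m_t.
Keeps the same hyperparameters and structure as Adam, so it can be adopted easily, and in practice includes the standard bias correction for m_t and s_t.
Provides a theoretical convergence analysis for convex and non-convex stochastic optimization (Theorem 2.1 and Theorem 2.2, with corollaries).
Validates the method empirically on CIFAR and ImageNet for image classification, Penn TreeBank for language modeling, and GANs (WGAN/WGAN-GP) for generation quality and stability.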
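
The update described in the list above can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the authors' reference implementation (that is in the repository linked in the abstract): the function name `adabelief_step`, the default values, and the placement of `eps` in the denominator are illustrative choices borrowed from common Adam conventions.

```python
import numpy as np


def adabelief_step(theta, grad, m, s, t,
                   lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AdaBelief-style update (hypothetical helper for illustration).

    theta : current parameters
    grad  : observed gradient g_t
    m, s  : running EMAs from the previous step (start both at zero)
    t     : 1-based step counter, used for bias correction
    """
    # m_t: EMA of the gradient, read as the prediction of g_t.
    m = beta1 * m + (1.0 - beta1) * grad
    # s_t: EMA of the squared deviation (g_t - m_t)^2.
    # Adam would track grad ** 2 here instead.
    s = beta2 * s + (1.0 - beta2) * (grad - m) ** 2
    # Standard bias correction for both moving averages.
    m_hat = m / (1.0 - beta1 ** t)
    s_hat = s / (1.0 - beta2 ** t)
    # Small deviation from the prediction -> small s_hat -> larger step.
    # eps placement is an assumption made for numerical safety.
    theta = theta - lr * m_hat / (np.sqrt(s_hat) + eps)
    return theta, m, s


# Usage: a few steps on f(theta) = 0.5 * theta^2, whose gradient is theta.
theta = np.array([5.0])
m = np.zeros_like(theta)
s = np.zeros_like(theta)
for t in range(1, 11):
    grad = theta.copy()
    theta, m, s = adabelief_step(theta, grad, m, s, t, lr=0.1)
print(theta)
```

The only line that differs from a plain Adam step is the update of `s`, which is why the method needs no hyperparameters beyond Adam's.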

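Experimental Results

Research Questions

RQ1: Does AdaBelief improve generalization and stability while retaining fast convergence in practice?
RQ2: Can it achieve SGD-like generalization and stability in GAN training without additional hyperparameters?
RQ3: What convergence guarantees does AdaBelief have in convex and non-convex settings?
RQ4: How does AdaBelief perform against Adam- and SGD-based methods on a large-scale dataset (ImageNet) and across diverse tasks in vision, language, and generative modeling?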

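Key Results

On CIFAR, ImageNet, and language modeling tasks, AdaBelief achieves fast convergence comparable to Adam and generalization comparable to SGD.
In GAN training, AdaBelief improves sample quality and training stability over a well-tuned Adam baseline (lower FID scores on WGAN/WGAN-GP).
On ImageNet, AdaBelief reaches top-1 accuracy comparable to SGD and to methods with decoupled weight decay, narrowing the generalization gap observed for some adaptive methods.
The theoretical results show that, under standard assumptions, AdaBelief attains O(sqrt(T)) regret in the convex setting and an O(log T / sqrt(T)) convergence rate in non-convex stochastic optimization.
Empirical results include strong performance with VGG/ResNet/DenseNet on CIFAR and ImageNet, improved LSTM perplexity on Penn TreeBank, and better GAN metrics (FID) across a range of configurations.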
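
To make the two rates in the theoretical result concrete, the block below spells out what they are usually taken to mean. The convex line uses the standard online-learning definition of regret. The non-convex line assumes the bounded quantity is the smallest expected squared gradient norm, a common stationarity measure. This review does not reproduce the exact statement of Theorem 2.2, so that second line should be read as an assumption.

```latex
% Convex setting: regret against the best fixed parameter in hindsight.
R(T) = \sum_{t=1}^{T} f_t(\theta_t) - \min_{\theta} \sum_{t=1}^{T} f_t(\theta)
     = O\!\left(\sqrt{T}\right)
\quad\Longrightarrow\quad
\frac{R(T)}{T} = O\!\left(\frac{1}{\sqrt{T}}\right) \to 0 .

% Non-convex stochastic setting (assumed stationarity measure).
\min_{1 \le t \le T} \mathbb{E}\,\bigl\| \nabla f(\theta_t) \bigr\|^{2}
     = O\!\left(\frac{\log T}{\sqrt{T}}\right) .
```

In both cases the bound vanishes as T grows: the average regret per step goes to zero in the convex case, and some iterate becomes approximately stationary in the non-convex case.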

This review was generated by AI and reviewed by a human editor.