QUICK REVIEW

[Paper Review] Convergence guarantees for RMSProp and ADAM in non-convex optimization and an empirical comparison to Nesterov acceleration

Soham De, Anirbit Mukherjee|arXiv (Cornell University)|2018. 07. 18.

Stochastic Gradient Optimization Techniques|References 37|Citations 82

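One-Line Summary

The paper presents convergence guarantees for RMSProp and ADAM in smooth non-convex optimization and compares them empirically with Nesterov acceleration on autoencoders and on CIFAR-10. It also analyzes hyperparameter sensitivity, focusing on ADAM's momentum parameter $β_1$.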

ABSTRACT

RMSProp and ADAM continue to be extremely popular algorithms for training neural nets but their theoretical convergence properties have remained unclear. Further, recent work has seemed to suggest that these algorithms have worse generalization properties when compared to carefully tuned stochastic gradient descent or its momentum variants. In this work, we make progress towards a deeper understanding of ADAM and RMSProp in two ways. First, we provide proofs that these adaptive gradient algorithms are guaranteed to reach criticality for smooth non-convex objectives, and we give bounds on the running time. Next we design experiments to empirically study the convergence and generalization properties of RMSProp and ADAM against Nesterov's Accelerated Gradient method on a variety of common autoencoder setups and on VGG-9 with CIFAR-10. Through these experiments we demonstrate the interesting sensitivity that ADAM has to its momentum parameter $β_1$. We show that at very high values of the momentum parameter ($β_1 = 0.99$) ADAM outperforms a carefully tuned NAG on most of our experiments, in terms of getting lower training and test losses. On the other hand, NAG can sometimes do better when ADAM's $β_1$ is set to the most commonly used value: $β_1 = 0.9$, indicating the importance of tuning the hyperparameters of ADAM to get better generalization performance. We also report experiments on different autoencoders to demonstrate that NAG has better abilities in terms of reducing the gradient norms, and it also produces iterates which exhibit an increasing trend for the minimum eigenvalue of the Hessian of the loss function at the iterates.

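Motivation and Objectives

Provide the first convergence guarantees for adaptive gradient methods (RMSProp and ADAM) in non-convex optimization.
Derive upper bounds on the running time needed to reach an approximate critical point under a smoothness assumption.
Empirically compare RMSProp and ADAM with Nesterov's Accelerated Gradient (NAG) on autoencoders and on CIFAR-10.
Highlight hyperparameter sensitivity, in particular to ADAM's momentum parameter $β_1$, and generalization trends.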
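
The two notions these objectives rest on can be written out as follows. These are the standard textbook definitions, stated here as an assumption about what the review means by "smoothness" and "approximate critical point"; the paper's exact statements may differ in constants or in the choice of norm. The symbols $L$ and $\epsilon$ are illustrative.

```latex
% L-smoothness: the gradient of f is L-Lipschitz
\|\nabla f(x) - \nabla f(y)\| \le L \, \|x - y\| \quad \text{for all } x, y

% Approximate (epsilon-) critical point: the gradient norm is small
\|\nabla f(x)\| \le \epsilon
```

A running-time bound in this setting says how many iterations suffice before some iterate satisfies the second condition.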

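Proposed Method

Define an L-smooth non-convex objective with finite-sum structure $f(x) = \frac{1}{k} \sum_{p=1}^{k} f_p(x)$.
Introduce and analyze the RMSProp and ADAM updates in both the deterministic and the stochastic setting.
Prove that stochastic RMSProp converges to an approximate critical point under a technical assumption on the gradient oracle.
Compare against Nesterov's Accelerated Gradient (NAG) through experiments on autoencoders and on VGG-9 with CIFAR-10.
Use a diagonal-preconditioner framework for adaptive methods together with the corresponding convergence proofs.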
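
To make the update rules concrete, here is a minimal NumPy sketch of one RMSProp step and one ADAM step, both of which scale the gradient coordinate-wise (a diagonal preconditioner). It uses the commonly taught textbook forms, including ADAM's bias correction, which is an assumption: the variants analyzed in the paper may differ in details such as step-size schedules or bias correction. The function names, the default hyperparameters, the small constant `xi`, and the toy objective are all hypothetical choices for illustration; the toy objective is convex and only serves to exercise the updates.

```python
import numpy as np


def rmsprop_step(x, grad, v, lr=0.01, beta2=0.99, xi=1e-8):
    """One RMSProp step with a diagonal preconditioner built from squared gradients."""
    v = beta2 * v + (1.0 - beta2) * grad ** 2   # running average of squared gradients
    x = x - lr * grad / (np.sqrt(v) + xi)       # coordinate-wise scaled step
    return x, v


def adam_step(x, grad, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, xi=1e-8):
    """One ADAM step (textbook form with bias correction); t counts from 1."""
    m = beta1 * m + (1.0 - beta1) * grad        # momentum: running average of gradients
    v = beta2 * v + (1.0 - beta2) * grad ** 2   # running average of squared gradients
    m_hat = m / (1.0 - beta1 ** t)              # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** t)              # bias-corrected second moment
    x = x - lr * m_hat / (np.sqrt(v_hat) + xi)
    return x, m, v


def toy_grad(x):
    """Gradient of the hypothetical toy objective f(x) = 0.5 * ||x||^2."""
    return x


if __name__ == "__main__":
    x = np.array([1.0, -2.0])
    m, v = np.zeros_like(x), np.zeros_like(x)
    for t in range(1, 501):
        # beta1 is the momentum parameter whose sensitivity the paper studies
        x, m, v = adam_step(x, toy_grad(x), m, v, t, beta1=0.99)
    print("gradient norm after 500 ADAM steps:", np.linalg.norm(toy_grad(x)))
```

Changing `beta1` between 0.9 and 0.99 in the loop is the knob the paper's sensitivity experiments turn, although a toy quadratic says nothing about how the two settings compare on neural networks.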

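Experimental Results

Research Questions

RQ1: Do RMSProp and ADAM converge to approximate critical points in smooth non-convex optimization?
RQ2: What upper bounds hold on the running time these adaptive methods need to reach an approximate stationary point?
RQ3: How do RMSProp and ADAM differ from NAG in training and generalization performance on neural nets?
RQ4: How does the momentum parameter $β_1$ affect ADAM's performance and generalization?
RQ5: Do adaptive methods generalize differently from non-adaptive methods as network size grows?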

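Key Findings

Establishes the first convergence guarantees showing that adaptive gradient methods (RMSProp and ADAM) reach approximate critical points on smooth non-convex objectives.
Convergence of stochastic RMSProp is shown under an additional assumption on the gradient oracle.
The experiments show that ADAM is highly sensitive to its momentum parameter $β_1$: with $β_1 = 0.99$ it outperforms a carefully tuned NAG on most experiments, whereas with the common default $β_1 = 0.9$ NAG can sometimes do better.
In the full-batch setting and at large network sizes, ADAM with a large $β_1$ sometimes achieves lower training and test losses on autoencoders than NAG and RMSProp.
On autoencoders, NAG tends to reduce gradient norms more effectively and to produce iterates along which the minimum eigenvalue of the Hessian of the loss shows an increasing trend.
The empirical comparison extends beyond autoencoders to VGG-9 on CIFAR-10.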
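
The two diagnostics mentioned for the autoencoder experiments, the gradient norm and the minimum eigenvalue of the Hessian along the iterates, can be illustrated on a toy problem. The sketch below is not the paper's experimental setup: the two-dimensional objective, the step size, the momentum value, and the helper names are hypothetical, and the NAG step uses the common look-ahead formulation as an assumption.

```python
import numpy as np


def grad(x):
    """Gradient of the hypothetical objective f(x) = x0^2 + 0.25*x1^4 - 0.5*x1^2."""
    return np.array([2.0 * x[0], x[1] ** 3 - x[1]])


def hessian(x):
    """Analytic Hessian of the same toy objective."""
    return np.array([[2.0, 0.0], [0.0, 3.0 * x[1] ** 2 - 1.0]])


def diagnostics(x):
    """Return (gradient norm, minimum Hessian eigenvalue) at x."""
    eigenvalues = np.linalg.eigvalsh(hessian(x))  # sorted in ascending order
    return float(np.linalg.norm(grad(x))), float(eigenvalues[0])


def nag_step(x, velocity, lr=0.05, momentum=0.9):
    """One NAG step: the gradient is evaluated at the look-ahead point."""
    lookahead = x + momentum * velocity
    velocity = momentum * velocity - lr * grad(lookahead)
    return x + velocity, velocity


def run_nag(x0, steps=300):
    """Run NAG from x0 and record the diagnostics at every iterate."""
    x, velocity = np.array(x0, dtype=float), np.zeros(len(x0))
    history = [diagnostics(x)]
    for _ in range(steps):
        x, velocity = nag_step(x, velocity)
        history.append(diagnostics(x))
    return x, history


if __name__ == "__main__":
    # Start close to the saddle point at the origin, where the Hessian has a negative eigenvalue.
    _, history = run_nag([1.0, 0.1])
    for step in (0, 10, 300):
        print(step, history[step])
```

For a real network the Hessian is far too large to form explicitly, so the minimum eigenvalue would have to be estimated, for example with iterative methods based on Hessian-vector products; the analytic Hessian here is only feasible because the toy problem has two parameters.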

This review was generated by AI and reviewed by a human editor.