QUICK REVIEW

[Paper Review] Train longer, generalize better: closing the generalization gap in large batch training of neural networks

Elad Hoffer, Itay Hubara | arXiv (Cornell University) | 2017-05-24

Domain Adaptation and Few-Shot Learning | References: 38 | Citations: 418

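One-line summary

The paper argues that the generalization gap of large-batch SGD is caused by too few weight updates rather than by the batch size itself, and shows how learning-rate scaling, Ghost Batch Normalization, and regime adaptation narrow the gap.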

ABSTRACT

Background: Deep learning models are typically trained using stochastic gradient descent or one of its variants. These methods update the weights using their gradient, estimated from a small fraction of the training data. It has been observed that when using large batch sizes there is a persistent degradation in generalization performance - known as the "generalization gap" phenomena. Identifying the origin of this gap and closing it had remained an open problem. Contributions: We examine the initial high learning rate training phase. We find that the weight distance from its initialization grows logarithmically with the number of weight updates. We therefore propose a "random walk on random landscape" statistical model which is known to exhibit similar "ultra-slow" diffusion behavior. Following this hypothesis we conducted experiments to show empirically that the "generalization gap" stems from the relatively small number of updates rather than the batch size, and can be completely eliminated by adapting the training regime used. We further investigate different techniques to train models in the large-batch regime and present a novel algorithm named "Ghost Batch Normalization" which enables significant decrease in the generalization gap without increasing the number of updates. To validate our findings we conduct several additional experiments on MNIST, CIFAR-10, CIFAR-100 and ImageNet. Finally, we reassess common practices and beliefs concerning training of deep models and suggest they may not be optimal to achieve good generalization.

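Motivation and objectives

Motivate and characterize the generalization gap observed in large-batch training of neural networks.
Propose a statistical model (a random walk on a random potential) to explain the weight dynamics during early training.
Develop practical methods for narrowing the gap: learning-rate scaling, Ghost Batch Normalization (GBN), and regime adaptation.
Validate empirically on MNIST, CIFAR-10/100, and ImageNet across several architectures.
Reassess common training practices and emphasize that generalization depends on the number of updates rather than simply on the batch size.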

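Proposed method

Model SGD as a random walk on a random potential to explain the ultra-slow diffusion of the weights.
Find that the distance of the weights from their initialization, ||w_t - w_0||, grows roughly as log t in the number of updates t, and relate the diffusion rate to the batch size.
Propose scaling the learning rate with the batch size M (eta ∝ sqrt(M)) to preserve the update statistics.
Introduce Ghost Batch Normalization, which computes batch-normalization statistics over small ghost batches inside each large batch.
Recommend regime adaptation: extend the number of training iterations so that the number of updates stays comparable regardless of batch size.
Validate empirically on standard datasets and networks, and report accuracy improvements across the small-batch (SB) and large-batch (LB) regimes.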
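
The Ghost Batch Normalization step listed above can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' implementation: the function name `ghost_batch_norm`, the 2-D input shape, the scalar `gamma` and `beta`, and the ghost size of 128 are all assumptions made for the example, and only the training-time normalization is shown (running statistics for inference are omitted).

```python
import numpy as np


def ghost_batch_norm(x, ghost_size, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each ghost batch of `x` with its own mean and variance.

    x has shape (batch, features); `batch` must be a multiple of `ghost_size`.
    Hypothetical helper for illustration only (training mode, no running stats).
    """
    batch, features = x.shape
    if batch % ghost_size != 0:
        raise ValueError("batch size must be a multiple of ghost_size")
    # View the large batch as (num_ghost_batches, ghost_size, features).
    ghosts = x.reshape(batch // ghost_size, ghost_size, features)
    # Statistics are computed per ghost batch, not over the whole large batch.
    mean = ghosts.mean(axis=1, keepdims=True)
    var = ghosts.var(axis=1, keepdims=True)
    normed = (ghosts - mean) / np.sqrt(var + eps)
    return (gamma * normed + beta).reshape(batch, features)


# Illustrative usage: a large batch of 4096 samples split into ghost batches of 128.
rng = np.random.default_rng(0)
large_batch = rng.normal(loc=3.0, scale=2.0, size=(4096, 8))
out = ghost_batch_norm(large_batch, ghost_size=128)
```

Because every group of 128 samples is normalized separately, the normalization noise seen by the network resembles that of small-batch training even though the gradient is still computed over the full large batch.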

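Experimental results

Research questions

RQ1: Can the generalization gap observed in large-batch training be eliminated without increasing the total training time?
RQ2: Through what mechanism do weight updates during early training affect final generalization, and how do batch size and number of updates interact?
RQ3: Do adjustments such as learning-rate scaling and Ghost Batch Normalization consistently reduce or eliminate the generalization gap across architectures and datasets?
RQ4: Can the large-batch training regime be extended so that it matches small-batch generalization performance?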

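Key results

Validation accuracy. SB and LB denote small-batch and large-batch training; +LR, +GBN, and +RA denote large-batch training with learning-rate scaling, Ghost Batch Normalization, and regime adaptation.

Network	Dataset	SB	LB	+LR	+GBN	+RA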
F1 (Keskar et al., 2017)	MNIST	98.27%	97.05%	97.55%	97.60%	98.53%
C1 (Keskar et al., 2017)	CIFAR-10	87.80%	83.95%	86.15%	86.40%	88.20%
Resnet44 (He et al., 2016)	CIFAR-10	92.83%	86.10%	89.30%	90.50%	93.07%
VGG (Simonyan, 2014)	CIFAR-10	92.30%	84.10%	88.60%	91.50%	93.03%
C3 (Keskar et al., 2017)	CIFAR-100	61.25%	51.50%	57.38%	57.50%	63.20%
WResnet16-4 (Zagoruyko, 2016)	CIFAR-100	73.70%	68.15%	69.05%	71.20%	73.57%
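
To make the size of the gap in the table easier to read, the following sketch computes, for each row, the SB minus LB gap and the difference that remains after +RA. The accuracy values are copied from the table; the function name `gap_summary` and the dictionary layout are assumptions made for this illustration.

```python
# Validation accuracies (%) copied from the results table.
RESULTS = {
    ("F1", "MNIST"): {"SB": 98.27, "LB": 97.05, "+LR": 97.55, "+GBN": 97.60, "+RA": 98.53},
    ("C1", "CIFAR-10"): {"SB": 87.80, "LB": 83.95, "+LR": 86.15, "+GBN": 86.40, "+RA": 88.20},
    ("Resnet44", "CIFAR-10"): {"SB": 92.83, "LB": 86.10, "+LR": 89.30, "+GBN": 90.50, "+RA": 93.07},
    ("VGG", "CIFAR-10"): {"SB": 92.30, "LB": 84.10, "+LR": 88.60, "+GBN": 91.50, "+RA": 93.03},
    ("C3", "CIFAR-100"): {"SB": 61.25, "LB": 51.50, "+LR": 57.38, "+GBN": 57.50, "+RA": 63.20},
    ("WResnet16-4", "CIFAR-100"): {"SB": 73.70, "LB": 68.15, "+LR": 69.05, "+GBN": 71.20, "+RA": 73.57},
}


def gap_summary(results):
    """Return {(network, dataset): (SB - LB, SB - (+RA))} in percentage points."""
    summary = {}
    for key, acc in results.items():
        initial_gap = acc["SB"] - acc["LB"]
        remaining_gap = acc["SB"] - acc["+RA"]  # negative means +RA beats SB
        summary[key] = (initial_gap, remaining_gap)
    return summary


for (network, dataset), (gap, remaining) in gap_summary(RESULTS).items():
    print(f"{network:12s} {dataset:10s} LB gap={gap:5.2f}  after +RA={remaining:+5.2f}")
```

On these numbers the LB gap ranges from 1.22 points (F1 on MNIST) to 9.75 points (C3 on CIFAR-100). After +RA, every row matches or exceeds SB except WResnet16-4, which stays 0.13 points below.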

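The large-batch generalization gap can be substantially reduced with learning-rate scaling and Ghost Batch Normalization.
The weight distance from initialization grows logarithmically with the number of updates and coincides across batch sizes, which suggests that diffusion dynamics govern generalization more than the batch size itself.
Scaling the learning rate with the square root of the batch size helps preserve the update statistics and improves generalization.
Ghost Batch Normalization, which computes batch statistics over small ghost batches during large-batch training, significantly reduces the generalization error.
Regime adaptation, which adjusts the number of weight updates to match the small-batch iteration count, eliminates the gap and achieves similar or better validation accuracy.
Experiments on MNIST, CIFAR-10/100, and ImageNet show consistent gains from +LR, +GBN, and +RA, often matching or exceeding the SB results.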
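
The square-root learning-rate scaling and regime adaptation described above amount to simple arithmetic on the training configuration. The sketch below is a hedged illustration: the function names, the baseline learning rate of 0.1, the 100 baseline epochs, the batch sizes 128 and 4096, and the dataset size of 50,000 are hypothetical values chosen for the example, not settings taken from the paper.

```python
import math


def adapt_large_batch_regime(base_lr, base_epochs, small_batch, large_batch):
    """Scale the learning rate by sqrt(batch ratio) and stretch the epoch count.

    Hypothetical helper: multiplying the epochs by the batch ratio keeps the
    total number of weight updates roughly equal to the small-batch run.
    """
    ratio = large_batch / small_batch
    scaled_lr = base_lr * math.sqrt(ratio)          # eta proportional to sqrt(M)
    adapted_epochs = math.ceil(base_epochs * ratio)  # more epochs, same update count
    return scaled_lr, adapted_epochs


def approx_updates(dataset_size, batch_size, epochs):
    """Approximate number of weight updates, ignoring the last partial batch."""
    return epochs * dataset_size / batch_size


# Illustrative values only.
lr, epochs = adapt_large_batch_regime(base_lr=0.1, base_epochs=100,
                                      small_batch=128, large_batch=4096)
small_run = approx_updates(50_000, 128, 100)
large_run = approx_updates(50_000, 4096, epochs)
print(f"scaled lr={lr:.4f}, adapted epochs={epochs}")
print(f"updates: small-batch={small_run:.1f}, large-batch={large_run:.1f}")
```

With a batch ratio of 32, the learning rate grows by a factor of sqrt(32) and the epoch count by a factor of 32, so the large-batch run performs about as many updates as the small-batch baseline. The cost is that the adapted run processes 32 times as many samples in total.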

This review was generated by AI and reviewed by a human editor.