QUICK REVIEW

[Paper Review] Parallel Restarted SGD for Non-Convex Optimization with Faster Convergence and Less Communication

Hao Yu, Sen Yang | arXiv (Cornell University) | 2018-07-17

Stochastic Gradient Optimization Techniques | References: 15 | Citations: 50

One-Line Summary

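This paper proposes Parallel Restarted SGD, a communication-efficient optimization method for large-scale non-convex problems. Workers exchange and average their models only at periodic restart points, which reduces inter-worker communication. The method achieves the same convergence rate as classical parallel mini-batch SGD while reducing communication overhead by a factor of $O(T^{1/4})$, which provides a theoretical justification for the empirical success of model averaging in deep learning.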

ABSTRACT

For large scale non-convex stochastic optimization, parallel mini-batch SGD using multiple workers ideally can achieve a linear speed-up with respect to the number of workers compared with SGD over a single worker. However, such linear scalability in practice is significantly limited by the growing demand for communication as more workers are involved. This is because the classical parallel mini-batch SGD requires gradient or model exchanges between workers (possibly through an intermediate server) at every iteration. In this paper, we study whether it is possible to maintain the linear speed-up property of parallel mini-batch SGD by using less frequent message passing between workers. We consider the parallel restarted SGD method where each worker periodically restarts its SGD by using the node average as a new initial point. Such a strategy invokes inter-node communication only when computing the node average to restart local SGD but otherwise is fully parallel with no communication overhead. We prove that the parallel restarted SGD method can maintain the same convergence rate as the classical parallel mini-batch SGD while reducing the communication overhead by a factor of $O(T^{1/4})$. The parallel restarted SGD strategy was previously used as a common practice, known as model averaging, for training deep neural networks. Earlier empirical works have observed that model averaging can achieve an almost linear speed-up if the averaging interval is carefully controlled. The results in this paper can serve as theoretical justifications for these empirical results on model averaging and provide practical guidelines for applying model averaging.

Motivation and Objectives

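To address the communication bottleneck of parallel mini-batch SGD in large-scale non-convex optimization.
To investigate whether communication frequency can be reduced while preserving linear speed-up and without degrading the convergence rate.
To provide a theoretical justification for the empirical success of model averaging in deep learning.
To design a method that improves scalability by combining frequent local updates with infrequent synchronization.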

Proposed Method

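Each worker performs local SGD updates independently, with no communication between iterations.
At regular intervals, the workers exchange their models and average them to obtain a new shared initial point.
Each worker then restarts local SGD from the averaged model, so the workers are resynchronized once per averaging interval.
These periodic restarts preserve convergence without exchanging gradients at every iteration.
According to the theoretical analysis, the convergence rate under standard assumptions matches that of classical parallel mini-batch SGD.
Because communication occurs only at the model averaging steps, the total number of communication rounds is reduced by a factor of $O(T^{1/4})$ relative to a method that communicates at every iteration.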
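
The steps above can be sketched as a single-process simulation. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `parallel_restarted_sgd` and `noisy_grad`, the toy non-convex objective $f(x) = \sum_i x_i^2 / (1 + x_i^2)$, the Gaussian gradient noise, and all hyperparameter values are hypothetical choices made for the example. Workers are simulated sequentially, and the node average is an in-memory mean standing in for whatever collective operation a real cluster would use.

```python
import numpy as np


def parallel_restarted_sgd(stoch_grad, x0, num_workers, num_iters, avg_interval, lr, seed=0):
    """Simulate local SGD with periodic model averaging in one process.

    Returns the final node average and the number of averaging
    (communication) rounds that were performed.
    """
    rng = np.random.default_rng(seed)
    # Every worker starts from the same initial point.
    workers = np.tile(np.asarray(x0, dtype=float), (num_workers, 1))
    comm_rounds = 0
    for t in range(1, num_iters + 1):
        # Local step: each worker uses only its own stochastic gradient.
        for i in range(num_workers):
            workers[i] -= lr * stoch_grad(workers[i], rng)
        # Restart: every avg_interval iterations all workers adopt the node average.
        if t % avg_interval == 0:
            workers[:] = workers.mean(axis=0)
            comm_rounds += 1
    return workers.mean(axis=0), comm_rounds


def noisy_grad(x, rng):
    # Gradient of the toy objective sum(x^2 / (1 + x^2)) plus Gaussian noise
    # (both the objective and the noise level are assumptions for this sketch).
    return 2.0 * x / (1.0 + x**2) ** 2 + 0.1 * rng.standard_normal(x.shape)


x_final, rounds = parallel_restarted_sgd(
    noisy_grad, x0=[2.0, -1.5], num_workers=4, num_iters=400, avg_interval=10, lr=0.1
)
print("final average:", x_final, "communication rounds:", rounds)
```

With `avg_interval=1` the loop averages after every step, which corresponds to the fully synchronized baseline; larger intervals trade synchronization frequency for fewer communication rounds.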

Experimental Results

Research Questions

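RQ1: Can parallel SGD maintain the same convergence rate as classical parallel mini-batch SGD when communication frequency is reduced?
RQ2: What is the theoretical effect of periodic model averaging on convergence in non-convex optimization?
RQ3: How does communication frequency affect the scalability and speed-up of parallel SGD?
RQ4: Can the observed empirical success of model averaging be explained theoretically?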

Key Results

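The proposed Parallel Restarted SGD achieves the same convergence rate as classical parallel mini-batch SGD under standard non-convex optimization assumptions.
Communication overhead is reduced by a factor of $O(T^{1/4})$ relative to communicating at every iteration, where $T$ is the total number of iterations.
Linear speed-up with respect to the number of workers is maintained despite the lower communication frequency.
The theoretical analysis confirms that periodic model averaging with restarts is sufficient to guarantee convergence, which supports its use in deep learning training.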
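
As a rough worked example of the communication claim, suppose the averaging interval is $I = \Theta(T^{1/4})$. The symbol $I$ is introduced here only for illustration, and constants as well as any dependence on the number of workers are suppressed. Counting one communication round per averaging step gives:

$$
\text{rounds}_{\text{mini-batch}} = T, \qquad \text{rounds}_{\text{restarted}} = \frac{T}{I} = \Theta\!\left(T^{3/4}\right)
$$

For instance, with $T = 10^{8}$ iterations, $T^{1/4} = 100$, so on the order of $10^{6}$ averaging rounds would be needed instead of $10^{8}$ per-iteration exchanges.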

This review was generated by AI and reviewed by a human editor.