QUICK REVIEW

[Paper Review] Poseidon: An Efficient Communication Architecture for Distributed Deep Learning on GPU Clusters

Hao Zhang, Zeyu Zheng | arXiv (Cornell University) | 2017. 06. 10.

Advanced Neural Network Applications | References: 29 | Citations: 197

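One-line summary

Poseidon introduces a layered, wait-free, hybrid communication architecture for data-parallel distributed deep learning on GPUs. By overlapping computation with communication and choosing the best communication method for each layer, it achieves near-linear scaling.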

ABSTRACT

Deep learning models can take weeks to train on a single GPU-equipped machine, necessitating scaling out DL training to a GPU-cluster. However, current distributed DL implementations can scale poorly due to substantial parameter synchronization over the network, because the high throughput of GPUs allows more data batches to be processed per unit time than CPUs, leading to more frequent network synchronization. We present Poseidon, an efficient communication architecture for distributed DL on GPUs. Poseidon exploits the layered model structures in DL programs to overlap communication and computation, reducing bursty network communication. Moreover, Poseidon uses a hybrid communication scheme that optimizes the number of bytes required to synchronize each layer, according to layer properties and the number of machines. We show that Poseidon is applicable to different DL frameworks by plugging Poseidon into Caffe and TensorFlow. We show that Poseidon enables Caffe and TensorFlow to achieve 15.5x speed-up on 16 single-GPU machines, even with limited bandwidth (10GbE) and the challenging VGG19-22K network for image classification. Moreover, Poseidon-enabled TensorFlow achieves 31.5x speed-up with 32 single-GPU machines on Inception-V3, a 50% improvement over the open-source TensorFlow (20x speed-up).

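Motivation and Goals

Establishes the need for scalable distributed deep learning, given the bursty, high-volume parameter synchronization that occurs on GPU clusters.
Proposes Poseidon, which exploits the layer-wise structure of DL models to overlap computation with communication.
Introduces a hybrid communication scheme that selects the cheapest synchronization method for each layer.
Integrates Poseidon into Caffe and TensorFlow to demonstrate applicability across frameworks.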

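Proposed Method

Decomposes DL training into layer-wise computation and synchronization steps, so that forward/backward propagation can overlap with communication.
Introduces wait-free backpropagation (WFBP), which runs independent operations concurrently so that the gradient synchronization of one layer overlaps with the backward computation of the layers below it.
Proposes Hybrid Communication (HybComm), which selects the best synchronization method for each layer (parameter server (PS), sufficient factor broadcasting (SFB), or an Adam-like strategy) based on layer properties and cluster configuration.
Implements Poseidon as a system of three components (a coordinator, a KV store, and a client library) that exposes APIs for managing communication scheduling and transfer.
Integrates with existing frameworks (Caffe and TensorFlow) with minimal code changes and demonstrates near-linear scaling on up to 32 GPUs.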
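The overlap that WFBP relies on can be illustrated with a small sketch. This is not Poseidon's implementation: `wfbp_backward`, `compute_grad`, and `sync_grad` are hypothetical names, the gradients are stand-in lists, and a thread pool stands in for the network. The only point it makes is that a layer's synchronization is handed off as soon as its gradient exists, while the main thread moves on to the layer below.

```python
from concurrent.futures import ThreadPoolExecutor


def wfbp_backward(num_layers, compute_grad, sync_grad, max_workers=4):
    """Backward pass from the top layer down; each layer's gradient is
    synchronized in the background while lower layers keep computing."""
    pending = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for layer in reversed(range(num_layers)):
            grad = compute_grad(layer)  # computed on the main thread
            # Hand the gradient off without waiting for the sync to finish.
            pending[layer] = pool.submit(sync_grad, layer, grad)
    # Leaving the `with` block waits for every outstanding sync.
    return {layer: future.result() for layer, future in pending.items()}


compute_order = []


def compute_grad(layer):
    # Hypothetical stand-in for a real backward step.
    compute_order.append(layer)
    return [float(layer)] * 3


def sync_grad(layer, grad):
    # Hypothetical stand-in for network sync: average with a peer
    # whose gradient is all zeros.
    return [g / 2.0 for g in grad]


synced = wfbp_backward(3, compute_grad, sync_grad)
print(compute_order)  # layers are visited top-down
print(synced[2])
```

In a sequential scheme the synchronization of all layers would start only after the last gradient is computed, which is where the bursty traffic described above comes from.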

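Experimental Results

Research Questions

RQ1: How can DL training be restructured to hide communication cost and reduce network bursts on GPU clusters?
RQ2: Can a layer-wise hybrid communication strategy improve throughput over standard PS or SFB schemes across different bandwidths and model sizes?
RQ3: How close to linear throughput scaling can Poseidon get across multiple DL frameworks and large models?
RQ4: How does Poseidon affect convergence speed and overall training efficiency on representative CNNs and datasets?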

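Key Results

Estimated communication cost of synchronizing one M x N fully connected layer, with batch size K, P1 workers, and P2 servers:

Method	Server	Worker	Server & Worker
PS	2P1MN/P2	2MN	2MN(P1+P2-2)/P2
SFB	N/A	2K(P1-1)(M+N)	N/A
Adam (max)	P1MN+P1K(M+N)	K(M+N)+MN	(P1-1)(MN+KM+KN)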
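As a rough illustration of how these formulas could drive a per-layer choice, the sketch below compares the PS cost for a combined server and worker node with the SFB worker cost and picks the smaller one. It assumes the symbols mean an M x N fully connected layer, batch size K, P1 workers, and P2 servers. The function names and the rule that non-FC layers always use PS are assumptions made for this example, not Poseidon's API.

```python
def ps_cost(m, n, p1, p2):
    """PS cost for a node acting as both server and worker: 2MN(P1+P2-2)/P2."""
    return 2 * m * n * (p1 + p2 - 2) / p2


def sfb_cost(m, n, k, p1):
    """SFB cost per worker: 2K(P1-1)(M+N)."""
    return 2 * k * (p1 - 1) * (m + n)


def choose_sync(m, n, k, p1, p2, is_fc):
    """Pick the cheaper scheme for one layer (assumed selection rule)."""
    if not is_fc:
        return "PS"  # assumption: sufficient factors apply only to FC layers
    return "SFB" if sfb_cost(m, n, k, p1) <= ps_cost(m, n, p1, p2) else "PS"


# A 4096 x 4096 FC layer, small batch, 8 workers and 8 servers.
print(choose_sync(4096, 4096, k=32, p1=8, p2=8, is_fc=True))
# Same layer with a much larger batch on 16 workers and 16 servers.
print(choose_sync(4096, 4096, k=512, p1=16, p2=16, is_fc=True))
```

The SFB cost grows with batch size and worker count while the PS cost does not depend on K, so the preferred scheme for the same layer can flip as the cluster or batch size changes.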

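Poseidon achieves near-linear throughput scaling on up to 32 Titan X GPUs across multiple models and frameworks.
On 32 nodes, Poseidon-enabled TensorFlow achieves a 31.5x speed-up on Inception-V3, a 50% improvement over the speed-up of the original TensorFlow.
On 16 machines with limited 10GbE bandwidth, Poseidon sustains better scaling than PS-based parallelization on a large model (VGG19-22K).
Poseidon automatically specializes the communication method for each layer, reducing the network communication bottleneck and improving bandwidth utilization.
Compared with alternatives such as Adam's SF-based strategy or CNTK's 1-bit quantization, Poseidon delivers higher algorithmic throughput or more stable statistical performance.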
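"Near-linear" can be made concrete by dividing each reported speed-up by the number of machines. The sketch below does that arithmetic for the figures quoted in the abstract. `scaling_efficiency` is a helper introduced only for this illustration.

```python
def scaling_efficiency(speedup, num_machines):
    """Fraction of ideal linear speed-up that was actually achieved."""
    return speedup / num_machines


# Speed-ups and machine counts as quoted in the abstract.
reported = {
    "Poseidon, VGG19-22K, 16 machines": (15.5, 16),
    "Poseidon TensorFlow, Inception-V3, 32 machines": (31.5, 32),
    "Open-source TensorFlow, Inception-V3, 32 machines": (20.0, 32),
}

for name, (speedup, machines) in reported.items():
    print(f"{name}: {scaling_efficiency(speedup, machines):.1%}")
```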


This review was generated by AI and reviewed by a human editor.