QUICK REVIEW

[Paper Review] Horovod: fast and easy distributed deep learning in TensorFlow

Alexander Sergeev, Mike Del Balso | arXiv (Cornell University) | 2018-02-15

Advanced Neural Network Applications | References: 5 | Citations: 522

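One-Line Summary

Horovod is a ring-allreduce-based distributed training library for TensorFlow that greatly reduces the code changes required, improves scaling, and enables near-linear speedup across multiple GPUs. It ships as a standalone Python package with NCCL-based communication and tools for debugging and profiling.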

ABSTRACT

Training modern deep learning models requires large amounts of computation, often provided by GPUs. Scaling computation from one GPU to many can enable much faster training and research progress but entails two complications. First, the training library must support inter-GPU communication. Depending on the particular methods employed, this communication may entail anywhere from negligible to significant overhead. Second, the user must modify his or her training code to take advantage of inter-GPU communication. Depending on the training library's API, the modification required may be either significant or minimal. Existing methods for enabling multi-GPU training under the TensorFlow library entail non-negligible communication overhead and require users to heavily modify their model-building code, leading many researchers to avoid the whole mess and stick with slower single-GPU training. In this paper we introduce Horovod, an open source library that improves on both obstructions to scaling: it employs efficient inter-GPU communication via ring reduction and requires only a few lines of modification to user code, enabling faster, easier distributed training in TensorFlow. Horovod is available under the Apache 2.0 license at https://github.com/uber/horovod

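Motivation and Goals

Motivates the need for scalable distributed TensorFlow training at Uber and identifies two main obstacles: inter-GPU communication overhead and the complexity of the required user code changes.
Proposes a ring-allreduce-based approach to address both scalability and simplicity.
Describes Horovod's architecture, its integration with TensorFlow/Keras, and an API designed to minimize user edits.
Demonstrates a practical tool (Horovod Timeline) and an optimization (Tensor Fusion) that improve usability and performance.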

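Proposed Method

Adopted the ring-allreduce from Baidu's draft implementation and replaced its communication layer with NVIDIA NCCL for optimized inter-GPU and inter-machine communication.
Implemented Horovod as a standalone Python package, decoupling it from any specific TensorFlow release.
Extended support to models that fit on a single server, potentially across multiple GPUs.
Introduced a broadcast initialization hook to ensure that all workers start from a consistent state.
Provided a minimal API: the user wraps the optimizer in hvd.DistributedOptimizer and broadcasts variables from rank 0.
Introduced Horovod Timeline for cross-node profiling and debugging.
Developed Tensor Fusion, which fuses small tensors into a larger buffer before the allreduce, improving throughput on TCP networks.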
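
To make the ring-allreduce and Tensor Fusion ideas above concrete, here is a minimal single-process sketch in plain Python. It is not Horovod's or NCCL's implementation: the function names `ring_allreduce` and `fused_allreduce` are hypothetical, workers are simulated as list entries rather than processes, and the fusion step ignores details such as a buffer size limit or data-type matching. It only illustrates the two phases of a ring-allreduce (scatter-reduce, then allgather) and the idea of packing several small tensors into one buffer so that a single allreduce replaces many.

```python
from typing import List


def ring_allreduce(worker_data: List[List[float]]) -> List[List[float]]:
    """Simulate a sum ring-allreduce over n workers (illustrative sketch only).

    worker_data[r] is worker r's local vector; all vectors share one length.
    Returns one vector per worker, each holding the element-wise sum.
    """
    n = len(worker_data)
    length = len(worker_data[0])
    # Chunk c covers indices bounds[c] .. bounds[c + 1] - 1.
    bounds = [c * length // n for c in range(n + 1)]
    data = [list(v) for v in worker_data]  # copy so inputs stay untouched

    def chunk(c: int) -> range:
        c %= n
        return range(bounds[c], bounds[c + 1])

    # Phase 1 (scatter-reduce): in step s, worker r sends chunk (r - s) to its
    # right neighbour, which adds it to its own copy. After n - 1 steps,
    # worker r holds the fully reduced chunk (r + 1) mod n.
    for step in range(n - 1):
        # Snapshot outgoing values so all sends in a step are simultaneous.
        outgoing = [[data[r][i] for i in chunk(r - step)] for r in range(n)]
        for r in range(n):
            dst = (r + 1) % n
            for i, value in zip(chunk(r - step), outgoing[r]):
                data[dst][i] += value

    # Phase 2 (allgather): in step s, worker r forwards the completed chunk
    # (r + 1 - s) to its right neighbour, which overwrites its own copy.
    for step in range(n - 1):
        outgoing = [[data[r][i] for i in chunk(r + 1 - step)] for r in range(n)]
        for r in range(n):
            dst = (r + 1) % n
            for i, value in zip(chunk(r + 1 - step), outgoing[r]):
                data[dst][i] = value

    return data


def fused_allreduce(worker_tensors: List[List[List[float]]]) -> List[List[List[float]]]:
    """Fuse each worker's small tensors into one buffer, allreduce once, unpack.

    worker_tensors[r] is worker r's list of flat tensors; shapes must match
    across workers. Hypothetical helper, not part of any real library.
    """
    sizes = [len(t) for t in worker_tensors[0]]
    # Copy all small tensors into one contiguous buffer per worker.
    buffers = [[x for t in tensors for x in t] for tensors in worker_tensors]
    reduced = ring_allreduce(buffers)
    # Copy the reduced values back out into tensors of the original sizes.
    results = []
    for buf in reduced:
        tensors, offset = [], 0
        for size in sizes:
            tensors.append(buf[offset:offset + size])
            offset += size
        results.append(tensors)
    return results


# Three simulated workers, each with a local gradient vector of length 4.
grads = [
    [1.0, 2.0, 3.0, 4.0],
    [10.0, 20.0, 30.0, 40.0],
    [100.0, 200.0, 300.0, 400.0],
]
summed = ring_allreduce(grads)
print(summed[0])  # every worker ends with [111.0, 222.0, 333.0, 444.0]
```

In this sketch each worker sends and receives only one chunk per step, to and from its ring neighbours, over 2(n - 1) steps in total. To average gradients, as in data-parallel training, each worker would divide the summed result by the number of workers.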

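Experimental Results

Research Questions

RQ1: Can ring-allreduce-based communication deliver near-linear scaling for TensorFlow training across multiple GPUs and machines?
RQ2: How much code modification is needed to convert a single-GPU TensorFlow program into a distributed Horovod program?
RQ3: Which practical tools and optimizations (such as Tensor Fusion and Timeline) improve usability and performance in real workflows?
RQ4: What are Horovod's performance characteristics on TCP versus RDMA networks, and how does performance vary across models with different parameter counts?
RQ5: How does Horovod compare with standard distributed TensorFlow in efficiency and resource utilization?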

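Key Results

Horovod scaled substantially better than standard distributed TensorFlow, with up to 88% scaling efficiency reported in benchmarks.
On multiple GPUs, training with Horovod can be nearly twice as fast as with standard distributed TensorFlow.
RDMA networking yields only a marginal gain (an additional 3-4%) for some models, and pushes scaling efficiency above 90% on certain architectures.
Tensor Fusion reduces communication overhead on models with many small tensor operations, yielding up to a 65% improvement.
Horovod reduces setup and integration effort to a few lines of code changes, easing adoption across teams.
Horovod Timeline provides high-level profiling, viewable in a browser, to aid debugging and performance analysis.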
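
The efficiency figures above can be read with a short worked calculation. The sketch below assumes the usual definition of scaling efficiency, measured throughput on N GPUs divided by N times the single-GPU throughput. The function name `scaling_efficiency` and the throughput numbers are hypothetical, chosen only so the arithmetic lands on 88%; they are not measurements from the paper.

```python
def scaling_efficiency(throughput_n: float, throughput_1: float, n_gpus: int) -> float:
    """Return measured throughput as a fraction of ideal linear scaling.

    Assumes the common definition: ideal throughput is n_gpus times the
    single-GPU throughput. Hypothetical helper for illustration.
    """
    ideal = throughput_1 * n_gpus
    return throughput_n / ideal


# Hypothetical numbers (not from the paper): one GPU processes 100 images/s,
# and 128 GPUs together process 11264 images/s.
eff = scaling_efficiency(throughput_n=11264.0, throughput_1=100.0, n_gpus=128)
print(f"{eff:.0%}")  # 88%
```

Under this definition, 88% efficiency on 128 GPUs corresponds to roughly the throughput of 113 ideally scaled GPUs.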


This review was generated by AI and reviewed by a human editor.