QUICK REVIEW

[Paper Review] PyTorch Distributed: Experiences on Accelerating Data Parallel Training

Li Shen, Yanli Zhao | arXiv (Cornell University) | 2020-06-28

Software System Performance and Reliability | References: 26 | Citations: 116

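One-Line Summary

This paper presents the design, implementation, and evaluation of PyTorch DistributedDataParallel (DDP) for accelerating data-parallel training. Using gradient bucketing, overlap of computation with communication, and optional skipping of gradient synchronization, DDP achieves near-linear scaling at large GPU counts (256 GPUs in the paper's evaluation).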

ABSTRACT

This paper presents the design, implementation, and evaluation of the PyTorch distributed data parallel module. PyTorch is a widely-adopted scientific computing package used in deep learning research and applications. Recent advances in deep learning argue for the value of large datasets and large models, which necessitates the ability to scale out model training to more computational resources. Data parallelism has emerged as a popular solution for distributed training thanks to its straightforward principle and broad applicability. In general, the technique of distributed data parallelism replicates the model on every computational resource to generate gradients independently and then communicates those gradients at each iteration to keep model replicas consistent. Despite the conceptual simplicity of the technique, the subtle dependencies between computation and communication make it non-trivial to optimize the distributed training efficiency. As of v1.5, PyTorch natively provides several techniques to accelerate distributed data parallel, including bucketing gradients, overlapping computation with communication, and skipping gradient synchronization. Evaluations show that, when configured appropriately, the PyTorch distributed data parallel module attains near-linear scalability using 256 GPUs.

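Motivation and Objectives

Present the design and implementation of PyTorch's distributed data parallel module (DDP).
Show how gradient synchronization makes distributed training mathematically equivalent to local training.
Identify performance bottlenecks and present optimization techniques that maximize training throughput.
Share practical insights and measurements from internal and external deployments.
Highlight practical caveats of industry-scale distributed training and directions for future improvement.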
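
The equivalence claim above can be made concrete with a short derivation. This is a sketch under stated assumptions that are not spelled out in this review: all N replicas hold identical parameters, each processes a disjoint mini-batch of the same size m, and the loss is averaged over the mini-batch. The symbols g_k, B_k, and m are introduced here for illustration only.

```latex
% Local gradient on replica k, computed over its own mini-batch B_k of m samples
g_k = \frac{1}{m} \sum_{i \in B_k} \nabla_\theta \ell_i(\theta), \qquad k = 1, \dots, N

% Averaging the local gradients (AllReduce sum, then divide by N)
\bar{g} = \frac{1}{N} \sum_{k=1}^{N} g_k
        = \frac{1}{Nm} \sum_{i \in B_1 \cup \dots \cup B_N} \nabla_\theta \ell_i(\theta)
```

The right-hand side is the gradient a single process would compute on the combined batch of Nm samples, so every replica that starts from the same parameters and applies the same optimizer step to the averaged gradient stays consistent with local training on the larger batch. If the per-replica batch sizes differ, a plain average no longer matches exactly.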

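Proposed Method

Presents DDP as an nn.Module that wraps the local model, so integration is non-intrusive.
Describes gradient reduction, built on autograd hooks and AllReduce-based gradient averaging.
Introduces gradient bucketing, which groups small gradients into larger buckets to make AllReduce more efficient.
Describes overlapping computation with communication during gradient reduction to hide latency.
Discusses the no_sync context manager, which enables gradient accumulation across multiple iterations.
Details the ProcessGroup abstraction, which wraps collective communication backends such as NCCL, Gloo, and MPI and configures communication paths.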
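
The pieces listed above can be seen together in a small usage sketch. It is not taken from the paper: it assumes a single CPU process (world_size=1) with the Gloo backend and a file-based rendezvous so that it runs without GPUs or a launcher, and the function name run_ddp_demo, the toy nn.Linear model, and the random data are hypothetical. It wraps a local model in DDP, sets bucket_cap_mb, and uses no_sync to accumulate gradients over two micro-batches before a single synchronized backward pass.

```python
import copy
import os
import shutil
import tempfile

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def run_ddp_demo():
    """Run one optimizer step with DDP and with a plain local copy of the model."""
    # File-based rendezvous in a fresh temp dir (demo assumption: one process).
    init_dir = tempfile.mkdtemp()
    init_path = os.path.join(init_dir, "rendezvous")
    dist.init_process_group(
        backend="gloo",
        init_method=f"file://{init_path}",
        rank=0,
        world_size=1,
    )
    try:
        torch.manual_seed(0)
        local_model = nn.Linear(4, 2)
        reference = copy.deepcopy(local_model)  # plain local-training baseline

        # Wrapping is the only change to the model code; 25 MB is the
        # documented default bucket size, written out here for visibility.
        ddp_model = DDP(local_model, bucket_cap_mb=25)

        inputs = torch.randn(8, 4)
        targets = torch.randn(8, 2)
        loss_fn = nn.MSELoss()
        ddp_opt = torch.optim.SGD(ddp_model.parameters(), lr=0.1)
        ref_opt = torch.optim.SGD(reference.parameters(), lr=0.1)

        # First micro-batch: gradients accumulate locally, no AllReduce.
        with ddp_model.no_sync():
            loss_fn(ddp_model(inputs[:4]), targets[:4]).backward()
        # Second micro-batch: this backward pass triggers synchronization.
        loss_fn(ddp_model(inputs[4:]), targets[4:]).backward()
        ddp_opt.step()

        # Same two micro-batches on the unwrapped copy.
        loss_fn(reference(inputs[:4]), targets[:4]).backward()
        loss_fn(reference(inputs[4:]), targets[4:]).backward()
        ref_opt.step()

        ddp_params = [p.detach().clone() for p in ddp_model.module.parameters()]
        ref_params = [p.detach().clone() for p in reference.parameters()]
        return ddp_params, ref_params
    finally:
        dist.destroy_process_group()
        shutil.rmtree(init_dir, ignore_errors=True)
```

With a single process the averaged gradient equals the local one, so the two parameter sets should agree up to floating-point tolerance. In a real multi-GPU job each process would typically be started by a launcher, use backend="nccl", and pass device_ids to DDP.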

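Experimental Results

Research Questions

RQ1: How does PyTorch DDP stay non-intrusive to user code while guaranteeing mathematical equivalence with local training?
RQ2: How do optimizations such as bucketing, overlap, and no_sync best improve distributed data-parallel training performance?
RQ3: How do different communication backends (NCCL, Gloo, MPI) affect scalability and throughput?
RQ4: What are the practical caveats and failure modes when deploying DDP at scale?
RQ5: How do runtime configurations such as bucket size, process groups, and unused-parameter handling affect convergence and speed?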

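Key Findings

When configured appropriately, DDP achieves near-linear scaling on up to 256 GPUs.
Gradient bucketing and computation-communication overlap improve performance substantially, especially for models with many small parameters.
Skipping synchronization with no_sync reduces communication overhead while having only a minor effect on convergence speed.
Communication is the dominant component of latency, and bucket size strongly affects efficiency; a poorly chosen bucket size can cancel out the benefit.
The NCCL and Gloo backends show different performance characteristics, and bucket size and process group configuration matter for reaching the best throughput.
The experiments support DDP's broad applicability and impact in production workloads.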
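
To illustrate why bucket size matters, here is a simplified, framework-free sketch of greedy gradient bucketing. It is an assumption-laden model, not DDP's actual implementation: the function assign_buckets and its arguments are hypothetical, sizes are plain byte counts, and parameters are walked in reverse registration order on the assumption that backward tends to produce gradients roughly in that order.

```python
def assign_buckets(param_sizes_bytes, bucket_cap_bytes):
    """Greedily group parameter indices into buckets, last parameter first.

    A bucket is closed as soon as adding the next gradient would exceed the
    cap. A single gradient larger than the cap still gets its own bucket.
    """
    buckets = []
    current = []
    current_bytes = 0
    for index in reversed(range(len(param_sizes_bytes))):
        size = param_sizes_bytes[index]
        if current and current_bytes + size > bucket_cap_bytes:
            buckets.append(current)
            current, current_bytes = [], 0
        current.append(index)
        current_bytes += size
    if current:
        buckets.append(current)
    return buckets


# Hypothetical gradient sizes in bytes for five parameters.
sizes = [40, 10, 10, 30, 20]
medium = assign_buckets(sizes, bucket_cap_bytes=50)
single = assign_buckets(sizes, bucket_cap_bytes=1000)
tiny = assign_buckets(sizes, bucket_cap_bytes=1)
```

In this model each bucket corresponds to one AllReduce call. A tiny cap yields one call per gradient, which pays per-call overhead many times. A very large cap yields a single bucket that cannot be reduced until the last gradient is ready, which leaves little room to overlap communication with the rest of the backward pass. The intermediate setting trades off between the two.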


This review was generated by AI and reviewed by a human editor.