QUICK REVIEW

[Paper Review] Efficient and Robust Parallel DNN Training through Model Parallelism on Multi-GPU Platform

Chi-Chung Chen, Chia-Lin Yang | arXiv (Cornell University) | 2018-09-08

Advanced Neural Network Applications | References: 49 | Citations: 78

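One-line summary

This paper presents SpecTrain, a weight prediction technique for pipelined model parallelism that mitigates weight staleness in multi-GPU DNN training and achieves up to 8.91x speedup over data parallelism on a 4-GPU platform while maintaining comparable model accuracy.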

ABSTRACT

The training process of Deep Neural Network (DNN) is compute-intensive, often taking days to weeks to train a DNN model. Therefore, parallel execution of DNN training on GPUs is a widely adopted approach to speed up the process nowadays. Due to the implementation simplicity, data parallelism is currently the most commonly used parallelization method. Nonetheless, data parallelism suffers from excessive inter-GPU communication overhead due to frequent weight synchronization among GPUs. Another approach is pipelined model parallelism, which partitions a DNN model among GPUs, and processes multiple mini-batches concurrently. This approach can significantly reduce inter-GPU communication cost compared to data parallelism. However, pipelined model parallelism faces the weight staleness issue; that is, gradients are computed with stale weights, leading to training instability and accuracy loss. In this paper, we present a pipelined model parallel execution method that enables high GPU utilization while maintaining robust training accuracy via a novel weight prediction technique, SpecTrain. Experimental results show that our proposal achieves up to 8.91x speedup compared to data parallelism on a 4-GPU platform while maintaining comparable model accuracy.

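Motivation and objectives

Motivates and analyzes the limits of data parallelism caused by inter-GPU communication overhead on multi-GPU platforms.
Explores pipelined model parallelism as a way to raise GPU utilization and reduce inter-GPU data transfer.
Identifies and addresses the weight staleness problem inherent in pipelined model parallelism.
Introduces SpecTrain, a weight prediction technique based on smoothed gradients, to keep training robust and accurate.
Evaluates throughput and accuracy on a 4-GPU platform across a range of CNN, FCN, and RNN models.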

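Proposed method

Adopts a pipelined model parallel training framework derived from PipeDream and adds SpecTrain for weight prediction.
Uses the smoothed gradient from momentum SGD to estimate weight updates during the early pipeline stages.
Computes the predicted weights as $\hat{W}_{t+s} = W_t - s \cdot \eta \cdot v_{t-1}$, where $\eta$ is the learning rate and $v_{t-1}$ is the smoothed gradient.
Determines the version difference $s$ from the GPU index and from whether the mini-batch is in its forward or backward pass (the paper gives the exact formulas).
Compares Data Parallelism, Vanilla Model P., PipeDream (Weight Stashing), and SpecTrain on six models (CNN, FCN, RNN) using the CIFAR-10 and IMDb datasets.
Provides throughput, convergence, and accuracy analyses to demonstrate robustness and performance gains.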
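
The prediction rule above, W_t - s * eta * v_{t-1}, can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the function names, the decay factor `gamma`, and the exponential-moving-average form of the smoothed gradient are assumptions made for the example. Only the prediction formula itself comes from the summary.

```python
def update_smoothed_grad(v_prev, grad, gamma=0.9):
    """Assumed smoothing: exponential moving average of the gradients."""
    return [gamma * v + (1.0 - gamma) * g for v, g in zip(v_prev, grad)]


def sgd_step(weights, v, lr):
    """One weight update that follows the smoothed gradient: W <- W - lr * v."""
    return [w - lr * vi for w, vi in zip(weights, v)]


def predict_weights(weights, v_prev, lr, s):
    """SpecTrain-style prediction: W_hat_{t+s} = W_t - s * lr * v_{t-1}."""
    return [w - s * lr * vi for w, vi in zip(weights, v_prev)]


# Toy setup (all numbers are hypothetical).
weights = [1.0, 2.0]
v_prev = update_smoothed_grad([0.0, 0.0], [5.0, -10.0], gamma=0.9)
lr, s = 0.1, 3

# Predict the weights s versions ahead in a single step.
predicted = predict_weights(weights, v_prev, lr, s)

# Idealized reference: apply s real updates with an unchanged smoothed gradient.
actual = weights
for _ in range(s):
    actual = sgd_step(actual, v_prev, lr)

max_gap = max(abs(p - a) for p, a in zip(predicted, actual))
print(predicted, actual, max_gap)
```

The check covers only the idealized case in which the smoothed gradient does not change over the next s updates, where the prediction and the real weights coincide up to floating-point error. In actual training the smoothed gradient drifts between steps, so the prediction is an approximation.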

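Experimental results

Research questions

RQ1: How does data parallelism compare with model parallelism in inter-GPU communication and training efficiency on multi-GPU platforms?
RQ2: Can pipelined model parallelism achieve high GPU utilization without losing accuracy, and what mechanism mitigates weight staleness?
RQ3: Does SpecTrain's weight prediction maintain or improve model accuracy while providing higher throughput than data parallelism and existing pipelined approaches?
RQ4: What is the performance/accuracy trade-off of SpecTrain for CNN, FCN, and RNN models on multi-GPU systems?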

Key results

Model	Parallelization Scheme	Min. Train Loss	Min. Val Loss	Max. Val Accuracy
VGG16	Data P.	0.213271	0.794613	73.4776%
VGG16	Vanilla Model P.	0.204126	0.811148	73.0569%
VGG16	PipeDream	0.200585	0.811144	72.8365%
VGG16	SpecTrain	0.185566	0.796017	73.8081%
ResNet-152	Data P.	0.338845	0.892366	71.2139%
ResNet-152	Vanilla Model P.	0.254327	0.945588	70.3225%
ResNet-152	PipeDream	0.467527	0.979401	67.9287%
ResNet-152	SpecTrain	0.231724	0.924241	70.9936%
Inception v4	Data P.	0.804475	0.913155	69.1607%
Inception v4	Vanilla Model P.	0.858834	0.919470	68.6599%
Inception v4	PipeDream	0.864199	0.930320	68.5297%
Inception v4	SpecTrain	0.756939	0.874898	70.7732%
SNN	Data P.	0.431832	1.45124	50.9115%
SNN	Vanilla Model P.	0.766552	1.440200	50.6911%
SNN	PipeDream	0.810452	1.450239	50.4107%
SNN	SpecTrain	0.724402	1.406447	52.0733%
Transformer	Data P.	0.649287	0.660871	60.3265%
Transformer	Vanilla Model P.	0.655801	0.662379	60.0963%
Transformer	PipeDream	0.655877	0.662544	59.9760%
Transformer	SpecTrain	0.652193	0.662502	60.1362%
Residual LSTM	Data P.	0.347742	0.658583	66.0557%
Residual LSTM	Vanilla Model P.	0.459975	0.652651	65.0240%
Residual LSTM	PipeDream	0.467595	0.652948	64.8137%
Residual LSTM	SpecTrain	0.454813	0.652251	64.8137%
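
As a quick cross-check of the table above, the following Python sketch computes the gap in maximum validation accuracy between SpecTrain and Data P. for each model. The values are copied from the table. The variable and function names are illustrative, and the script is not part of the paper.

```python
# Max. Val Accuracy (%) copied from the table: (Data P., SpecTrain)
max_val_acc = {
    "VGG16": (73.4776, 73.8081),
    "ResNet-152": (71.2139, 70.9936),
    "Inception v4": (69.1607, 70.7732),
    "SNN": (50.9115, 52.0733),
    "Transformer": (60.3265, 60.1362),
    "Residual LSTM": (66.0557, 64.8137),
}


def accuracy_gaps(table):
    """Return SpecTrain minus Data P. accuracy, in percentage points."""
    return {model: round(spec - data, 4) for model, (data, spec) in table.items()}


gaps = accuracy_gaps(max_val_acc)
for model, gap in gaps.items():
    print(f"{model:14s} {gap:+.4f}")

# Models on which SpecTrain reports the higher maximum validation accuracy.
wins = [model for model, gap in gaps.items() if gap > 0]
print(f"SpecTrain is ahead on {len(wins)} of {len(gaps)} models")
```

On these figures SpecTrain is ahead on three of the six models (VGG16, Inception v4, SNN) and behind on the other three, with the largest deficit, about 1.24 percentage points, on Residual LSTM.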

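Model parallelism reduces inter-GPU communication substantially relative to data parallelism (13.4x less on average and up to 528x across the tested models), which improves efficiency.
Pipelined model parallelism without staleness mitigation can be unstable and can degrade accuracy, especially on larger models.
SpecTrain's weight prediction mitigates staleness, yielding learning curves and final accuracy similar to data parallelism on most workloads.
On a 4-GPU system, SpecTrain achieves up to 8.91x higher throughput than data parallelism on FCN/RNN models, with no accuracy loss relative to data parallelism in most cases.
Compared with PipeDream, SpecTrain avoids the extra memory overhead of the weight stashing queue while maintaining stability and accuracy.
On the Transformer and the other CNN/RNN models, SpecTrain stays robust, with training loss and validation accuracy close to data parallelism, whereas Vanilla Model P. and PipeDream can incur accuracy penalties on several models.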


This review was generated by AI and reviewed by a human editor.