QUICK REVIEW

[Paper Review] TicTac: Accelerating Distributed Deep Learning with Communication Scheduling

Sayed Hadi Hashemi, Sangeetha Abdu Jyothi | arXiv (Cornell University) | 2018-03-08

Advanced Neural Network Applications | Citations: 66

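One-Line Summary

TicTac enforces a near-optimal order of parameter transfers to maximize computation-communication overlap, improving distributed deep learning throughput by up to 37.7% in inference and 19.2% in training while reducing the straggler effect.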

ABSTRACT

State-of-the-art deep learning systems rely on iterative distributed training to tackle the increasing complexity of models and input data. The iteration time in these communication-heavy systems depends on the computation time, communication time and the extent of overlap of computation and communication. In this work, we identify a shortcoming in systems with graph representation for computation, such as TensorFlow and PyTorch, that result in high variance in iteration time --- random order of received parameters across workers. We develop a system, TicTac, to improve the iteration time by fixing this issue in distributed deep learning with Parameter Servers while guaranteeing near-optimal overlap of communication and computation. TicTac identifies and enforces an order of network transfers which improves the iteration time using prioritization. Our system is implemented over TensorFlow and requires no changes to the model or developer inputs. TicTac improves the throughput by up to $37.7\%$ in inference and $19.2\%$ in training, while also reducing straggler effect by up to $2.3\times$. Our code is publicly available.

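Research Motivation and Goals

Identify the cause of iteration-time variance in DAG-based distributed deep learning with parameter servers (PS).
Develop a scheduling methodology that orders network transfers to maximize computation-communication overlap.
Build a lightweight control mechanism into TensorFlow that enforces the schedule without changes to the model.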

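Proposed Method

Model the scheduling problem as finding a near-optimal order of recv ops in each worker's DAG.
Propose two heuristics, TIC and TAC, that prioritize parameter transfers for better overlap.
Define a scheduling efficiency metric and two bounds (an upper bound U_Makespan and a lower bound L_Makespan) to quantify schedule quality.
Implement TIC and TAC in TensorFlow 1.8, computing priorities offline and enforcing them online at the sender side over gRPC.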
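
To make the effect of recv ordering concrete, here is a minimal toy simulation. It is not the paper's TIC or TAC algorithm: the three-layer chain, the op names, and all timings are hypothetical, and the model assumes a single network channel and a single compute device. The `efficiency` function uses an assumed form of the scheduling efficiency metric, with the upper bound taken as fully sequential execution and the lower bound as the total work on the busiest resource; the review itself does not spell out these definitions.

```python
from itertools import permutations

# Hypothetical toy workload: every number and name here is made up.
RECV_TIME = {"w1": 2.0, "w2": 2.0, "w3": 2.0}            # transfer time per parameter
COMPUTE_TIME = {"layer1": 3.0, "layer2": 3.0, "layer3": 3.0}
NEEDS = {"layer1": "w1", "layer2": "w2", "layer3": "w3"}  # compute op -> recv it waits on
LAYER_ORDER = ["layer1", "layer2", "layer3"]              # each layer follows the previous one


def makespan(recv_order):
    """Simulate one network channel and one compute device.

    Transfers run back to back in `recv_order`; a layer starts once its
    parameter has arrived and the previous layer has finished.
    """
    arrival, clock = {}, 0.0
    for name in recv_order:
        clock += RECV_TIME[name]
        arrival[name] = clock
    finish = 0.0
    for layer in LAYER_ORDER:
        start = max(finish, arrival[NEEDS[layer]])
        finish = start + COMPUTE_TIME[layer]
    return finish


def priority_order():
    """Simple stand-in heuristic: send first what is consumed first."""
    first_use = {NEEDS[layer]: i for i, layer in enumerate(LAYER_ORDER)}
    return sorted(RECV_TIME, key=first_use.__getitem__)


def efficiency(m):
    """Assumed metric: 1.0 at the lower bound, 0.0 at the upper bound."""
    comm, comp = sum(RECV_TIME.values()), sum(COMPUTE_TIME.values())
    upper = comm + comp       # no overlap at all
    lower = max(comm, comp)   # busiest resource fully utilized
    return (upper - m) / (upper - lower)


if __name__ == "__main__":
    for order in permutations(RECV_TIME):
        m = makespan(order)
        print(order, "makespan:", m, "efficiency:", round(efficiency(m), 3))
    print("prioritized order:", priority_order())
```

In this toy setup the makespan ranges from 11.0 (parameters sent in the order they are consumed) to 15.0 (parameters sent in reverse), which is the kind of order-dependent variation that a random recv order would expose across workers.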

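Experimental Results

Research Questions

RQ1: How does the order of parameter transfers affect iteration time and overlap in model replication with parameter servers?
RQ2: Can DAG-based scheduling (TIC/TAC) reduce stragglers and improve throughput in training and inference?
RQ3: What theoretical bounds and metrics can be used to evaluate scheduling efficiency in this setting?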

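Key Results

TicTac delivers throughput gains over the baselines of up to 37.7% in inference and up to 19.2% in training.
The straggler effect decreases by up to 2.3x as the transfer order becomes more predictable.
Gains grow as the cluster scales (more workers/PS), but the benefit of scheduling shrinks when communication dominates computation too heavily.
TIC performs nearly as well as TAC, suggesting that DAG-level information is sufficient for near-optimal scheduling.
No changes to the model or developer inputs are required; the system enforces the order at the network transfer layer.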


This review was generated by AI and reviewed by a human editor.