QUICK REVIEW

[Paper Review] TicTac: Accelerating Distributed Deep Learning with Communication Scheduling

Sayed Hadi Hashemi, Sangeetha Abdu Jyothi|arXiv (Cornell University)|Mar 8, 2018

Advanced Neural Network Applications | Cited by 66

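One-Sentence Summary

TicTac improves distributed deep learning throughput by enforcing a near-optimal order of parameter transfers that maximizes the overlap of computation and communication, with gains of up to 37.7% in inference and 19.2% in training, while also reducing the straggler effect.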

ABSTRACT

State-of-the-art deep learning systems rely on iterative distributed training to tackle the increasing complexity of models and input data. The iteration time in these communication-heavy systems depends on the computation time, the communication time, and the extent of overlap of computation and communication. In this work, we identify a shortcoming in systems with graph representation for computation, such as TensorFlow and PyTorch, that results in high variance in iteration time: random order of received parameters across workers. We develop a system, TicTac, to improve the iteration time by fixing this issue in distributed deep learning with Parameter Servers while guaranteeing near-optimal overlap of communication and computation. TicTac identifies and enforces an order of network transfers which improves the iteration time using prioritization. Our system is implemented over TensorFlow and requires no changes to the model or developer inputs. TicTac improves the throughput by up to 37.7% in inference and 19.2% in training, while also reducing straggler effect by up to 2.3×. Our code is publicly available.

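Motivation and Goals

Identify the source of iteration-time variance in DAG-based distributed DL with Parameter Servers (PS).
Propose a scheduling approach that orders network transfers to maximize the overlap of computation and communication.
Provide a lightweight enforcement mechanism inside TensorFlow that applies the schedule without requiring changes to the model.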
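
To make the effect of transfer order concrete, here is a minimal toy simulation. It is not taken from the paper: it assumes a single serial network link, a simple chain of layers, and made-up timings, and the function name `iteration_time` is hypothetical.

```python
def iteration_time(order, transfer, compute):
    """Makespan of a chain of layers fed by one serial network link.

    order:    parameter indices in the order they are transferred
    transfer: transfer[i] = time to receive parameter i
    compute:  compute[i]  = time to run layer i
              (layer i needs parameter i and the output of layer i-1)
    """
    # Parameters arrive one after another on the single link.
    arrival = {}
    clock = 0.0
    for i in order:
        clock += transfer[i]
        arrival[i] = clock

    # Each layer starts once its parameter has arrived and the
    # previous layer has finished.
    done = 0.0
    for i in range(len(compute)):
        start = max(done, arrival[i])
        done = start + compute[i]
    return done


# Illustrative timings only (arbitrary units).
transfer = [1.0, 1.0, 1.0]
compute = [2.0, 2.0, 2.0]

in_order = iteration_time([0, 1, 2], transfer, compute)        # 7.0
reversed_order = iteration_time([2, 1, 0], transfer, compute)  # 9.0
print(in_order, reversed_order)
```

With identical transfer and compute costs, delivering the first layer's parameter last delays the whole chain, which is the kind of order-dependent variance the goals above target.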

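Proposed Method

Formulate scheduling as finding a near-optimal ordering of the recv operations in each worker's DAG.
Propose two heuristics, TIC and TAC, that prioritize parameter transfers to achieve better overlap.
Define a scheduling efficiency metric and two bounds (an upper bound U_Makespan and a lower bound L_Makespan) to quantify schedule quality.
Implement TIC and TAC in TensorFlow 1.8: priorities are computed offline and enforced online at the sender over gRPC.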
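
This summary does not spell out the TIC or TAC algorithms, so the following is only a simplified sketch of what a DAG-only priority assignment could look like. It assumes every recv costs the same, and it ranks each recv by the smallest number of transfers needed to unblock any compute op that depends on it. The function `recv_priorities` and the op names are hypothetical and should not be read as the paper's method.

```python
def recv_priorities(deps, recv_ops):
    """Assign a priority to each recv op using only DAG structure.

    deps:     maps every op name to the list of ops it directly depends on
    recv_ops: names of the ops that are network receives
    Returns a dict recv -> priority, where a lower number means "send earlier".
    """
    recv_ops = set(recv_ops)
    cache = {}

    def recv_closure(op):
        # All recv ops that `op` transitively depends on.
        if op in cache:
            return cache[op]
        needed = {op} if op in recv_ops else set()
        for parent in deps.get(op, []):
            needed |= recv_closure(parent)
        cache[op] = needed
        return needed

    priority = {r: float("inf") for r in recv_ops}
    for op in deps:
        if op in recv_ops:
            continue
        needed = recv_closure(op)
        # A recv is as urgent as the cheapest compute op it helps unblock.
        for r in needed:
            priority[r] = min(priority[r], len(needed))
    return priority


# Hypothetical three-layer worker DAG.
deps = {
    "recv_w1": [], "recv_w2": [], "recv_w3": [],
    "layer1": ["recv_w1"],
    "layer2": ["layer1", "recv_w2"],
    "layer3": ["layer2", "recv_w3"],
}
priority = recv_priorities(deps, ["recv_w1", "recv_w2", "recv_w3"])
send_order = sorted(priority, key=lambda r: (priority[r], r))
print(send_order)
```

In this example the sender would transmit `recv_w1` first, since it alone unblocks `layer1`. A timing-aware variant would need measured op durations, which this sketch deliberately leaves out.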

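Experimental Results

Research Questions

RQ1: How does the order of parameter transfers affect iteration time and overlap for Model Replica with Parameter Servers?
RQ2: Can DAG-based scheduling (TIC/TAC) reduce stragglers and improve throughput in training and inference?
RQ3: What theoretical bounds and metrics evaluate scheduling efficiency in this setting?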
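
The summary names an efficiency metric and the bounds U_Makespan and L_Makespan without giving formulas. The sketch below assumes one plausible form: the upper bound is a fully serial execution (sum of all op times), the lower bound is perfect overlap (the busiest resource's total), and efficiency places the measured makespan between the two. These definitions and the function names are assumptions for illustration and should be checked against the paper.

```python
def makespan_bounds(op_times):
    """op_times maps a resource name to the list of its op durations.

    Assumed definitions (not confirmed by this summary):
      upper = every op runs serially, no overlap at all
      lower = resources overlap perfectly, so the busiest one dominates
    """
    totals = [sum(times) for times in op_times.values()]
    return sum(totals), max(totals)


def scheduling_efficiency(measured, upper, lower):
    """1.0 when the measured makespan hits the lower bound, 0.0 at the upper."""
    if upper == lower:
        return 1.0  # nothing to overlap, so any schedule is as good as it gets
    return (upper - measured) / (upper - lower)


# Illustrative timings only (arbitrary units).
op_times = {"network": [1, 1, 1], "compute": [2, 2, 2]}
upper, lower = makespan_bounds(op_times)           # 9 and 6
good = scheduling_efficiency(7, upper, lower)      # about 0.67
worst = scheduling_efficiency(9, upper, lower)     # 0.0
print(upper, lower, good, worst)
```

Normalizing against both bounds lets schedules be compared across models whose absolute iteration times differ.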

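Key Findings

Compared with the baseline, TicTac improves throughput by up to 37.7% in inference and up to 19.2% in training.
The straggler effect is reduced by up to 2.3×, owing to a more predictable transfer order.
Gains increase as the deployment scales up (more workers/PS), but scheduling benefits diminish when communication is too large relative to computation.
TIC performs close to TAC, indicating that DAG-level information is usually sufficient for near-optimal scheduling.
No changes to the model or developer inputs are required; the system enforces the order at the network transfer layer.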

This review was generated by AI and reviewed by human editors.