QUICK REVIEW

[Paper Review] Adaptive Communication Strategies to Achieve the Best Error-Runtime Trade-off in Local-Update SGD

Jianyu Wang, Gauri Joshi|arXiv (Cornell University)|Oct 18, 2018

Distributed and Parallel Computing Systems|Cited by 119

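One-Sentence Summary

AdaComm is an adaptive communication strategy for local-update SGD that starts with infrequent communication and averaging, then gradually increases the communication frequency, achieving fast error convergence together with a low final error; experiments show up to a 3x runtime speedup over fully synchronous SGD while reaching the same final training loss.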

ABSTRACT

Large-scale machine learning training, in particular distributed stochastic gradient descent, needs to be robust to inherent system variability such as node straggling and random communication delays. This work considers a distributed training framework where each worker node is allowed to perform local model updates and the resulting models are averaged periodically. We analyze the true speed of error convergence with respect to wall-clock time (instead of the number of iterations), and analyze how it is affected by the frequency of averaging. The main contribution is the design of AdaComm, an adaptive communication strategy that starts with infrequent averaging to save communication delay and improve convergence speed, and then increases the communication frequency in order to achieve a low error floor. Rigorous experiments on training deep neural networks show that AdaComm can take $3 \times$ less time than fully synchronous SGD, and still reach the same final training loss.

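Motivation and Objectives

Motivate and analyze how distributed SGD with local updates and periodic averaging converges in error with respect to wall-clock time.
Quantify how the averaging period tau (the number of local updates between averaging steps) affects the per-iteration runtime and the error floor.
Develop an adaptive communication scheme (AdaComm) that optimizes this trade-off in practical training.
Provide theoretical convergence insights for periodic-averaging SGD (PASGD) with a variable tau and learning rate.
Demonstrate the practical benefits of AdaComm for deep convolutional neural networks under realistic system variability.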

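Proposed Method

Model the per-iteration runtime of PASGD when both local computation times and communication delays are random.
Derive an error-runtime bound for PASGD as a function of tau, which yields an expression for the optimal tau.
Propose AdaComm, which divides training into time intervals and selects tau within each interval to minimize the error given by the bound.
Give a practical tau update rule that does not require unknown constants (using a loss-ratio heuristic).
Extend the analysis to settings with a decaying tau and an adaptive learning rate.
Validate AdaComm experimentally with VGG-16 and ResNet-50 on the CIFAR-10/100 datasets.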
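
The interval-based scheme and the loss-ratio heuristic listed above can be sketched on a toy problem. This is a minimal illustration, not the authors' implementation: the summary does not state the exact update formula, so the rule `tau = ceil(sqrt(loss_now / loss0) * tau0)` is an assumed form of the heuristic, and the quadratic objective, the noise model, and all names (`adacomm_tau`, `pasgd_adacomm`, `interval`) are hypothetical.

```python
import math
import numpy as np


def adacomm_tau(tau0, loss0, loss_now):
    """Assumed loss-ratio rule: shrink tau as the training loss falls.

    tau = ceil(sqrt(loss_now / loss0) * tau0), never below 1.
    """
    return max(1, math.ceil(math.sqrt(loss_now / loss0) * tau0))


def pasgd_adacomm(num_workers=4, dim=10, tau0=16, interval=64,
                  num_intervals=6, lr=0.01, noise=0.5, seed=0):
    """Toy PASGD on F(x) = 0.5 * ||x||^2 with noisy gradients.

    Returns a list of (tau, loss) pairs recorded at the start of each interval.
    """
    rng = np.random.default_rng(seed)
    x = np.full(dim, 5.0)                      # shared starting model

    def loss(v):
        return 0.5 * float(v @ v)

    loss0, tau, history = loss(x), tau0, []
    for _ in range(num_intervals):
        history.append((tau, loss(x)))
        local = np.tile(x, (num_workers, 1))   # one local copy per worker
        for step in range(1, interval + 1):
            # Gradient of the toy objective is x itself; add Gaussian noise.
            grads = local + noise * rng.standard_normal(local.shape)
            local -= lr * grads                # one local SGD step per worker
            if step % tau == 0:                # periodic averaging every tau steps
                local[:] = local.mean(axis=0)
        x = local.mean(axis=0)                 # synchronize at the interval boundary
        tau = adacomm_tau(tau0, loss0, loss(x))  # pick tau for the next interval
    return history


if __name__ == "__main__":
    for tau, value in pasgd_adacomm():
        print(f"tau={tau:2d}  loss={value:.4f}")
```

In this sketch tau is re-selected only at interval boundaries and needs nothing but the observed loss, which mirrors the point that the practical rule avoids unknown constants.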

Experimental Results

Research Questions

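RQ1: How does the number of local updates tau affect the true convergence speed of PASGD in wall-clock time?
RQ2: Does adaptive communication that changes tau over time achieve a better error-runtime trade-off than a fixed tau?
RQ3: As a function of time, data, and system delays, which tau minimizes the bound on the gradient norm?
RQ4: How can AdaComm be implemented with a practical heuristic when constants such as the Lipschitz constant or the gradient variance bound are unknown?
RQ5: Does the adaptive communication strategy generalize across network architectures and learning rate schedules?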

Key Findings

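AdaComm converges faster in wall-clock time by starting with a large tau and gradually decreasing it as training proceeds.
The theoretical analysis shows an error-runtime trade-off: a larger tau lowers the per-iteration runtime but can raise the error floor; AdaComm mitigates this by adjusting tau over time.
Experiments with VGG-16 and ResNet-50 show that AdaComm achieves up to about a 3x runtime speedup over fully synchronous SGD while reaching the same final training loss (with better test accuracy in some settings).
A closed-form expression, under simplified constants, identifies the optimal tau* and guides the adaptation of the communication frequency in practice.
AdaComm can be combined with learning rate schedules and applies to related communication-efficient SGD frameworks.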
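
The runtime side of the trade-off described in these findings can be illustrated with a small Monte Carlo sketch. It assumes that each averaging round ends when the slowest worker finishes its tau local steps and then pays one communication delay. The exponential compute times, the constant delay, and the function name `mean_runtime_per_iteration` are assumptions for illustration and are not taken from the paper.

```python
import numpy as np


def mean_runtime_per_iteration(tau, num_workers=8, mean_compute=1.0,
                               comm_delay=1.0, rounds=5000, seed=0):
    """Estimate the average wall-clock time per local iteration.

    Assumptions (illustrative only): i.i.d. exponential compute times per
    local step, a constant communication delay per averaging step, and
    workers that wait for the slowest one before averaging.
    """
    rng = np.random.default_rng(seed)
    # compute[r, i, k]: time worker i spends on local step k of round r
    compute = rng.exponential(mean_compute, size=(rounds, num_workers, tau))
    # A round lasts until the slowest worker finishes, plus one averaging delay.
    round_time = compute.sum(axis=2).max(axis=1) + comm_delay
    return float(round_time.mean() / tau)


if __name__ == "__main__":
    for tau in (1, 4, 16):
        t = mean_runtime_per_iteration(tau)
        print(f"tau={tau:2d}  time per iteration ~ {t:.2f}")
```

Here tau = 1 corresponds to fully synchronous SGD. Under these assumptions a larger tau spreads the communication delay over more local steps and also smooths out stragglers, so the time per iteration falls; the sketch says nothing about the error floor, which is the other half of the trade-off.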


This review was generated by AI and reviewed by a human editor.