QUICK REVIEW

[Paper Review] signSGD with Majority Vote is Communication Efficient And Byzantine Fault Tolerant

Jeremy Bernstein, Jiawei Zhao|arXiv (Cornell University)|Oct 12, 2018

Adversarial Robustness in Machine Learning | Cited by 28

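One-Sentence Summary

This paper proposes signSGD with majority vote, a communication-efficient and Byzantine fault-tolerant optimization method for distributed deep learning. By transmitting only gradient signs and aggregating them by majority vote, the method uses 32× less communication per iteration than full-precision distributed SGD, is proven to converge in the large-batch and mini-batch settings, remains robust when up to 50% of workers are adversarial, and cut training time by 25% relative to NCCL when training ResNet-50 on ImageNet with 15 machines.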

ABSTRACT

Training neural networks on large datasets can be accelerated by distributing the workload over a network of machines. As datasets grow ever larger, networks of hundreds or thousands of machines become economically viable. The time cost of communicating gradients limits the effectiveness of using such large machine counts, as may the increased chance of network faults. We explore a particularly simple algorithm for robust, communication-efficient learning---signSGD. Workers transmit only the sign of their gradient vector to a server, and the overall update is decided by a majority vote. This algorithm uses $32\times$ less communication per iteration than full-precision, distributed SGD. Under natural conditions verified by experiment, we prove that signSGD converges in the large and mini-batch settings, establishing convergence for a parameter regime of Adam as a byproduct. We model adversaries as those workers who may compute a stochastic gradient estimate and manipulate it, but may not coordinate with other adversaries. Aggregating sign gradients by majority vote means that no individual worker has too much power. We prove that unlike SGD, majority vote is robust when up to 50% of workers behave adversarially. On the practical side, we built our distributed training system in Pytorch. Benchmarking against the state of the art collective communications library (NCCL), our framework---with the parameter server housed entirely on one machine---led to a 25% reduction in time for training resnet50 on Imagenet when using 15 AWS p3.2xlarge machines.

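Motivation and Objectives

Address the communication bottleneck in large-scale distributed deep learning with models that have many parameters.
Develop an optimization method that stays robust when up to 50% of workers are Byzantine, meaning they may manipulate their gradients.
Reduce the communication cost of distributed training without sacrificing convergence or model accuracy.
Design a practical system that outperforms state-of-the-art collective communication libraries such as NCCL on real training workloads.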

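Proposed Method

Workers transmit only the sign of their gradient vector rather than the full-precision gradient, which reduces communication per iteration by 32×.
The parameter server aggregates the updates by majority vote over the gradient signs, so no single worker can dominate the update.
Under natural conditions verified by experiment, the method is proven to converge in both the large-batch and mini-batch settings, which also establishes convergence for a parameter regime of Adam as a byproduct.
Adversaries are modeled as workers that may compute a stochastic gradient estimate and manipulate it, but may not coordinate with other adversaries.
The system is implemented in PyTorch with a centralized parameter server housed on a single machine, which supports practical deployment and benchmarking.
The framework is evaluated by training ResNet-50 on ImageNet with 15 AWS p3.2xlarge machines and is benchmarked against NCCL.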
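
The scheme described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' PyTorch system: the helper names (`pack_signs`, `unpack_signs`, `majority_vote`, `signsgd_step`) are hypothetical, zero components and tied votes are mapped to +1, and no momentum or weight decay is modeled. It also shows where the 32× figure comes from: a float32 component costs 32 bits on the wire, while a sign costs 1 bit.

```python
import struct

def pack_signs(grad):
    """Worker side: encode each component as one bit (1 if g >= 0, else 0).

    Assumption: a zero component is sent as +1, since one bit cannot encode 0.
    """
    out = bytearray((len(grad) + 7) // 8)
    for i, g in enumerate(grad):
        if g >= 0:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)

def unpack_signs(payload, dim):
    """Server side: decode the bits back into +1 / -1 values."""
    return [1 if (payload[i // 8] >> (i % 8)) & 1 else -1 for i in range(dim)]

def majority_vote(payloads, dim):
    """Element-wise majority vote: the sign of the sum of the workers' signs."""
    totals = [0] * dim
    for payload in payloads:
        for i, s in enumerate(unpack_signs(payload, dim)):
            totals[i] += s
    # Assumption: ties (possible with an even worker count) count as +1.
    return [1 if t >= 0 else -1 for t in totals]

def signsgd_step(params, worker_grads, lr):
    """One round: every parameter moves by lr against the voted sign."""
    dim = len(params)
    vote = majority_vote([pack_signs(g) for g in worker_grads], dim)
    return [p - lr * v for p, v in zip(params, vote)]

# Toy round with three workers and eight parameters (made-up gradients).
params = [0.0] * 8
worker_grads = [
    [0.5, -1.0, 0.2, -0.3, 1.5, -0.1, 0.7, -2.0],
    [0.1, -0.4, -0.6, -0.2, 0.9, 0.3, 0.2, -1.1],
    [-0.2, -0.8, 0.4, 0.5, 1.1, -0.6, 0.3, -0.9],
]
new_params = signsgd_step(params, worker_grads, lr=0.01)

# Payload sizes for one worker's message.
full_precision_bytes = len(struct.pack("8f", *worker_grads[0]))  # float32
sign_bytes = len(pack_signs(worker_grads[0]))
print(new_params, full_precision_bytes, sign_bytes)
```

In this toy round the sign payload is 1 byte against 32 bytes for the float32 encoding of the same eight components, which matches the 32× ratio. Every update has the same magnitude `lr`, so no worker can enlarge its influence by sending larger values.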

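Experimental Results

Research Questions

RQ1: Can sign-based gradient aggregation achieve convergence for deep learning in both the large-batch and mini-batch settings?
RQ2: How robust is majority-vote aggregation to Byzantine workers in distributed training?
RQ3: How much can communication be reduced without degrading model accuracy or convergence speed?
RQ4: Can the simple signSGD with majority vote framework outperform optimized communication libraries such as NCCL in real training scenarios?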

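Key Findings

Compared with full-precision distributed SGD, signSGD with majority vote uses 32× less communication per iteration.
The method is proven to converge in both the large-batch and mini-batch settings, and as a byproduct this establishes convergence for a parameter regime of Adam.
The system is robust when up to 50% of workers behave adversarially, because majority vote prevents any individual worker from dominating the update.
On 15 AWS p3.2xlarge machines, the framework cut ResNet-50 training time on ImageNet by 25% compared with NCCL, even though the parameter server was housed entirely on one machine.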
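
The robustness finding can be illustrated on toy data. The sketch below is a simplified setting chosen for illustration, not the paper's experiment: all honest workers are assumed to agree on the gradient, and the hypothetical adversaries each negate and rescale their own gradient without coordinating. It contrasts averaging, where magnitudes count, with majority vote, where each worker contributes exactly one vote per component.

```python
def sign(x):
    """Map a number to +1 or -1 (assumption: zero counts as +1)."""
    return 1 if x >= 0 else -1

def mean_aggregate(grads):
    """Plain distributed SGD: average the full-precision gradients."""
    n = len(grads)
    return [sum(col) / n for col in zip(*grads)]

def vote_aggregate(grads):
    """Majority vote over signs: each worker adds exactly +1 or -1."""
    return [sign(sum(sign(g) for g in col)) for col in zip(*grads)]

true_grad = [1.0, -2.0, 0.5]

# Four honest workers out of seven (toy assumption: identical gradients).
honest = [list(true_grad) for _ in range(4)]

# Three hypothetical adversaries: each flips and inflates its gradient.
adversaries = [[-100.0 * g for g in true_grad] for _ in range(3)]

grads = honest + adversaries
true_direction = [sign(g) for g in true_grad]
mean_direction = [sign(m) for m in mean_aggregate(grads)]
vote_direction = vote_aggregate(grads)

print("true:", true_direction)
print("mean:", mean_direction)
print("vote:", vote_direction)
```

Here the averaged gradient points in the wrong direction on every component because the inflated adversarial values swamp the honest ones, while the vote still matches the true direction because the four honest workers outnumber the three adversaries. With noisy honest gradients the guarantee becomes statistical rather than exact, which is the case the paper's analysis addresses.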


This review was generated by AI and reviewed by human editors.