QUICK REVIEW

[Paper Review] signSGD with Majority Vote is Communication Efficient And Fault Tolerant

Jeremy Bernstein, Jiawei Zhao|arXiv (Cornell University)|Oct 11, 2018

Retinal Imaging and Analysis | Cited by 125

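One-Sentence Summary

This paper proposes signSGD with majority vote, a robust distributed optimization method that uses 1-bit communication, converges in the presence of gradient noise, tolerates up to 50% Byzantine workers, and delivers empirical speedups on large-scale tasks.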

ABSTRACT

Training neural networks on large datasets can be accelerated by distributing the workload over a network of machines. As datasets grow ever larger, networks of hundreds or thousands of machines become economically viable. The time cost of communicating gradients limits the effectiveness of using such large machine counts, as may the increased chance of network faults. We explore a particularly simple algorithm for robust, communication-efficient learning---signSGD. Workers transmit only the sign of their gradient vector to a server, and the overall update is decided by a majority vote. This algorithm uses $32\times$ less communication per iteration than full-precision, distributed SGD. Under natural conditions verified by experiment, we prove that signSGD converges in the large and mini-batch settings, establishing convergence for a parameter regime of Adam as a byproduct. Aggregating sign gradients by majority vote means that no individual worker has too much power. We prove that unlike SGD, majority vote is robust when up to 50% of workers behave adversarially. The class of adversaries we consider includes as special cases those that invert or randomise their gradient estimate. On the practical side, we built our distributed training system in Pytorch. Benchmarking against the state of the art collective communications library (NCCL), our framework---with the parameter server housed entirely on one machine---led to a 25% reduction in time for training resnet50 on Imagenet when using 15 AWS p3.2xlarge machines.
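The 32× figure in the abstract follows from sending one bit per gradient coordinate instead of a 32-bit float. Below is a minimal Python sketch of that arithmetic. It assumes a simple LSB-first bit layout and encodes a zero gradient as a positive sign. The names `pack_signs`, `unpack_signs`, and `compression_ratio` are illustrative and are not taken from the authors' PyTorch system.

```python
def pack_signs(grad):
    """Pack the sign of each coordinate into one bit (1 = non-negative)."""
    packed = bytearray((len(grad) + 7) // 8)
    for i, g in enumerate(grad):
        if g >= 0:  # assumption: a zero gradient is encoded as +1
            packed[i // 8] |= 1 << (i % 8)
    return bytes(packed)


def unpack_signs(packed, n):
    """Recover the first n signs as +1 / -1 values."""
    return [1 if (packed[i // 8] >> (i % 8)) & 1 else -1 for i in range(n)]


def compression_ratio(n):
    """Bytes needed for n float32 values divided by bytes for n packed signs."""
    float32_bytes = 4 * n
    sign_bytes = len(pack_signs([0.0] * n))
    return float32_bytes / sign_bytes


grad = [0.3, -1.2, 0.0, -0.01, 2.5, -7.0, 0.4, 0.9, -0.2, 1.1]
payload = pack_signs(grad)
signs = unpack_signs(payload, len(grad))
ratio = compression_ratio(1024)
```

The ratio is exactly 32 only when the vector length is a multiple of 8; otherwise the last byte is partially used and the ratio is slightly lower.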

Research Motivation and Goals

Motivate the need for fast, robust, and communication-efficient distributed training for large neural networks.
Propose a simple gradient-sign based update with majority vote to compress communication and increase fault tolerance.
Provide convergence theory under realistic assumptions for mini-batch training and a Byzantine fault model.
Demonstrate practical benefits and trade-offs through large-scale experiments on ImageNet and language models.

Proposed Method

Workers compute mini-batch gradients and send only the sign of their momentum to a parameter server.
The server aggregates signs via majority vote and broadcasts the majority sign back to workers.
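Each worker keeps its own momentum and applies the update x <- x - eta * (sign(V) + lambda * x), where V is the sum of the workers' momentum signs, eta is the learning rate, and lambda is the weight decay.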
Provide a theoretical convergence analysis under non-convex objectives with unimodal, symmetric gradient noise (Assumptions 1-4).
Show robustness to blind multiplicative adversaries with fraction alpha < 1/2 of workers being adversarial (Theorem 2).
Implement Signum in PyTorch with 1-bit tensor compression and compare to NCCL in large-scale experiments.
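The steps listed above can be sketched as a single-process simulation. This is a minimal sketch under stated assumptions, not the authors' PyTorch implementation: workers are simulated in a loop, the objective is a toy quadratic with Gaussian gradient noise, sign(0) is taken as +1, and the names `signum_majority_vote` and `noisy_quadratic_grad` as well as all hyperparameter values are illustrative.

```python
import random


def sign(v):
    # Assumption: sign(0) is treated as +1 so every vote is one bit.
    return 1.0 if v >= 0 else -1.0


def signum_majority_vote(grad_fn, x0, num_workers=5, steps=500,
                         eta=0.01, beta=0.9, lam=0.0, seed=0):
    """Simulate Signum with majority vote on one machine."""
    rng = random.Random(seed)
    d = len(x0)
    x = list(x0)
    # One momentum buffer per simulated worker.
    momenta = [[0.0] * d for _ in range(num_workers)]
    for _ in range(steps):
        votes = [0.0] * d  # V: sum of the workers' momentum signs
        for m in range(num_workers):
            g = grad_fn(x, rng)  # worker m's stochastic gradient
            v = momenta[m]
            for i in range(d):
                v[i] = (1 - beta) * g[i] + beta * v[i]
                votes[i] += sign(v[i])  # worker sends only the sign
        # Server broadcasts sign(V); every worker applies the same update.
        for i in range(d):
            x[i] -= eta * (sign(votes[i]) + lam * x[i])
    return x


def noisy_quadratic_grad(x, rng, noise=0.5):
    # Gradient of f(x) = 0.5 * ||x||^2 plus symmetric, unimodal noise.
    return [xi + rng.gauss(0.0, noise) for xi in x]


x_final = signum_majority_vote(noisy_quadratic_grad, [1.0, -2.0, 3.0])
```

Because every coordinate moves by exactly eta per step (when lam is 0), the step count must be at least the largest initial distance divided by eta for the iterate to reach the minimum. An odd number of workers avoids tied votes.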

Experimental Results

Research Questions

RQ1: Does signSGD with majority vote converge in non-convex settings under realistic gradient noise assumptions?
RQ2: How does majority voting affect robustness to Byzantine/adversarial worker behavior?
RQ3: What communication and wall-clock performance advantages does the method offer on large-scale datasets compared to full-precision SGD and other compression schemes?

Key Findings

The theoretical convergence rate of mini-batch signSGD matches the O(1/sqrt(N)) rate of SGD under the stated assumptions.
Majority vote provides Byzantine fault tolerance up to 50% adversarial workers with degraded but guaranteed convergence.
Empirical results show up to a 25% reduction in wall-clock time for ImageNet training with 7–15 AWS p3.2xlarge machines, at the cost of slightly worse generalisation.
Signum with majority vote can outperform some QSGD variants in communication cost while maintaining competitive convergence.
Gradient noise in practice is often unimodal and symmetric, supporting the assumptions and convergence of the method.
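The fault-tolerance finding above can be illustrated with a small Monte Carlo simulation of a single coordinate. This is a sketch under assumptions and not a reproduction of the paper's experiments: the true gradient sign is fixed at +1, each worker's sign estimate is correct with an illustrative probability `p_correct` of 0.9, and adversaries invert their estimate (one of the special cases named in the abstract). The function name `vote_accuracy` is hypothetical.

```python
import random


def vote_accuracy(num_workers, num_adversaries, p_correct=0.9,
                  trials=2000, seed=0):
    """Fraction of trials in which the majority vote recovers the true sign."""
    rng = random.Random(seed)
    correct_votes = 0
    for _ in range(trials):
        total = 0
        for m in range(num_workers):
            # Each worker's estimate of the true sign (+1), right w.p. p_correct.
            s = 1 if rng.random() < p_correct else -1
            if m < num_adversaries:
                s = -s  # adversary sends the inverted sign
            total += s
        if total > 0:  # majority vote agrees with the true sign
            correct_votes += 1
    return correct_votes / trials


acc_clean = vote_accuracy(15, 0)       # no adversaries
acc_minority = vote_accuracy(15, 5)    # one third adversarial
acc_majority = vote_accuracy(15, 10)   # two thirds adversarial
```

In this toy setting the vote stays mostly correct while adversaries are a minority and fails once they outnumber honest workers, which is the qualitative behaviour the 50% threshold describes. As the adversarial fraction approaches one half, the vote's accuracy degrades, consistent with the "degraded but guaranteed" convergence noted above.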

This review was generated by AI and checked by human editors.