QUICK REVIEW

[Paper Review] signSGD: Compressed Optimisation for Non-Convex Problems

Jeremy Bernstein, Yu-Xiang Wang|arXiv (Cornell University)|Feb 13, 2018

Stochastic Gradient Optimization Techniques | Cited by 88

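One-Sentence Summary

signSGD transmits only the sign of each gradient to cut communication cost in distributed non-convex optimization while achieving SGD-level convergence; majority vote enables 1-bit communication in both directions with provable variance reduction.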

ABSTRACT

Training large neural networks requires distributing learning across multiple workers, where the cost of communicating gradients can be a significant bottleneck. signSGD alleviates this problem by transmitting just the sign of each minibatch stochastic gradient. We prove that it can get the best of both worlds: compressed gradients and SGD-level convergence rate. The relative $\ell_1/\ell_2$ geometry of gradients, noise and curvature informs whether signSGD or SGD is theoretically better suited to a particular problem. On the practical side we find that the momentum counterpart of signSGD is able to match the accuracy and convergence speed of Adam on deep Imagenet models. We extend our theory to the distributed setting, where the parameter server uses majority vote to aggregate gradient signs from each worker enabling 1-bit compression of worker-server communication in both directions. Using a theorem by Gauss we prove that majority vote can achieve the same reduction in variance as full precision distributed SGD. Thus, there is great promise for sign-based optimisation schemes to achieve fast communication and fast convergence. Code to reproduce experiments is to be found at https://github.com/jxbz/signSGD .

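Motivation and Objectives

Identify gradient communication as the bottleneck in large-scale distributed deep learning.
Propose sign-based gradient updates that achieve SGD-level convergence under compression.
Develop a theory of non-convex optimization with biased sign updates.
Extend to the distributed setting, using majority vote to achieve 1-bit communication in both directions.
Explore the momentum variant (Signum), its convergence, and its practical performance.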

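Proposed Method

Proposes signSGD, whose update uses only the sign of the stochastic gradient.
Proposes Signum, which applies the sign to a momentum average of the gradients.
Analyzes convergence in the non-convex setting under coordinate-wise smoothness and coordinate-wise variance bounds.
Proposes a distributed majority vote scheme in which the parameter server aggregates 1-bit gradient signs from M workers.
Gives convergence bounds showing a rate comparable to SGD in certain regimes of gradient and noise density.
Extends the theoretical framework to Signum's momentum and derives convergence with a warmup period.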
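
The three update rules listed above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation (their code is linked in the abstract). The function names, the momentum default `beta=0.9`, and the handling of exact ties (a zero vote leaves that coordinate unchanged) are assumptions made for this example.

```python
import numpy as np

def signsgd_step(x, grad, lr):
    """signSGD: move every coordinate by lr against the sign of its gradient."""
    return x - lr * np.sign(grad)

def signum_step(x, m, grad, lr, beta=0.9):
    """Signum: update a momentum average of gradients, then step along its sign.

    beta=0.9 is an illustrative default, not a value taken from the summary.
    """
    m = beta * m + (1.0 - beta) * grad
    return x - lr * np.sign(m), m

def majority_vote_step(x, worker_grads, lr):
    """Distributed signSGD with majority vote.

    Each worker sends only sign(g) (1 bit per coordinate). The server sums the
    votes and sends back the sign of the sum (again 1 bit per coordinate).
    np.sign returns 0 on an exact tie, so a tied coordinate is not updated
    (an assumption of this sketch).
    """
    votes = np.sign(np.asarray(worker_grads))  # shape (M, d)
    aggregated = np.sign(votes.sum(axis=0))    # shape (d,)
    return x - lr * aggregated

# Toy usage: three workers voting on a 3-dimensional parameter vector.
x0 = np.zeros(3)
grads = [[1.0, -1.0, 2.0], [2.0, -3.0, -1.0], [-1.0, -2.0, -5.0]]
x1 = majority_vote_step(x0, grads, lr=0.1)
```

Every coordinate moves by exactly `lr` regardless of the gradient magnitude, which is why the geometry of gradients and noise matters for how this compares with SGD.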

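Experimental Results

Research Questions

RQ1: Under what conditions can sign-based gradient methods match the convergence rate of SGD in non-convex optimization?
RQ2: How do gradient and noise density across coordinates affect the performance of signSGD and Signum?
RQ3: Can majority vote achieve effective 1-bit communication in both directions without sacrificing convergence?
RQ4: How does momentum affect the bias-variance trade-off in sign-based methods?
RQ5: How do sign-based methods perform empirically against Adam and SGD on large-scale datasets such as CIFAR-10 and ImageNet?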

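Key Findings

signSGD can match the convergence rate of SGD under certain L1/L2 geometry and noise conditions.
In the distributed setting, majority vote achieves 1-bit communication in both directions and, under unimodal symmetric noise, reduces the effective gradient noise by a factor on the order of sqrt(M) for M workers, as full-precision distributed SGD does.
Signum (sign with momentum) converges and reaches performance close to Adam on large models, with competitive accuracy.
The theory highlights gradient and noise density: when gradients are dense, signSGD is more robust to sparse high-variance components; when gradients are sparse, SGD may be more robust to curvature and noise.
Experiments on CIFAR-10 and ImageNet show signSGD/Signum performing comparably to SGD/Adam; on ImageNet, Signum reaches similar results, with accuracy that may approach Adam's.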
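
The majority vote finding can be illustrated with a small Monte Carlo experiment on a single coordinate. The sketch below assumes Gaussian noise (one example of unimodal symmetric noise) and estimates how often the voted sign disagrees with the true gradient sign as the number of workers M grows. The function name, noise level, and trial count are hypothetical choices for illustration. It shows the trend only and does not reproduce the paper's proof.

```python
import numpy as np

def sign_error_rate(true_grad, sigma, num_workers, trials=20000, seed=0):
    """Estimate P(majority-vote sign != sign of the true gradient).

    Each worker observes true_grad plus independent Gaussian noise with
    standard deviation sigma (an assumed noise model) and votes with the sign
    of its observation. Use an odd num_workers to avoid tied votes.
    """
    rng = np.random.default_rng(seed)
    noisy = true_grad + sigma * rng.standard_normal((trials, num_workers))
    votes = np.sign(noisy).sum(axis=1)
    return float(np.mean(np.sign(votes) != np.sign(true_grad)))

# A weak signal (gradient 0.5) under unit-variance noise.
rates = {M: sign_error_rate(0.5, 1.0, M) for M in (1, 3, 9, 27)}
for M, rate in rates.items():
    print(f"M={M:2d}  estimated sign error rate = {rate:.3f}")
```

With a single worker the vote is wrong whenever the noise flips the sign. Adding workers makes a wrong majority increasingly unlikely, which is the effect the variance-reduction result formalizes.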

This review was generated by AI and reviewed by a human editor.