QUICK REVIEW

[Paper Review] Distributed Training with Heterogeneous Data: Bridging Median- and Mean-Based Algorithms

Xiangyi Chen, Tiancong Chen|arXiv (Cornell University)|Jun 4, 2019

Stochastic Gradient Optimization Techniques | References: 20 | Cited by: 32

ONE-SENTENCE SUMMARY

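This paper proposes a gradient correction mechanism based on noise perturbation that closes the gap between median-based and mean-based distributed optimization algorithms under heterogeneous data, so that signSGD and medianSGD converge globally even when the data are non-iid. The method keeps communication complexity low and guarantees convergence to stationary points under the heterogeneous data distributions typical of practical federated learning.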

ABSTRACT

Recently, there has been growing interest in the study of median-based algorithms for distributed non-convex optimization. Two prominent such algorithms are signSGD with majority vote, an effective approach for communication reduction via 1-bit compression on the local gradients, and medianSGD, an algorithm recently proposed to ensure robustness against Byzantine workers. The convergence analyses for these algorithms critically rely on the assumption that all the distributed data are drawn iid from the same distribution. However, in applications such as Federated Learning, the data across different nodes or machines can be inherently heterogeneous, which violates such an iid assumption. This work analyzes signSGD and medianSGD in distributed settings with heterogeneous data. We show that these algorithms are non-convergent whenever there is some disparity between the expected median and mean over the local gradients. To overcome this gap, we provide a novel gradient correction mechanism that perturbs the local gradients with noise, together with a series of results that provably close the gap between mean and median of the gradients. The proposed methods largely preserve nice properties of these methods, such as the low per-iteration communication complexity of signSGD, and further enjoy global convergence to stationary solutions. Our perturbation technique can be of independent interest when one wishes to estimate mean through a median estimator.
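The non-convergence claim in the abstract can be illustrated with a small sketch. The toy problem below is hypothetical and is not taken from the paper: three workers hold one-dimensional quadratic objectives with different minimizers, gradients are exact (no sampling noise), and the names `LOCAL_OPTIMA`, `local_gradients`, and `run_descent` are made up for this example. Because the median of the local gradients differs from their mean, median aggregation settles at a point where the gradient of the global objective is not zero.

```python
import numpy as np

# Hypothetical toy problem (not from the paper): worker i holds the local
# objective f_i(x) = 0.5 * (x - a_i)^2, so its gradient at x is x - a_i.
# The global objective is the average of the f_i, minimized at mean(a).
LOCAL_OPTIMA = np.array([0.0, 0.0, 3.0])  # heterogeneous data: mean 1, median 0


def local_gradients(x, optima=LOCAL_OPTIMA):
    """Exact (noise-free) gradient of every worker's local objective at x."""
    return x - optima


def run_descent(aggregate, x0=5.0, lr=0.1, steps=500):
    """Gradient descent where the server combines local gradients with `aggregate`."""
    x = x0
    for _ in range(steps):
        x -= lr * aggregate(local_gradients(x))
    return x


x_mean = run_descent(np.mean)      # mean-based aggregation (plain gradient descent)
x_median = run_descent(np.median)  # median-based aggregation

# Gradient of the global objective at each limit point.
true_grad_at_mean = np.mean(local_gradients(x_mean))
true_grad_at_median = np.mean(local_gradients(x_median))

print(f"mean aggregation   -> x = {x_mean:.4f}, global gradient = {true_grad_at_mean:.4f}")
print(f"median aggregation -> x = {x_median:.4f}, global gradient = {true_grad_at_median:.4f}")
```

In this toy setting, mean aggregation reaches the stationary point x = 1, while median aggregation stops at x = 0, the median of the local minimizers, where the global gradient equals -1.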

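RESEARCH MOTIVATION AND OBJECTIVES

Address the lack of convergence guarantees for median-based and sign-based distributed optimization algorithms when the data across workers are non-iid.
Close the theoretical gap between median-based methods (such as medianSGD) and mean-based methods (such as SGD) under data heterogeneity.
Retain desirable properties, such as the low communication complexity of signSGD and the Byzantine robustness of medianSGD, while guaranteeing convergence on non-iid data.
Build a unified theoretical framework that explains the implicit connection between signSGD and medianSGD through the sign of the median direction.
Develop a provably effective perturbation technique that aligns the median and the mean of the local gradients in heterogeneous settings.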
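The connection between signSGD and medianSGD mentioned in the objectives can be checked numerically. The sketch below assumes an odd number of workers and gradients with no exactly-zero entries; under those assumptions the majority vote over per-worker signs equals the sign of the coordinate-wise median. The function names and the synthetic gradients are illustrative only.

```python
import numpy as np


def majority_vote(grads):
    """signSGD-style aggregation: each worker sends sign(g_i); the server
    returns the coordinate-wise sign of the summed votes."""
    return np.sign(np.sign(grads).sum(axis=0))


def sign_of_median(grads):
    """Sign of the coordinate-wise median, as in medianSGD-style aggregation."""
    return np.sign(np.median(grads, axis=0))


rng = np.random.default_rng(0)
# Assumed setup: 7 workers (odd, so no ties), 1000 coordinates, and
# different per-worker offsets to mimic heterogeneous data.
offsets = rng.normal(size=(7, 1))
grads = offsets + rng.normal(size=(7, 1000))

directions_agree = np.array_equal(majority_vote(grads), sign_of_median(grads))
print("majority vote equals sign of median:", directions_agree)
```

The identity holds because, with an odd worker count, the median is one of the submitted values, and it is positive exactly when more than half of the values are positive.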

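PROPOSED METHOD

Introduce a noise perturbation mechanism that adds controlled noise to the local gradients so that the median of their distribution aligns with the mean.
Show theoretically that the perturbed gradients yield convergence by shrinking the gap between the expected median and the mean of the local gradients.
Derive convergence bounds for signSGD and medianSGD under heterogeneous data, proving global convergence to stationary points.
Analyze convergence rates in a stochastic approximation framework, assuming bounded variance and Lipschitz-continuous gradients.
Prove that the perturbation technique lets a median-based estimator estimate the mean, a result of independent interest.
Use coordinate-wise median and sign operations to preserve communication efficiency while ensuring robustness and convergence.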
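The perturbation idea can be sketched with a small Monte Carlo experiment. The code assumes independent zero-mean Gaussian noise added to each worker's gradient; this review does not state the noise distribution or scaling used in the paper, so the Gaussian choice, the three example gradients, and the helper `expected_median_gap` are assumptions made for illustration.

```python
import numpy as np


def expected_median_gap(local_grads, noise_std, trials=200_000, seed=0):
    """Monte Carlo estimate of |E[median(g_i + noise_i)] - mean(g_i)|.

    Assumption for this sketch: independent zero-mean Gaussian noise with
    standard deviation `noise_std` is added to each worker's gradient.
    """
    rng = np.random.default_rng(seed)
    noise = rng.normal(scale=noise_std, size=(trials, local_grads.size))
    perturbed_medians = np.median(local_grads + noise, axis=1)
    return abs(perturbed_medians.mean() - local_grads.mean())


# Hypothetical one-coordinate gradients from three workers: mean 1, median 0.
local_grads = np.array([0.0, 0.0, 3.0])

gaps = {std: expected_median_gap(local_grads, std) for std in (0.1, 1.0, 10.0)}
for std, gap in gaps.items():
    print(f"noise std {std:>4}: |E[median] - mean| ~ {gap:.3f}")
```

Up to Monte Carlo error, the estimated gap shrinks as the noise scale grows relative to the spread of the local gradients. The price is a noisier median on any single draw, so the noise scale trades bias against variance.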

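EXPERIMENTAL RESULTS

Research questions

RQ1: Can signSGD and medianSGD converge globally when the data across workers are non-iid, violating the standard iid assumption?
RQ2: Under data heterogeneity, what is the root cause of the non-convergence of median-based algorithms, which mean-based algorithms do not share, and how can it be resolved theoretically?
RQ3: Can a gradient correction mechanism be designed that keeps the communication efficiency of signSGD while ensuring convergence under heterogeneous data?
RQ4: Is there a theoretical connection between signSGD and medianSGD that explains their shared convergence behavior under perturbation?
RQ5: Can noise perturbation effectively close the gap between the median and the mean of the gradients in distributed non-convex optimization?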

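Key findings

The proposed noise perturbation mechanism ensures that signSGD and medianSGD converge globally under heterogeneous data, even when the expected median and the mean of the local gradients differ.
The convergence rate is $O(d^{3/4}/T^{1/4})$, where $d$ is the problem dimension and $T$ is the number of iterations, consistent with the optimal statistical rate for non-convex distributed optimization.
By keeping the sign operation and its 1-bit gradient compression, the method retains the low communication complexity of signSGD.
Theoretical analysis proves that the median of the perturbed gradients converges to the mean of the gradients, closing the gap between median-based and mean-based algorithms.
The perturbation technique lets a median-based estimator produce a robust estimate of the mean, a result of independent interest for distributed estimation.
Empirical validation on MNIST shows that the method is effective in a practical federated learning setting with heterogeneous data.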
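The 1-bit compression behind the communication claim can be sketched as follows. This is an illustrative encoding, not the paper's implementation: the helper names, the vector size, and the worker count are assumptions, and the noise perturbation is left out (it would be applied to each gradient before `compress_signs`).

```python
import numpy as np


def compress_signs(grad):
    """Pack the sign of each coordinate into one bit (1 = non-negative)."""
    return np.packbits(grad >= 0)


def decompress_signs(packed, dim):
    """Recover a +1/-1 vector of length `dim` from the packed bits."""
    bits = np.unpackbits(packed, count=dim)
    return bits.astype(np.int8) * 2 - 1


def majority_vote(sign_vectors):
    """Server side: coordinate-wise majority over the workers' sign vectors."""
    return np.sign(np.sum(sign_vectors, axis=0))


rng = np.random.default_rng(0)
dim, workers = 1024, 5  # assumed sizes; an odd worker count avoids tied votes
grads = rng.normal(size=(workers, dim)).astype(np.float32)

messages = [compress_signs(g) for g in grads]  # what each worker sends
votes = np.stack([decompress_signs(m, dim) for m in messages])
update_direction = majority_vote(votes)

print("bytes per worker, float32 gradient:", grads[0].nbytes)
print("bytes per worker, packed signs:   ", messages[0].nbytes)
```

With these sizes each worker sends 128 bytes instead of 4096, a 32-fold reduction relative to float32 gradients.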


This review was generated by AI and reviewed by a human editor.