QUICK REVIEW

[Paper Review] On the Convergence of FedAvg on Non-IID Data

Xiang Li, Kaixuan Huang|arXiv (Cornell University)|Jul 4, 2019

Privacy-Preserving Technologies in Data | References: 48 | Cited by: 1,011

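One-Sentence Summary

This paper proves an O(1/T) convergence rate for FedAvg on non-IID data for strongly convex and smooth problems, analyzes partial device participation, and shows that learning-rate decay is necessary for convergence when E>1.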

ABSTRACT

Federated learning enables a large amount of edge computing devices to jointly learn a model without data sharing. As a leading algorithm in this setting, Federated Averaging (\texttt{FedAvg}) runs Stochastic Gradient Descent (SGD) in parallel on a small subset of the total devices and averages the sequences only once in a while. Despite its simplicity, it lacks theoretical guarantees under realistic settings. In this paper, we analyze the convergence of \texttt{FedAvg} on non-iid data and establish a convergence rate of $\mathcal{O}(\frac{1}{T})$ for strongly convex and smooth problems, where $T$ is the number of SGDs. Importantly, our bound demonstrates a trade-off between communication-efficiency and convergence rate. As user devices may be disconnected from the server, we relax the assumption of full device participation to partial device participation and study different averaging schemes; low device participation rate can be achieved without severely slowing down the learning. Our results indicate that heterogeneity of data slows down the convergence, which matches empirical observations. Furthermore, we provide a necessary condition for \texttt{FedAvg} on non-iid data: the learning rate $\eta$ must decay, even if full-gradient is used; otherwise, the solution will be $\Omega(\eta)$ away from the optimal.

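Motivation and Objectives

Motivate federated learning under non-IID data and limited device participation.
Establish convergence guarantees for FedAvg without assuming that data are IID or that all devices are active.
Characterize how the number of local updates (E) and the sampling scheme affect convergence and communication cost.
Show that FedAvg requires learning-rate decay when multiple local updates are performed.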

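Proposed Approach

Model FedAvg as distributed optimization of the global objective $F(w) = \sum_k p_k F_k(w)$.
Analyze both full and partial device participation, introducing a sampling scheme S_t and an averaging rule.
Prove O(1/T) convergence under the assumption that each F_k is L-smooth and μ-strongly convex.
Derive an explicit bound on the number of communication rounds as a function of E, K, and the data heterogeneity Γ.
Prove that when E>1 the learning rate must decay to ensure convergence, even when full gradients are used.
Propose and compare sampling/averaging schemes that achieve convergence.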
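The setup above can be sketched on a toy problem. The following is a minimal illustration, not the paper's implementation: the one-dimensional quadratic objectives F_k(w) = 0.5 * a_k * (w - b_k)^2, the function names (`local_steps`, `global_minimizer`, `decaying_lr`, `fedavg`), and the choice to hold the learning rate fixed within a round are all assumptions made for the example. The sampling rule shown (K devices drawn with replacement with probabilities p_k, followed by a simple average) is one of the schemes the paper studies, and the step size mirrors the decaying form η_t = 2 / (μ(γ + t)) used in its analysis.

```python
import random


def local_steps(w, a_k, b_k, lr, num_steps):
    """Run full-gradient steps on the toy objective F_k(w) = 0.5 * a_k * (w - b_k)**2."""
    for _ in range(num_steps):
        w -= lr * a_k * (w - b_k)
    return w


def global_minimizer(a, b, p):
    """Closed-form minimizer of F(w) = sum_k p_k * F_k(w) for the toy quadratics."""
    numerator = sum(p_k * a_k * b_k for p_k, a_k, b_k in zip(p, a, b))
    denominator = sum(p_k * a_k for p_k, a_k in zip(p, a))
    return numerator / denominator


def decaying_lr(mu, L, E):
    """Schedule eta_t = 2 / (mu * (gamma + t)) with gamma = max(8 * L / mu, E)."""
    gamma = max(8.0 * L / mu, E)
    return lambda t: 2.0 / (mu * (gamma + t))


def fedavg(a, b, p, E, K, rounds, lr_schedule, seed=0):
    """Toy FedAvg with partial participation (hypothetical helper, for illustration)."""
    rng = random.Random(seed)
    device_ids = list(range(len(a)))
    w, t = 0.0, 0
    for _ in range(rounds):
        # Sample K devices with replacement; device k is drawn with probability p_k.
        chosen = rng.choices(device_ids, weights=p, k=K)
        # Assumption: the learning rate is held fixed within a communication round.
        lr = lr_schedule(t)
        # Every chosen device starts from the current global model and runs E local steps.
        local_models = [local_steps(w, a[k], b[k], lr, E) for k in chosen]
        # Simple average of the K returned models.
        w = sum(local_models) / K
        t += E
    return w
```

For example, with `a = [1.0, 2.0, 2.0, 3.0]`, `b = [0.0, 0.5, 1.0, 1.0]`, and `p = [0.1, 0.2, 0.3, 0.4]`, calling `fedavg(a, b, p, E=5, K=2, rounds=2000, lr_schedule=decaying_lr(1.0, 3.0, 5))` should land close to `global_minimizer(a, b, p)` even though only two devices are sampled per round. Here μ and L are taken as the smallest and largest a_k.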

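Experimental Results

Research Questions

RQ1: Can FedAvg achieve convergence guarantees when data across devices are non-IID and not every device participates in every round?
RQ2: What is the convergence rate of FedAvg on non-IID data under strong convexity and smoothness?
RQ3: How do the number of local update steps (E) and the participation size (K) affect the trade-off between convergence speed and communication cost?
RQ4: In the non-IID setting, does FedAvg require learning-rate decay to converge, and why?
RQ5: Which sampling and averaging schemes guarantee convergence, and how do data heterogeneity and data balance affect them?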

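Key Findings

FedAvg achieves an O(1/T) convergence rate on non-IID data for strongly convex and smooth problems.
Partial device participation slows convergence through increased variance, but FedAvg still converges under suitable conditions.
The best choice of E balances communication against convergence: E should be neither too small nor too large.
Data heterogeneity (non-IID data) slows convergence, consistent with empirical observations.
Even with full gradients, learning-rate decay is necessary to converge to the optimum when E>1.
Certain sampling/averaging schemes (for example, non-uniform sampling with replacement) also achieve the O(1/T) rate in the non-IID setting.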
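The learning-rate finding can be checked on a small example. The sketch below is a toy construction chosen for illustration and is not the example used in the paper's proof: two devices hold quadratics F_k(w) = 0.5 * a_k * (w - b_k)^2 with different curvatures, every device participates, and full gradients are used. The names `fedavg_full` and `optimum` are hypothetical.

```python
def fedavg_full(a, b, p, E, rounds, lr_schedule):
    """Full-participation FedAvg with full-gradient local steps on toy quadratics."""
    w = 0.0
    for r in range(rounds):
        # Assumption: the learning rate is held fixed within a round.
        lr = lr_schedule(r * E)
        local_models = []
        for a_k, b_k in zip(a, b):
            w_k = w  # each device starts from the global model
            for _ in range(E):
                w_k -= lr * a_k * (w_k - b_k)  # exact gradient of F_k
            local_models.append(w_k)
        # Weighted average with the device weights p_k.
        w = sum(p_k * w_k for p_k, w_k in zip(p, local_models))
    return w


def optimum(a, b, p):
    """Minimizer of F(w) = sum_k p_k * 0.5 * a_k * (w - b_k)**2."""
    return sum(pk * ak * bk for pk, ak, bk in zip(p, a, b)) / sum(
        pk * ak for pk, ak in zip(p, a)
    )


a, b, p = [1.0, 3.0], [0.0, 1.0], [0.5, 0.5]
w_star = optimum(a, b, p)

# Constant learning rate: exact when E = 1, biased when E > 1.
gap_const_e1 = abs(fedavg_full(a, b, p, 1, 2000, lambda t: 0.1) - w_star)
gap_const_e5 = abs(fedavg_full(a, b, p, 5, 2000, lambda t: 0.1) - w_star)

# Decaying learning rate of the form 2 / (mu * (gamma + t)) with mu = 1, gamma = 24.
gap_decay_e5 = abs(fedavg_full(a, b, p, 5, 2000, lambda t: 2.0 / (24.0 + t)) - w_star)

print(gap_const_e1, gap_const_e5, gap_decay_e5)
```

In this toy problem, a constant step with E = 1 reduces to plain gradient descent on F and reaches the optimum. With E = 5 the iterates settle at a different fixed point, because each device contracts toward its own minimizer at a curvature-dependent rate. Letting the step size decay shrinks that gap.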


This review was generated by AI and reviewed by human editors.