QUICK REVIEW

[Paper Review] On the Convergence of Local Descent Methods in Federated Learning

Farzin Haddadpour, Mehrdad Mahdavi|arXiv (Cornell University)|Oct 31, 2019

Stochastic Gradient Optimization Techniques | References: 38 | Cited by: 169

One-Sentence Summary

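This paper analyzes the convergence of local GD/SGD with periodic averaging for federated learning on heterogeneous data. It proves convergence rates and identifies how a bound on gradient diversity enables variance reduction and linear speedup. The analysis covers both centralized and networked settings, for general nonconvex objectives and for objectives satisfying the PL condition.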

ABSTRACT

In federated distributed learning, the goal is to optimize a global training objective defined over distributed devices, where the data shard at each device is sampled from a possibly different distribution (a.k.a. heterogeneous or non-i.i.d. data samples). In this paper, we generalize local stochastic and full gradient descent with periodic averaging, originally designed for homogeneous distributed optimization, to solve nonconvex optimization problems in federated learning. Although scant research is available on the effectiveness of local SGD in reducing the number of communication rounds in the homogeneous setting, its convergence and communication complexity in the heterogeneous setting is mostly demonstrated empirically and lacks thorough theoretical understanding. To bridge this gap, we demonstrate that by properly analyzing the effect of unbiased gradients and the sampling schema in the federated setting, under mild assumptions, the implicit variance reduction feature of local distributed methods generalizes to heterogeneous data shards and exhibits the best known convergence rates of the homogeneous setting, both for general nonconvex objectives and under the Polyak-Łojasiewicz (PL) condition (a generalization of strong convexity). Our theoretical results complement the recent empirical studies that demonstrate the applicability of local GD/SGD to federated learning. We also specialize the proposed local method for networked distributed optimization. To the best of our knowledge, the obtained convergence rates are the sharpest known to date on the convergence of local descent methods with periodic averaging for solving nonconvex federated optimization in both centralized and networked distributed optimization.

Motivation and Objectives

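Motivation: study communication-efficient federated optimization under heterogeneous data distributions.
Generalize local GD/SGD with periodic averaging to nonconvex objectives in the federated setting.
Establish convergence rates under bounded gradient diversity and the PL condition.
Specialize the results to centralized, decentralized (networked), and device-sampling federated configurations.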
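
For orientation, the two conditions named above are commonly written as shown below. This is a sketch in assumed notation, not a quotation from the paper: f_j is the local objective of device j, f = Σ_j q_j f_j is the global objective, f* is its minimum value, and λ and μ are positive constants. The ratio form of the weighted gradient diversity is the usual one and is assumed here.

```latex
% Bounded weighted gradient diversity (assumed ratio form)
\Lambda(w, q) \;=\;
  \frac{\sum_{j} q_j \,\lVert \nabla f_j(w) \rVert_2^2}
       {\bigl\lVert \sum_{j} q_j \,\nabla f_j(w) \bigr\rVert_2^2}
  \;\le\; \lambda

% Polyak-Lojasiewicz (PL) condition with constant \mu > 0
\tfrac{1}{2}\,\lVert \nabla f(w) \rVert_2^2 \;\ge\; \mu \,\bigl( f(w) - f^{*} \bigr)
```

By Jensen's inequality the ratio is at least 1, and it equals 1 when all devices report the same gradient, so larger values indicate more disagreement between devices.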

Proposed Method

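Proposes Local Federated Descent (LFD) with periodic averaging, parameterized by E (number of local updates), K (number of sampled devices), and q (device weights).
Specializes LFD to Local Federated GD (LFGD) and Local Federated SGD (LFSGD), covering both the full-gradient and stochastic-gradient settings.
Introduces the weighted gradient diversity Λ(w,q) to quantify heterogeneity, and derives convergence conditions on the learning rate and E.
Derives convergence guarantees for general nonconvex objectives and for nonconvex objectives satisfying the PL condition.
Extends the analysis to networked distributed optimization, where each device communicates with its direct neighbors.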
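
The method described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's exact algorithm: the function name `local_federated_descent`, the `grad_fns` interface, and the choice to sample K devices with replacement with probability q and then average their models uniformly are all hypothetical. Only the roles of E, K, and q come from the summary.

```python
import numpy as np

def local_federated_descent(grad_fns, q, w0, rounds, E, K, lr, rng):
    """Sketch of local descent with periodic averaging (hypothetical interface).

    grad_fns[j](w, rng) returns a full or stochastic gradient of device j's
    local objective at w. q holds device weights that sum to 1.
    """
    w = np.array(w0, dtype=float)
    q = np.asarray(q, dtype=float)
    for _ in range(rounds):
        # Assumed sampling scheme: K devices drawn with replacement, prob. q.
        sampled = rng.choice(len(grad_fns), size=K, replace=True, p=q)
        local_models = []
        for j in sampled:
            w_j = w.copy()
            for _ in range(E):  # E local updates, no communication
                w_j -= lr * grad_fns[j](w_j, rng)
            local_models.append(w_j)
        # Periodic averaging: one communication step per round.
        w = np.mean(local_models, axis=0)
    return w

def make_quadratic_device(center, noise=0.0):
    """Toy device with f_j(w) = 0.5 * ||w - center||^2 and optional noise."""
    center = np.asarray(center, dtype=float)
    def grad(w, rng):
        return (w - center) + noise * rng.standard_normal(w.shape)
    return grad

# Toy usage: three heterogeneous devices with different minimizers.
devices = [make_quadratic_device(c, noise=0.1) for c in ([-1.0], [0.0], [3.0])]
w_final = local_federated_descent(
    devices, q=np.full(3, 1 / 3), w0=[0.0],
    rounds=50, E=5, K=3, lr=0.1, rng=np.random.default_rng(0),
)
print(w_final)
```

Passing noise-free gradient functions corresponds to the full-gradient (LFGD-style) variant, while noisy ones correspond to the stochastic (LFSGD-style) variant. Larger E means fewer communication rounds for the same number of gradient steps, at the cost of more drift between devices when their data differ.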

Results

Research Questions

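RQ1: How does heterogeneity across local data shards affect the convergence of local descent with periodic averaging in federated learning?
RQ2: Under what conditions (learning rate, number of local updates, sampling) do local GD/SGD converge in the nonconvex FL setting?
RQ3: Under bounded gradient diversity, what are the convergence rates for nonconvex objectives and for objectives satisfying the PL condition?
RQ4: Do the results extend to networked (neighbor-based) distributed optimization and to the device-sampling setting?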

Key Findings

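Under bounded gradient diversity, local descent with periodic averaging converges, with rates that match or improve on prior work across the settings considered.
For nonconvex objectives satisfying the PL condition, the paper shows improved rates, for example an O(1/(KT)) dependence, compared with some earlier bounds.
The convergence rates hold for both centralized (parameter-server) and decentralized networked FL, and for both full-gradient and stochastic-gradient settings.
The choice of learning rate and number of local updates depends on the gradient diversity; when the diversity is controlled, linear speedup is achieved.
With appropriately tuned hyperparameters, the methods exhibit variance-reduction-like behavior without any explicit variance-reduction technique, which brings the analysis in line with empirical findings.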
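
Since several findings hinge on the gradient diversity being controlled, a small numeric illustration may help. The sketch below assumes the usual ratio form, Λ(w,q) = Σ_j q_j ||∇f_j(w)||² / ||Σ_j q_j ∇f_j(w)||², which may differ in detail from the paper's definition. The function name `weighted_gradient_diversity` is hypothetical.

```python
import numpy as np

def weighted_gradient_diversity(grads, q):
    """Assumed ratio form of the weighted gradient diversity.

    grads: array of shape (num_devices, dim), one local gradient per device,
           all evaluated at the same point w.
    q:     device weights of shape (num_devices,), summing to 1.
    """
    grads = np.asarray(grads, dtype=float)
    q = np.asarray(q, dtype=float)
    # Weighted average of squared local gradient norms.
    numerator = np.sum(q * np.sum(grads ** 2, axis=1))
    # Squared norm of the weighted average gradient.
    avg_grad = (q[:, None] * grads).sum(axis=0)
    denominator = np.sum(avg_grad ** 2)
    return numerator / denominator

q = np.array([0.5, 0.5])
aligned = weighted_gradient_diversity([[1.0, 0.0], [1.0, 0.0]], q)
orthogonal = weighted_gradient_diversity([[1.0, 0.0], [0.0, 1.0]], q)
print(aligned, orthogonal)
```

Identical local gradients give the smallest possible value, and the ratio grows as local gradients point in more divergent directions. It is unbounded when the weighted average gradient vanishes while the local gradients do not, which is why a bound on it acts as a heterogeneity assumption.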


This review was generated by AI and reviewed by a human editor.