QUICK REVIEW

[Paper Digest] Compressed and distributed least-squares regression: convergence rates with applications to Federated Learning

Philippenko, Constantin, Aymeric Dieuleveut|arXiv (Cornell University)|Aug 2, 2023

Stochastic Gradient Optimization Techniques|References: 49|Cited by: 102

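One-sentence summary

This paper gives a refined analysis of unbiased compression in distributed least-squares regression, showing that compression schemes with the same variance bound can still converge at different rates because they differ in regularity and in how they correlate coordinates. It proves that convergence is governed by the limit covariance of the additive noise, which generalizes the classical rate, and shows that quantization, despite lacking Lipschitz regularity, matches projection-based methods asymptotically.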

ABSTRACT

In this paper, we investigate the impact of compression on stochastic gradient algorithms for machine learning, a technique widely used in distributed and federated learning. We underline differences in terms of convergence rates between several unbiased compression operators, that all satisfy the same condition on their variance, thus going beyond the classical worst-case analysis. To do so, we focus on the case of least-squares regression (LSR) and analyze a general stochastic approximation algorithm for minimizing quadratic functions relying on a random field. We consider weak assumptions on the random field, tailored to the analysis (specifically, expected Hölder regularity), and on the noise covariance, enabling the analysis of various randomizing mechanisms, including compression. We then extend our results to the case of federated learning. More formally, we highlight the impact on the convergence of the covariance $\mathfrak{C}_{\mathrm{ania}}$ of the additive noise induced by the algorithm. We demonstrate despite the non-regularity of the stochastic field, that the limit variance term scales with $\mathrm{Tr}(\mathfrak{C}_{\mathrm{ania}} H^{-1})/K$ (where $H$ is the Hessian of the optimization problem and $K$ the number of iterations) generalizing the rate for the vanilla LSR case where it is $\sigma^2 \mathrm{Tr}(H H^{-1}) / K = \sigma^2 d / K$ (Bach and Moulines, 2013). Then, we analyze the dependency of $\mathfrak{C}_{\mathrm{ania}}$ on the compression strategy and ultimately its impact on convergence, first in the centralized case, then in two heterogeneous FL frameworks.

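Research motivation and objectives

Understand why different unbiased compression operators lead to different convergence rates in distributed learning even though they satisfy the same variance bound.
Analyze how compressor regularity (for example, Lipschitz versus Hölder continuity) and correlation between coordinates shape convergence behavior.
Extend the analysis to heterogeneous federated learning settings with non-i.i.d. client data and memory-based optimization.
Derive asymptotic convergence rates that depend on the limit covariance of the additive noise induced by compression.
Provide a refined theoretical framework that goes beyond worst-case analysis and distinguishes compressors satisfying the same variance assumption.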

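Proposed method

Analyze a general stochastic approximation algorithm for minimizing quadratic functions, driven by a random field that satisfies only a weak regularity assumption (expected Hölder continuity).
Introduce the limit noise covariance matrix $ C^\infty_{\text{ania}} = \lim_{k \to \infty} \mathbb{E}[\xi^{\text{add}}_k \otimes \xi^{\text{add}}_k] $ (written $\mathfrak{C}_{\mathrm{ania}}$ in the abstract), which determines asymptotic convergence.
Prove convergence under decreasing step sizes using a Lyapunov function that combines the distance to the optimal parameter with the deviation of the memory terms.
Apply a conditional central limit theorem to show that $ \sqrt{K} \eta_K \to \mathcal{N}(0, H_F^{-1} C^\infty_{\text{ania}} H_F^{-1}) $, which ties convergence to the noise covariance.
Model compression as an unbiased operator whose variance increase is bounded by $ \omega $, and analyze its effect on $ C^\infty_{\text{ania}} $.
Consider two federated learning frameworks, (1) with memory and (2) without memory, suited to scenarios with client heterogeneity and concept drift.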
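To make the "same variance bound, different covariance" idea concrete, here is a minimal NumPy sketch of three standard unbiased compressors (Rand-h, independent sparsification, and partial participation). It is an illustration under stated assumptions, not the authors' code: the input vector `x`, the matrix `H`, and all function names are hypothetical, and the covariance is computed for one fixed input, whereas the paper averages over the distribution of the gradients. The closed-form covariances follow from elementary computations on the sampling schemes.

```python
import numpy as np

def rand_h(x, h, rng):
    """Rand-h: keep h coordinates drawn uniformly without replacement, rescaled by d/h."""
    d = x.size
    kept = rng.choice(d, size=h, replace=False)
    out = np.zeros(d)
    out[kept] = x[kept] * d / h
    return out

def sparsify(x, p, rng):
    """Sparsification: keep each coordinate independently with probability p, rescaled by 1/p."""
    mask = rng.random(x.size) < p
    return np.where(mask, x / p, 0.0)

def partial_participation(x, p, rng):
    """Partial participation: send the whole vector with probability p, rescaled by 1/p."""
    return x / p if rng.random() < p else np.zeros(x.size)

# Exact covariance of the compressed vector for a FIXED input x.
def cov_rand_h(x, h):
    d = x.size
    cov = -(d - h) / (h * (d - 1)) * np.outer(x, x)   # negative coordinate correlations
    np.fill_diagonal(cov, (d / h - 1) * x**2)
    return cov

def cov_sparsify(x, p):
    return (1 / p - 1) * np.diag(x**2)                # coordinates are independent

def cov_partial_participation(x, p):
    return (1 / p - 1) * np.outer(x, x)               # rank one: all-or-nothing

# Hypothetical example values (not taken from the paper).
x = np.array([1.0, 2.0, -1.0, 3.0])
d, h = x.size, 2
p = h / d
omega = d / h - 1                                     # shared variance constant

covs = {
    "rand_h": cov_rand_h(x, h),
    "sparsify": cov_sparsify(x, p),
    "partial_participation": cov_partial_participation(x, p),
}

# Hypothetical non-diagonal, positive definite Hessian.
H = 0.5 * np.eye(d) + 0.5 * np.ones((d, d))
H_inv = np.linalg.inv(H)

# Tr(Cov) is the quantity controlled by the variance bound omega * ||x||^2.
variances = {name: np.trace(c) for name, c in covs.items()}
# Tr(Cov H^{-1}) mimics the quantity that drives the asymptotic rate.
weighted = {name: np.trace(c @ H_inv) for name, c in covs.items()}

for name in covs:
    print(f"{name:>22}: Tr(Cov) = {variances[name]:.3f}, Tr(Cov H^-1) = {weighted[name]:.3f}")
```

All three operators have the same `Tr(Cov)`, equal to `omega * ||x||^2`, yet `Tr(Cov H^-1)` differs because the off-diagonal structure of each covariance interacts differently with `H^-1`. With a diagonal `H` the three values would coincide for a fixed `x`, which is why the sketch uses a non-diagonal one.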

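Experimental results

Research questions

RQ1: How do compression schemes with the same variance bound differ in convergence behavior?
RQ2: What role does compressor regularity (for example, Lipschitz versus Hölder continuity) play in determining the convergence rate?
RQ3: In distributed least-squares regression, how does the correlation structure between compressed coordinates affect convergence?
RQ4: How does the limit noise covariance $ C^\infty_{\text{ania}} $ depend on the compression strategy and on client heterogeneity?
RQ5: Can memory-based methods reduce the impact of heterogeneity and improve convergence over standard compressed algorithms?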

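Key findings

The asymptotic convergence rate is governed by $ \text{Tr}(C^\infty_{\text{ania}} H_F^{-1}) / K $, which generalizes the classical $ \sigma^2 d / K $ rate of least-squares regression.
Although they are not Lipschitz in squared expectation, quantization-based compressors reach an asymptotic convergence rate comparable to that of projection-based compressors, because their limit noise covariances are similar.
Partial participation with probability $ h/d $ and Rand-h compression satisfy the same variance condition, yet partial participation is more robust on ill-conditioned problems.
When the features are standardized, quantization outperforms sparsification and random coordinate selection; when the features are independent and normalized, however, quantization performs worse than these alternatives.
Under client heterogeneity and concept drift, memory-based methods reduce the effective noise covariance $ C^\infty_{\text{ania}} $ and therefore converge better than their memoryless variants.
The limit noise covariance $ C^\infty_{\text{ania}} $ is characterized explicitly as $ C((C_i, p_{\Theta'_i})_{i=1}^N) $, where $ p_{\Theta'_i} $ is the distribution of the gradient deviation $ g^*_{k,i} - \nabla F_i(w^*) $.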
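As a sanity check on the first finding, the general rate can be reduced to the classical one quoted in the abstract. The worked step below assumes the uncompressed, well-specified setting in which the additive-noise covariance is $ \sigma^2 H_F $; that substitution is the only assumption made here, and $ I_d $ denotes the $ d \times d $ identity.

$$
\frac{\text{Tr}(C^\infty_{\text{ania}} H_F^{-1})}{K}
\;\overset{C^\infty_{\text{ania}} = \sigma^2 H_F}{=}\;
\frac{\sigma^2 \, \text{Tr}(H_F H_F^{-1})}{K}
= \frac{\sigma^2 \, \text{Tr}(I_d)}{K}
= \frac{\sigma^2 d}{K}
$$

Any compressor that changes the structure of $ C^\infty_{\text{ania}} $, and not only its trace, therefore changes the rate through its interaction with $ H_F^{-1} $.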

This digest was generated by AI and reviewed by a human editor.