QUICK REVIEW

[Paper Review] Compressed and distributed least-squares regression: convergence rates with applications to Federated Learning

Constantin Philippenko, Aymeric Dieuleveut | arXiv (Cornell University) | Aug 2, 2023

Stochastic Gradient Optimization Techniques | References: 49 | Citations: 102

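Summary in brief

This paper gives a refined analysis of unbiased compression in distributed least-squares regression. It shows that compression schemes with the same variance bound can still converge at different rates because they differ in regularity and in the correlation between coordinates. Convergence depends on the limiting covariance of the additive noise, which generalizes the classical rate, and quantization is shown to match projection-based methods asymptotically even though it lacks Lipschitz regularity.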

ABSTRACT

In this paper, we investigate the impact of compression on stochastic gradient algorithms for machine learning, a technique widely used in distributed and federated learning. We underline differences in terms of convergence rates between several unbiased compression operators, that all satisfy the same condition on their variance, thus going beyond the classical worst-case analysis. To do so, we focus on the case of least-squares regression (LSR) and analyze a general stochastic approximation algorithm for minimizing quadratic functions relying on a random field. We consider weak assumptions on the random field, tailored to the analysis (specifically, expected Hölder regularity), and on the noise covariance, enabling the analysis of various randomizing mechanisms, including compression. We then extend our results to the case of federated learning. More formally, we highlight the impact on the convergence of the covariance $\mathfrak{C}_{\mathrm{ania}}$ of the additive noise induced by the algorithm. We demonstrate, despite the non-regularity of the stochastic field, that the limit variance term scales with $\mathrm{Tr}(\mathfrak{C}_{\mathrm{ania}} H^{-1})/K$ (where $H$ is the Hessian of the optimization problem and $K$ the number of iterations), generalizing the rate for the vanilla LSR case where it is $\sigma^2 \mathrm{Tr}(H H^{-1}) / K = \sigma^2 d / K$ (Bach and Moulines, 2013). Then, we analyze the dependency of $\mathfrak{C}_{\mathrm{ania}}$ on the compression strategy and ultimately its impact on convergence, first in the centralized case, then in two heterogeneous FL frameworks.
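
As a quick check on the reduction to the vanilla case, here is a sketch under an assumption the abstract does not spell out: the standard well-specified model $y = \langle w^*, x \rangle + \varepsilon$, with noise $\varepsilon$ of variance $\sigma^2$ independent of $x$, and $H = \mathbb{E}[x x^\top]$. Under that assumption the gradient noise at the optimum is $\varepsilon x$, its covariance is $\sigma^2 H$, and the general rate collapses to the classical one:

$$ \frac{\mathrm{Tr}(\mathfrak{C}_{\mathrm{ania}} H^{-1})}{K} = \frac{\mathrm{Tr}(\sigma^2 H H^{-1})}{K} = \frac{\sigma^2 \, \mathrm{Tr}(I_d)}{K} = \frac{\sigma^2 d}{K}. $$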

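Motivation and objectives

To understand how different unbiased compression operators with the same variance bound affect convergence rates in distributed learning.
To analyze the role that compressor regularity (e.g., Lipschitz continuity versus Hölder continuity) and correlation between coordinates play in convergence behavior.
To extend the analysis to heterogeneous federated learning settings with non-i.i.d. client data and memory-based optimization.
To derive asymptotic convergence rates that depend on the limiting covariance of the additive noise induced by compression.
To provide a refined theoretical framework that goes beyond worst-case analysis and can distinguish between compressors satisfying the same variance assumption.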

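Proposed method

Analyzes a general stochastic approximation algorithm for minimizing quadratic functions, driven by a random field that satisfies only a weak regularity assumption (expected Hölder regularity).
Introduces the limiting noise covariance matrix $ C^\infty_{\text{ania}} = \lim_{k \to \infty} \mathbb{E}[\xi^{\text{add}}_k \otimes \xi^{\text{add}}_k] $, which governs the asymptotic convergence.
Proves convergence under decreasing step sizes using a Lyapunov function that combines the parameter distance with the deviation of the memory term.
Applies a conditional central limit theorem to show that $ \sqrt{K} \eta_K \to \mathcal{N}(0, H_F^{-1} C^\infty_{\text{ania}} H_F^{-1}) $, tying convergence to the noise covariance.
Models compression as an unbiased operator with bounded variance inflation $ \omega $ and analyzes its effect on $ C^\infty_{\text{ania}} $.
Considers two federated learning frameworks, (1) with memory and (2) without memory, each under client heterogeneity and concept shift.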
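
To make the "unbiased operator with bounded variance inflation $ \omega $" model concrete, here is a minimal NumPy sketch of two textbook compressors. The function names (`rand_h`, `quantize_1bit`, `empirical_moments`) and the specific operators are illustrative assumptions, not code or definitions taken from the paper. Both operators satisfy $ \mathbb{E}[\mathcal{C}(x)] = x $ and $ \mathbb{E}\|\mathcal{C}(x) - x\|^2 \le \omega \|x\|^2 $, with $ \omega = d/h - 1 $ for the random-coordinate operator and $ \omega \le \sqrt{d} - 1 $ for the one-level stochastic quantizer.

```python
import numpy as np


def rand_h(x, h, rng):
    """Keep h coordinates chosen uniformly at random, rescaled by d/h (unbiased)."""
    d = x.size
    out = np.zeros_like(x)
    idx = rng.choice(d, size=h, replace=False)
    out[idx] = x[idx] * (d / h)
    return out


def quantize_1bit(x, rng):
    """One-level stochastic quantization (unbiased).

    Coordinate i becomes ||x|| * sign(x_i) with probability |x_i| / ||x||,
    and 0 otherwise.
    """
    norm = np.linalg.norm(x)
    if norm == 0.0:
        return np.zeros_like(x)
    keep = rng.random(x.size) < np.abs(x) / norm
    return norm * np.sign(x) * keep


def empirical_moments(compress, x, n_samples, rng):
    """Monte Carlo estimate of E[C(x)] and of E||C(x) - x||^2 / ||x||^2."""
    samples = np.stack([compress(x, rng) for _ in range(n_samples)])
    mean = samples.mean(axis=0)
    var_ratio = np.mean(np.sum((samples - x) ** 2, axis=1)) / np.sum(x ** 2)
    return mean, var_ratio


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal(8)
    for name, op in [("rand-2", lambda v, r: rand_h(v, 2, r)),
                     ("1-bit quantization", quantize_1bit)]:
        mean, ratio = empirical_moments(op, x, 20000, rng)
        print(name, "max bias:", np.abs(mean - x).max(), "variance ratio:", ratio)
```

The variance ratio is the only quantity a worst-case analysis sees. The review's point is that operators sharing a bound on this ratio can still induce different covariance matrices, and therefore different asymptotic rates.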

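Experimental results

Research questions

RQ1: How do compression schemes with the same variance bound differ in their convergence behavior?
RQ2: What role does compressor regularity (e.g., Lipschitz versus Hölder continuity) play in the convergence rate?
RQ3: How does the correlation structure between coordinates affect convergence in distributed least-squares regression?
RQ4: How does the limiting noise covariance $ C^\infty_{\text{ania}} $ depend on the compression strategy and on client heterogeneity?
RQ5: Can memory-based methods mitigate the effect of heterogeneity and improve convergence compared with standard compressed algorithms?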

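Key findings

The asymptotic convergence rate is governed by $ \text{Tr}(C^\infty_{\text{ania}} H_F^{-1}) / K $, which generalizes the classical rate $ \sigma^2 d / K $.
Despite lacking Lipschitz regularity, quantization-based compressors achieve the same asymptotic convergence rate as projection-based compressors, because both yield the same limiting noise covariance.
Partial participation with probability $ h/d $ and Rand-$h$ compression satisfy the same variance condition, yet they differ in how robustly they converge on ill-conditioned problems.
When the features are standardized, quantization outperforms sparsification and random coordinate selection; when the features are independent and normalized, it performs worse than both.
In the presence of client heterogeneity and concept shift, memory-based methods reduce the effective noise covariance $ C^\infty_{\text{ania}} $ and converge better than their memoryless counterparts.
The limiting noise covariance $ C^\infty_{\text{ania}} $ is characterized explicitly as $ C((C_i, p_{\Theta'_i})_{i=1}^N) $, where $ p_{\Theta'_i} $ denotes the distribution of the gradient deviation $ g^*_{k,i} - \nabla F_i(w^*) $.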
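
The sketch below illustrates how two operators with the same variance bound can give different values of the rate constant $ \text{Tr}(C H^{-1}) $. It rests on assumptions that are not from the paper: the uncompressed gradient noise has second moment $ M = \sigma^2 H $ (the well-specified vanilla case), and independent Bernoulli sparsification with keep probability $ p $ stands in for Rand-$h$ with $ p = h/d $ (the two sample coordinates differently). All function names are hypothetical. The covariance formulas follow directly from the definitions: sending everything with probability $ p $ gives $ M/p $, while keeping each coordinate independently gives $ M + (1/p - 1)\,\mathrm{Diag}(M) $.

```python
import numpy as np


def rate_constant(cov, H):
    """Tr(cov H^{-1}), the constant multiplying 1/K in the asymptotic rate."""
    return float(np.trace(np.linalg.solve(H, cov)))


def cov_partial_participation(M, p):
    """Second moment when the whole vector is sent w.p. p (scaled by 1/p)."""
    return M / p


def cov_bernoulli_sparsification(M, p):
    """Second moment when each coordinate is kept independently w.p. p (scaled by 1/p)."""
    return M + (1.0 / p - 1.0) * np.diag(np.diag(M))


def equicorrelated_hessian(d, rho):
    """Unit-diagonal Hessian with correlation rho between every pair of features."""
    return (1.0 - rho) * np.eye(d) + rho * np.ones((d, d))


if __name__ == "__main__":
    d, p, sigma2 = 5, 0.4, 1.0
    for rho in (0.0, 0.5, 0.9, 0.99):
        H = equicorrelated_hessian(d, rho)
        M = sigma2 * H  # assumed second moment of the uncompressed gradient noise
        pp = rate_constant(cov_partial_participation(M, p), H)
        sp = rate_constant(cov_bernoulli_sparsification(M, p), H)
        print(f"rho={rho}: partial participation {pp:.2f}, sparsification {sp:.2f}")
```

Under these assumptions the partial-participation constant is $ \sigma^2 d / p $ whatever the conditioning of $ H $, while the sparsification constant is $ \sigma^2 (d + (1/p - 1)\,\text{Tr}(\mathrm{Diag}(H) H^{-1})) $, which grows as the features become more correlated. The two coincide when $ H $ is diagonal.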

This review was generated by AI and checked by a human editor.