QUICK REVIEW

[Paper Review] D$^2$: Decentralized Training over Decentralized Data

Hanlin Tang, Xiangru Lian|arXiv (Cornell University)|Mar 19, 2018

Stochastic Gradient Optimization Techniques | References: 35 | Citations: 185

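One-Line Summary

D$^2$ is a variance-reduction extension of D-PSGD that stays robust when the data variance across workers is large; it converges faster than D-PSGD and approaches the performance of centralized SGD.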

ABSTRACT

While training a machine learning model using multiple workers, each of which collects data from their own data sources, it would be most useful when the data collected from different workers can be unique and different. Ironically, recent analysis of decentralized parallel stochastic gradient descent (D-PSGD) relies on the assumption that the data hosted on different workers are not too different. In this paper, we ask the question: Can we design a decentralized parallel stochastic gradient descent algorithm that is less sensitive to the data variance across workers? In this paper, we present D$^2$, a novel decentralized parallel stochastic gradient descent algorithm designed for large data variance among workers (imprecisely, "decentralized" data). The core of D$^2$ is a variance reduction extension of the standard D-PSGD algorithm, which improves the convergence rate from $O\left({\sigma \over \sqrt{nT}} + {(n\zeta^2)^{\frac{1}{3}} \over T^{2/3}}\right)$ to $O\left({\sigma \over \sqrt{nT}}\right)$ where $\zeta^{2}$ denotes the variance among data on different workers. As a result, D$^2$ is robust to data variance among workers. We empirically evaluated D$^2$ on image classification tasks where each worker has access to only the data of a limited set of labels, and find that D$^2$ significantly outperforms D-PSGD.

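Motivation and Objectives

Motivate decentralized training for the case where data is highly non-identical across workers.
Develop a variance-reduction mechanism built into D-PSGD that reduces the effect of the outer variance (the variance among the workers' data).
Establish theoretical convergence guarantees and show an improved convergence rate.
Empirically validate D$^2$ on image classification tasks with non-uniform label distributions.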

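Proposed Method

Each worker stores its gradient and local model from the previous iteration and combines them linearly with the current gradient and model, which adds a variance-reduction component to D-PSGD.
The update rule forms each local update from a combination of the current and previous gradients and then averages it over neighboring workers, which mitigates the data variance across workers.
The global update is $X_{t+1} = (2X_t - X_{t-1} - \gamma G(X_t; \xi_t) + \gamma G(X_{t-1}; \xi_{t-1}))W$.
The averaged iterate is shown to follow variance-reduced dynamics, which gives improved convergence that does not depend on the global data variance $\zeta^2$.
The stated assumptions include Lipschitz-continuous gradients, bounded variance on each worker, a symmetric consensus matrix with a spectral gap, and conditions on the network topology.
Theoretical convergence guarantees and corollaries show a convergence rate better than that of D-PSGD.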
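
The update rule above can be sketched on a toy problem. This is a minimal illustration, not the authors' implementation. It assumes scalar models, noise-free gradients of the hypothetical local objectives f_i(x) = 0.5 (x - c_i)^2, a ring of eight workers with mixing weights 1/4, 1/2, 1/4, and a plain gradient-then-average first step (the review does not say how the iteration is initialized). All function names are illustrative. A D-PSGD-style update, written in its commonly cited form X_{t+1} = X_t W - γ G(X_t), is included for contrast.

```python
def ring_average(values):
    """Multiply by a symmetric ring mixing matrix W (weights 1/4, 1/2, 1/4)."""
    n = len(values)
    return [0.25 * values[(i - 1) % n] + 0.5 * values[i] + 0.25 * values[(i + 1) % n]
            for i in range(n)]

def local_gradients(models, centers):
    """Gradient of the toy objective f_i(x) = 0.5 * (x - c_i)^2 at each worker's model."""
    return [x - c for x, c in zip(models, centers)]

def run_d2(x0, centers, gamma, steps):
    """D^2-style update: X_{t+1} = (2 X_t - X_{t-1} - gamma G_t + gamma G_{t-1}) W."""
    x_prev = list(x0)
    g_prev = local_gradients(x_prev, centers)
    # Assumed first step: one local gradient step followed by neighbor averaging.
    x = ring_average([xp - gamma * gp for xp, gp in zip(x_prev, g_prev)])
    for _ in range(steps - 1):
        g = local_gradients(x, centers)
        combined = [2 * xi - xpi - gamma * gi + gamma * gpi
                    for xi, xpi, gi, gpi in zip(x, x_prev, g, g_prev)]
        x_prev, g_prev, x = x, g, ring_average(combined)
    return x

def run_dpsgd(x0, centers, gamma, steps):
    """D-PSGD-style update for contrast: X_{t+1} = X_t W - gamma G_t."""
    x = list(x0)
    for _ in range(steps):
        g = local_gradients(x, centers)
        x = [m - gamma * gi for m, gi in zip(ring_average(x), g)]
    return x

def spread(models):
    """Gap between the largest and smallest local model (0 means full consensus)."""
    return max(models) - min(models)

# Hypothetical, deliberately different local optima: large variance across workers.
centers = [0.0, 10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0]
x0 = [0.0] * len(centers)

d2_models = run_d2(x0, centers, gamma=0.1, steps=500)
dpsgd_models = run_dpsgd(x0, centers, gamma=0.1, steps=500)

print("global optimum:", sum(centers) / len(centers))
print("D^2 spread across workers:   ", spread(d2_models))
print("D-PSGD spread across workers:", spread(dpsgd_models))
```

In this noise-free toy setting the D$^2$ workers all end at the average of the c_i, while the constant-step D-PSGD workers settle at different values. The example only illustrates the mechanism on one small instance and says nothing about the paper's experiments.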

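Experimental Results

Research Questions

RQ1: Can a decentralized SGD algorithm be designed that is robust to large data variance across workers?
RQ2: Does a variance-reduction strategy built into D-PSGD improve the convergence rate from $O(\sigma/\sqrt{nT}) + O((n\zeta^2)^{1/3}/T^{2/3})$ to $O(\sigma/\sqrt{nT})$?
RQ3: Under what conditions does D$^2$ achieve a linear speedup in the number of workers?
RQ4: How does D$^2$ compare empirically with D-PSGD and centralized SGD when workers hold non-overlapping datasets or data restricted to a limited set of labels?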
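
To make the rate comparison in RQ2 concrete, the two leading-order terms can be evaluated numerically. This is only a sketch: the constants hidden by the big-O notation are dropped, and the values of σ, ζ², n, and T are made up for illustration rather than taken from the paper. The function names are hypothetical.

```python
import math

def stochastic_term(sigma, n, T):
    """Leading term sigma / sqrt(n T), shared by both rates."""
    return sigma / math.sqrt(n * T)

def heterogeneity_term(zeta_sq, n, T):
    """Extra D-PSGD term (n zeta^2)^(1/3) / T^(2/3), driven by variance across workers."""
    return (n * zeta_sq) ** (1.0 / 3.0) / T ** (2.0 / 3.0)

# Hypothetical values chosen only to show the shape of the comparison.
sigma, zeta_sq, n, T = 1.0, 1000.0, 8, 1000

dpsgd_bound = stochastic_term(sigma, n, T) + heterogeneity_term(zeta_sq, n, T)
d2_bound = stochastic_term(sigma, n, T)

print("D-PSGD leading terms:", dpsgd_bound)
print("D^2 leading term:    ", d2_bound)
```

The shared term depends on n and T only through the product nT, so doubling the number of workers halves the iterations needed to reach the same value of that term. This is the sense in which RQ3 speaks of a linear speedup.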

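Key Findings

D$^2$ achieves a convergence rate of $O(\sigma/\sqrt{nT})$, unlike the D-PSGD rate, which depends on $\zeta^2$ (the outer variance).
The variance-reduction component removes the dependence of the asymptotic rate on the global data variance among workers.
Under suitable conditions, the theory shows a linear speedup in the number of workers.
Experiments on image classification tasks in which each worker holds data for only a limited set of labels show that D$^2$ significantly outperforms D-PSGD and approaches centralized performance, most notably in the unshuffled (high-variance) setting.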
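
The limited-label setting described above can be mimicked by partitioning a dataset by class before distributing it to workers. The helper below is a hypothetical sketch: the name `partition_by_label` and the assignment of contiguous class blocks to workers are assumptions made for illustration, not the paper's exact protocol.

```python
def partition_by_label(labels, num_workers):
    """Split sample indices so that each worker only sees a block of classes."""
    classes = sorted(set(labels))
    # Map the k-th class to a worker so that classes are spread in contiguous blocks.
    owner = {c: (k * num_workers) // len(classes) for k, c in enumerate(classes)}
    shards = [[] for _ in range(num_workers)]
    for index, label in enumerate(labels):
        shards[owner[label]].append(index)
    return shards

# Toy dataset: 100 samples with 10 class labels, split across 5 workers.
labels = [i % 10 for i in range(100)]
shards = partition_by_label(labels, num_workers=5)

for worker, shard in enumerate(shards):
    seen = sorted({labels[i] for i in shard})
    print(f"worker {worker}: {len(shard)} samples, labels {seen}")
```

With more workers than classes some shards stay empty, so a real experiment would need a different assignment rule in that case.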

This review was generated by AI and checked by a human editor.