QUICK REVIEW

[論文レビュー] Distributed Statistical Machine Learning in Adversarial Settings: Byzantine Gradient Descent

Yudong Chen, Lili Su|arXiv (Cornell University)|May 16, 2017

Stochastic Gradient Optimization Techniques参考文献 24被引用数 142

ひとこと要約

論文は Byzantine Gradient Descent を導入し、最大で約 2(1+ε)q の Byzantine ワーカーを許容し、誤差 ∼max{√(dq/N), √(d/N)} で指数的収束を log N ラウンドごとに達成する堅牢な分散学習アルゴリズム。

ABSTRACT

We consider the problem of distributed statistical machine learning in adversarial settings, where some unknown and time-varying subset of working machines may be compromised and behave arbitrarily to prevent an accurate model from being learned. This setting captures the potential adversarial attacks faced by Federated Learning -- a modern machine learning paradigm that is proposed by Google researchers and has been intensively studied for ensuring user privacy. Formally, we focus on a distributed system consisting of a parameter server and $m$ working machines. Each working machine keeps $N/m$ data samples, where $N$ is the total number of samples. The goal is to collectively learn the underlying true model parameter of dimension $d$. In classical batch gradient descent methods, the gradients reported to the server by the working machines are aggregated via simple averaging, which is vulnerable to a single Byzantine failure. In this paper, we propose a Byzantine gradient descent method based on the geometric median of means of the gradients. We show that our method can tolerate $q \le (m-1)/2$ Byzantine failures, and the parameter estimate converges in $O(\log N)$ rounds with an estimation error of $\sqrt{d(2q+1)/N}$, hence approaching the optimal error rate $\sqrt{d/N}$ in the centralized and failure-free setting. The total computational complexity of our algorithm is of $O((Nd/m) \log N)$ at each working machine and $O(md + kd \log^3 N)$ at the central server, and the total communication cost is of $O(m d \log N)$. We further provide an application of our general results to the linear regression problem. A key challenge arises in the above problem is that Byzantine failures create arbitrary and unspecified dependency among the iterations and the aggregated gradients. We prove that the aggregated gradient converges uniformly to the true gradient function.

研究の動機と目的

分散統計学習を敵対的 (Byzantine) 故障が存在する状況、例えば Federated Learning のような設定で動機付ける。
Byzantine 故障を許容する堅牢な勾配集約法を開発する。
Byzantine 故障下での収束保証を証明し推定誤差を特徴づける。
提案手法の計算コストと通信コストを分析する。
このアプローチを説明するための線形回帰への応用を提供する。

提案手法

サーバがバッチ平均と幾何中央値に基づく堅牢な計画で勾配を集約する Byzantine Gradient Descent を提案する。
m 個のワーク機械を k バッチに分割し、勾配のバッチ平均を計算する。
これらの k バッチ平均の幾何中央値を計算して更新のための集約勾配を形成する。
強凸性とリプシッツ連結勾配仮定の下で η = L/(2M^2) を選んだステップサイズで勾配降下ステップを使用する。
formal conver gence theorem を提供する（log N ラウンドで指数収束し、誤差が √(dq/N) と √(d/N) によって増加する形式）
計算コストは各ワーカーで O((Nd/m) log N)、パラメータサーバーで O(md + qd log^3 N)、通信コストは O(md log N)。

実験結果

リサーチクエスチョン

RQ1各ワーカーが局所データを使用しつつ、分散学習アルゴリズムは Byzantin e (任意の) 故障を耐えられるか？
RQ2勾配を頑健に組み合わせ、Byzantine の影響を緩和しつつ収束を崩さない集約規則は何か？
RQ3分散学習における Byzantine 故障下の収束速度と統計的誤差境界は？
RQ4故障耐性と統計的精度のバランスをとるためにシステムパラメータ（k, q, m, N, d）はどう選ぶべきか？
RQ5線形回帰のような具体的な問題にこの手法はどう適用されるか？

主な発見

提案された Byzantine Gradient Descent 法は、任意の固定された ε>0 に対して最大で 2(1+ε)q ≤ m の Byzantine 故障を許容する。
推定量は O(log N) ラウンドで収束し、誤差境界は max{√(dq/N), √(d/N)}。
ミニマックス最適率 √(d/N) は Byzantine 設定で √q の因子まで達成可能。
総計算コストは各ワーカーで O((Nd/m) log N)、パラメータサーバーで O(md + qd log^3 N)、通信コストは O(md log N)。
線形回帰の場合、この枠組みは敵対的なワーカーに対して適用性と頑健性を示す。

より良い研究を、今すぐ始めましょう

論文設計から論文執筆まで、研究時間を劇的に削減しましょう。

クレジットカード登録不要

このレビューはAIが作成し、人間の編集者が確認しました。