QUICK REVIEW

[Paper Review] Asynchronous Byzantine Machine Learning (the case of SGD)

Georgios Damaskinos, El Mahdi El Mhamdi|arXiv (Cornell University)|Feb 22, 2018

Stochastic Gradient Optimization Techniques | Cited by 26

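One-line summary

Kardam is the first asynchronous Byzantine-resilient stochastic gradient descent (SGD) algorithm. It converges almost surely even under unbounded communication delays and tolerates up to 1/3 Byzantine workers. It combines a Lipschitz-based gradient filter, which detects and suppresses malicious gradient updates, with a staleness-aware dampening scheme that weights each gradient according to how stale it is. The resulting slowdown relative to non-Byzantine-resilient SGD is bounded by f/n, where f is the number of Byzantine workers tolerated and n is the total number of workers.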

ABSTRACT

Asynchronous distributed machine learning solutions have proven very effective so far, but always assuming perfectly functioning workers. In practice, some of the workers can however exhibit Byzantine behavior, caused by hardware failures, software bugs, corrupt data, or even malicious attacks. We introduce \emph{Kardam}, the first distributed asynchronous stochastic gradient descent (SGD) algorithm that copes with Byzantine workers. Kardam consists of two complementary components: a filtering and a dampening component. The first is scalar-based and ensures resilience against $\frac{1}{3}$ Byzantine workers. Essentially, this filter leverages the Lipschitzness of cost functions and acts as a self-stabilizer against Byzantine workers that would attempt to corrupt the progress of SGD. The dampening component bounds the convergence rate by adjusting to stale information through a generic gradient weighting scheme. We prove that Kardam guarantees almost sure convergence in the presence of asynchrony and Byzantine behavior, and we derive its convergence rate. We evaluate Kardam on the CIFAR-100 and EMNIST datasets and measure its overhead with respect to non Byzantine-resilient solutions. We empirically show that Kardam does not introduce additional noise to the learning procedure but does induce a slowdown (the cost of Byzantine resilience) that we both theoretically and empirically show to be less than $f/n$, where $f$ is the number of Byzantine failures tolerated and $n$ the total number of workers. Interestingly, we also empirically observe that the dampening component is interesting in its own right for it enables to build an SGD algorithm that outperforms alternative staleness-aware asynchronous competitors in environments with honest workers.

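Motivation and objectives

Address the lack of a Byzantine-resilient asynchronous SGD algorithm for realistic distributed machine learning systems with unbounded delays.
Design a solution that tolerates up to 1/3 Byzantine workers without requiring synchronous coordination or waiting for a quorum.
Maintain high convergence efficiency despite asynchrony and malicious behavior by combining gradient filtering with staleness-aware dampening.
Prove almost sure convergence theoretically and derive a convergence rate that scales well with the number of Byzantine failures.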

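Proposed method

Introduces a scalar-based gradient filter that leverages the Lipschitz continuity of the cost function to detect and suppress gradients from Byzantine workers.
Adopts a generic gradient weighting scheme (a dampening function) that scales each gradient according to its staleness, reducing the influence of stale updates.
Uses an adaptive learning rate schedule to balance convergence speed and stability under noisy, delayed gradients.
The parameter server aggregates a gradient only after filtering and dampening have been applied, which provides resilience and convergence under arbitrary Byzantine behavior.
The theoretical analysis proves almost sure convergence and derives a convergence rate of O(µmax / √T · |ξ| · M + χ · µmax / T + d · σ² + 2DKσ / √d + K²D²), where χ bounds the effect of staleness.
Adopts a novel convergence analysis framework that uses a Laplacian-like argument and adaptive learning rates to account for both staleness and Byzantine noise.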
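The filter-then-dampen flow described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The review only states that the filter is scalar-based and relies on Lipschitzness, so the acceptance rule used here (compare a worker's empirical Lipschitz coefficient against the (n − f)/n quantile of recently observed coefficients) is an assumption of this sketch, as are all function names. The dampening weight uses the inverse form Λ(τ) = 1/(1+τ) purely as an example.

```python
import math


def empirical_lipschitz(grad_prev, grad_new, params_prev, params_new):
    """Scalar ratio ||g_new - g_prev|| / ||x_new - x_prev|| (hypothetical helper)."""
    num = math.sqrt(sum((a - b) ** 2 for a, b in zip(grad_new, grad_prev)))
    den = math.sqrt(sum((a - b) ** 2 for a, b in zip(params_new, params_prev)))
    # A change in gradient with no change in parameters is treated as suspicious.
    return num / den if den > 0 else float("inf")


def passes_filter(candidate, recent_coeffs, n, f):
    """Assumed rule: accept if candidate <= the (n - f)/n quantile of recent coefficients."""
    if not recent_coeffs:
        return True  # nothing to compare against yet
    ordered = sorted(recent_coeffs)
    rank = math.ceil(len(ordered) * (n - f) / n)
    idx = min(len(ordered) - 1, max(0, rank - 1))
    return candidate <= ordered[idx]


def dampening(staleness):
    """Example dampening weight: 1 / (1 + staleness)."""
    return 1.0 / (1.0 + staleness)


def apply_update(params, grad, lr, staleness):
    """One dampened SGD step: x <- x - lr * dampening(staleness) * g."""
    weight = dampening(staleness)
    return [p - lr * weight * g for p, g in zip(params, grad)]
```

In this sketch the server would call `passes_filter` first and only call `apply_update` for accepted gradients, so a gradient with an abnormally large growth rate is dropped before it can move the model, and an accepted but stale gradient is scaled down rather than discarded.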

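Experimental results

Research questions

RQ1: Can an asynchronous SGD algorithm remain convergent and resilient to Byzantine failures even under unbounded communication delays?
RQ2: How can malicious gradients be filtered out without requiring synchronous coordination or waiting for a quorum?
RQ3: How should stale gradients be weighted to improve resilience while preserving convergence?
RQ4: Does the dampening mechanism improve performance in environments with only honest workers, independently of Byzantine resilience?
RQ5: What is the theoretical convergence rate of such a resilient asynchronous SGD algorithm, and how does it scale with the number of Byzantine workers?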

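Key findings

Kardam guarantees almost sure convergence under asynchrony with unbounded communication delays, even when up to 1/3 of the workers are Byzantine.
The slowdown caused by Byzantine resilience is bounded by f/n, where f is the number of Byzantine workers tolerated and n is the total number of workers, so the cost of resilience scales well.
Experimentally, Kardam does not introduce additional noise into the learning process, indicating that the Byzantine-resilience mechanism does not degrade model quality.
The dampening component alone outperforms baseline asynchronous SGD methods such as DynSGD, most notably in environments with honest workers that have high staleness.
An exponential dampening function (Λ(τ) = exp(−αβ√τ)) was found, both theoretically and experimentally, to converge faster than the inverse-linear function (Λ(τ) = 1/(1+τ)).
On CIFAR-100 and EMNIST, Kardam achieves competitive accuracy and loss with only a small slowdown (bounded by f/n), supporting its practical viability.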
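To make the two dampening functions and the f/n bound concrete, the following Python sketch evaluates them for a few staleness values. It only illustrates the formulas quoted above and does not reproduce any experiment. The values of α and β are not given in this review, so the defaults below are arbitrary assumptions, and `slowdown_bound` is a hypothetical helper that simply returns the ratio f/n.

```python
import math


def inverse_dampening(tau):
    """Inverse-linear dampening: 1 / (1 + tau)."""
    return 1.0 / (1.0 + tau)


def exponential_dampening(tau, alpha=0.2, beta=1.0):
    """Exponential dampening: exp(-alpha * beta * sqrt(tau)).

    alpha and beta are placeholder values chosen for illustration only.
    """
    return math.exp(-alpha * beta * math.sqrt(tau))


def slowdown_bound(f, n):
    """Upper bound f / n on the slowdown, for f tolerated Byzantine workers out of n."""
    if not 0 <= f < n:
        raise ValueError("need 0 <= f < n")
    return f / n


if __name__ == "__main__":
    # Print the weight each function assigns to gradients of increasing staleness.
    for tau in (0, 1, 4, 16):
        print(tau, inverse_dampening(tau), exponential_dampening(tau))
    print("slowdown bound for f=3, n=10:", slowdown_bound(3, 10))
```

Both functions assign weight 1 to a fresh gradient (τ = 0) and smaller weights as staleness grows. How quickly the exponential form decays relative to the inverse-linear one depends on the product αβ, which is why those constants are flagged as assumptions here.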


This review was generated by AI and checked by a human editor.