QUICK REVIEW

[Paper Review] signSGD with Majority Vote is Communication Efficient And Fault Tolerant

Jeremy Bernstein, Jiawei Zhao|arXiv (Cornell University)|Oct 11, 2018

Retinal Imaging and Analysis | Cited by 125

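One-Line Summary

This paper introduces signSGD with majority vote, a robust distributed optimization method that uses 1-bit communication, converges under noise, and tolerates up to 50% Byzantine workers. It also reports empirical speedups on large-scale tasks.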

ABSTRACT

Training neural networks on large datasets can be accelerated by distributing the workload over a network of machines. As datasets grow ever larger, networks of hundreds or thousands of machines become economically viable. The time cost of communicating gradients limits the effectiveness of using such large machine counts, as may the increased chance of network faults. We explore a particularly simple algorithm for robust, communication-efficient learning---signSGD. Workers transmit only the sign of their gradient vector to a server, and the overall update is decided by a majority vote. This algorithm uses $32\times$ less communication per iteration than full-precision, distributed SGD. Under natural conditions verified by experiment, we prove that signSGD converges in the large and mini-batch settings, establishing convergence for a parameter regime of Adam as a byproduct. Aggregating sign gradients by majority vote means that no individual worker has too much power. We prove that unlike SGD, majority vote is robust when up to 50% of workers behave adversarially. The class of adversaries we consider includes as special cases those that invert or randomise their gradient estimate. On the practical side, we built our distributed training system in Pytorch. Benchmarking against the state of the art collective communications library (NCCL), our framework---with the parameter server housed entirely on one machine---led to a 25% reduction in time for training resnet50 on Imagenet when using 15 AWS p3.2xlarge machines.
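The 32x figure in the abstract follows from replacing each 32-bit float with a single sign bit. The sketch below illustrates that arithmetic in plain Python. The helper names `pack_signs` and `unpack_signs` are hypothetical, and the convention that a zero component counts as positive is an assumption made here, not something stated in the review.

```python
import random
import struct

def pack_signs(grad):
    """Pack one sign bit per component (1 = non-negative, 0 = negative)."""
    packed = bytearray((len(grad) + 7) // 8)
    for i, g in enumerate(grad):
        if g >= 0:  # assumed convention: zero is treated as positive
            packed[i // 8] |= 1 << (i % 8)
    return bytes(packed)

def unpack_signs(packed, n):
    """Recover a list of +1/-1 values of length n from the packed bits."""
    return [1 if (packed[i // 8] >> (i % 8)) & 1 else -1 for i in range(n)]

random.seed(0)
grad = [random.gauss(0.0, 1.0) for _ in range(1024)]

# Full-precision message: 1024 components as 32-bit floats.
full_precision = struct.pack(f"{len(grad)}f", *grad)
# Sign-only message: 1024 components as 1 bit each.
packed = pack_signs(grad)

ratio = len(full_precision) / len(packed)
print(len(full_precision), len(packed), ratio)  # 4096 128 32.0
```

The ratio only counts payload bytes for one direction of one message. It ignores framing, latency, and how the server aggregates, all of which affect wall-clock time in a real system.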

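Motivation and Objectives

Motivates the need for fast, robust, and communication-efficient distributed training of large neural networks.
Proposes a simple update based on gradient signs aggregated by majority vote, which compresses communication and improves fault tolerance.
Provides convergence theory under realistic assumptions for mini-batch training and a Byzantine fault model.
Demonstrates practical benefits and trade-offs through large-scale experiments on ImageNet and language models.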

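Proposed Method

Each worker computes a mini-batch gradient and sends only the sign of its momentum to the parameter server.
The server aggregates the signs by majority vote and broadcasts the resulting majority sign back to the workers.
Each worker then applies the update x <- x - eta * (sign(V) + lambda x), where V is the server's tally of the workers' momentum signs.
Provides a theoretical convergence analysis for non-convex objectives, assuming unimodal and symmetric gradient noise (Assumptions 1-4).
Shows robustness to blind multiplicative adversaries that make up a fraction α < 1/2 of the workers (Theorem 2).
Implements Signum in PyTorch with 1-bit tensor compression and compares it against NCCL in large-scale experiments.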
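The worker and server steps listed above can be sketched as a single-process simulation. This is a minimal illustration in plain Python, not the authors' PyTorch system. The momentum form v <- (1 - beta) * g + beta * v, the tie rule sign(0) = +1, the function name `signum_majority_vote_step`, and the toy quadratic objective are all assumptions introduced here.

```python
import random

def sign(z):
    """Sign with the assumed convention sign(0) = +1."""
    return 1 if z >= 0 else -1

def signum_majority_vote_step(x, momenta, grads, eta, beta, lam):
    """One round: workers update momentum and vote; everyone applies sign(V)."""
    d = len(x)
    votes = [0] * d
    for v, g in zip(momenta, grads):
        for i in range(d):
            v[i] = (1 - beta) * g[i] + beta * v[i]  # worker-local momentum (in place)
            votes[i] += sign(v[i])                  # 1-bit message sent to the server
    # Server side: votes plays the role of V; sign(V) is broadcast back.
    return [x[i] - eta * (sign(votes[i]) + lam * x[i]) for i in range(d)]

random.seed(0)
M, d = 7, 5                       # odd worker count avoids tied votes
x = [3.0, -2.0, 1.5, -4.0, 0.5]
momenta = [[0.0] * d for _ in range(M)]
start = sum(abs(xi) for xi in x)

for step in range(200):
    # Toy objective f(x) = 0.5 * ||x||^2: the true gradient is x, plus Gaussian noise.
    grads = [[xi + random.gauss(0.0, 1.0) for xi in x] for _ in range(M)]
    x = signum_majority_vote_step(x, momenta, grads, eta=0.05, beta=0.9, lam=0.0)

end = sum(abs(xi) for xi in x)
print(f"L1 distance to optimum: start={start:.2f}, end={end:.2f}")
```

Because every coordinate moves by exactly eta per round, the iterate does not settle at the optimum but hovers around it at a scale set by the step size. Decaying eta over time is the usual remedy.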

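Experimental Results

Research Questions

RQ1: Does signSGD with majority vote converge in the non-convex setting under realistic gradient-noise assumptions?
RQ2: How does majority voting affect robustness to Byzantine/adversarial worker behavior?
RQ3: Compared with full-precision SGD and other compression schemes, what are the communication and wall-clock benefits on large datasets?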

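Key Findings

Under the stated assumptions, the theoretical convergence rate of mini-batch signSGD matches that of SGD at O(1/sqrt(N)).
Majority vote provides Byzantine fault tolerance against up to 50% adversarial workers (α < 1/2), guaranteeing convergence at a degraded rate.
ImageNet training on 7–15 AWS p3.2xlarge machines shows a measured wall-clock time reduction of up to 25%, accompanied by a slight loss in generalization performance.
Signum with majority vote beats several QSGD variants on communication cost while keeping convergence competitive.
In practice, gradient noise is often unimodal and symmetric, which supports the assumptions and the method's convergence.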
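The 50% threshold in the fault-tolerance finding can be illustrated with the simplest adversary the abstract mentions, one that inverts its sign. The sketch below is a deliberately simplified, noise-free case in which all honest workers agree. The paper's guarantee concerns noisy honest workers and convergence rates, which this toy does not model. The function names are hypothetical.

```python
def majority_vote(signs):
    """Return the sign of the tally; ties resolve to +1 (assumed convention)."""
    return 1 if sum(signs) >= 0 else -1

def vote_with_inverters(honest_sign, num_workers, num_adversaries):
    """Honest workers send honest_sign; adversaries send its negation."""
    honest = [honest_sign] * (num_workers - num_adversaries)
    inverted = [-honest_sign] * num_adversaries
    return majority_vote(honest + inverted)

M = 9            # odd worker count, so no tied votes
true_sign = -1   # the sign every honest worker reports for this coordinate

# Vote outcome as the number of sign-inverting adversaries grows from 0 to M.
outcomes = {k: vote_with_inverters(true_sign, M, k) for k in range(M + 1)}

for k, result in outcomes.items():
    status = "correct" if result == true_sign else "flipped"
    print(f"adversaries={k}/{M}: vote={result:+d} ({status})")
```

With 9 workers the vote stays correct for 0 to 4 adversaries and flips from 5 onward, which matches the α < 1/2 condition. With noisy honest workers, each adversary effectively cancels one honest vote, so the margin shrinks before the threshold is reached.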

This review was generated by AI and checked by a human editor.