QUICK REVIEW

[Paper Review] DoubleSqueeze: Parallel Stochastic Gradient Descent with Double-Pass Error-Compensated Compression

Hanlin Tang, Xiangru Lian|arXiv (Cornell University)|May 15, 2019

Stochastic Gradient Optimization Techniques|Cited by 94

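One-Line Summary

DoubleSqueeze analyzes and proves the convergence of parallel error-compensated SGD with two-pass compression (on the workers and on the parameter server), showing linear speedup and improved tolerance to compression bias and noise.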

ABSTRACT

A standard approach in large scale machine learning is distributed stochastic gradient training, which requires the computation of aggregated stochastic gradients over multiple nodes on a network. Communication is a major bottleneck in such applications, and in recent years, compressed stochastic gradient methods such as QSGD (quantized SGD) and sparse SGD have been proposed to reduce communication. It was also shown that error compensation can be combined with compression to achieve better convergence in a scheme that each node compresses its local stochastic gradient and broadcast the result to all other nodes over the network in a single pass. However, such a single pass broadcast approach is not realistic in many practical implementations. For example, under the popular parameter server model for distributed learning, the worker nodes need to send the compressed local gradients to the parameter server, which performs the aggregation. The parameter server has to compress the aggregated stochastic gradient again before sending it back to the worker nodes. In this work, we provide a detailed analysis on this two-pass communication model and its asynchronous parallel variant, with error-compensated compression both on the worker nodes and on the parameter server. We show that the error-compensated stochastic gradient algorithm admits three very nice properties: 1) it is compatible with an \emph{arbitrary} compression technique; 2) it admits an improved convergence rate than the non error-compensated stochastic gradient methods such as QSGD and sparse SGD; 3) it admits linear speedup with respect to the number of workers. The empirical study is also conducted to validate our theoretical results.

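Motivation and Objectives

Reduce the communication bottleneck in distributed stochastic gradient training.
Extend error compensation to both the workers and the parameter server in the two-pass communication model.
Prove convergence and linear speedup of the proposed algorithm, DoubleSqueeze, for non-convex losses.
Provide empirical validation that supports the theoretical convergence results and the practical bandwidth savings.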

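Proposed Method

Introduces DoubleSqueeze, in which both the workers and the parameter server apply error-compensated compression to the gradients they transmit.
Uses a compression operator Q_ω[·], which may be biased or unbiased, together with error vectors δ^{(i)} on the workers and δ on the server to compensate for the information lost in compression.
Shows that the global update can be written as x_{t+1} = x_t - γ∇f(x_t) + γξ_t - γΩ_{t-1} + γΩ_t, where Ω_t captures the compression error and ξ_t captures the stochastic gradient noise.
Proves that, under suitable assumptions (Lipschitz-continuous gradients, bounded variance, bounded compression error), DoubleSqueeze achieves a convergence rate with linear speedup in the number of workers n.
A corollary gives a rate of O(σ/√(nT)) plus additional terms that depend on the compression-error bound ε and on T, indicating faster convergence with more parallelism and tolerance to compression error.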
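The two-pass scheme described above can be sketched as a single-process simulation. This is a minimal illustration, not the authors' implementation. The toy quadratic objective, the top-k compressor, the function names (`top_k`, `double_squeeze`, `make_noisy_quadratic`), and all hyperparameters are assumptions chosen for the example. The sketch also assumes Ω_t is the server error plus the average worker error, and checks numerically that the applied update equals the averaged gradient plus Ω_{t-1} - Ω_t, which is the telescoping structure behind the global update formula.

```python
import numpy as np

def top_k(v, k):
    """Keep the k largest-magnitude entries of v and zero the rest (a biased compressor)."""
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out

def double_squeeze(grad_fn, x0, n_workers, steps, lr, k, rng):
    """Toy simulation of two-pass error-compensated SGD (hypothetical helper)."""
    x = x0.copy()
    worker_err = np.zeros((n_workers, x.size))  # delta^{(i)}: one error vector per worker
    server_err = np.zeros(x.size)               # delta: error vector on the server
    omega_prev = np.zeros(x.size)               # assumed Omega_{t-1}
    max_residual = 0.0
    for _ in range(steps):
        grads = np.stack([grad_fn(x, rng) for _ in range(n_workers)])
        # Pass 1: each worker compresses its error-corrected gradient.
        corrected = grads + worker_err
        msgs = np.stack([top_k(c, k) for c in corrected])
        worker_err = corrected - msgs
        # Pass 2: the server averages, adds its own error, and compresses again.
        agg = msgs.mean(axis=0) + server_err
        update = top_k(agg, k)
        server_err = agg - update
        # Check: update == mean gradient + Omega_{t-1} - Omega_t (up to rounding).
        omega = server_err + worker_err.mean(axis=0)
        identity = grads.mean(axis=0) + omega_prev - omega
        max_residual = max(max_residual, float(np.abs(update - identity).max()))
        omega_prev = omega
        x = x - lr * update
    return x, max_residual

def make_noisy_quadratic(target, sigma):
    """Stochastic gradient of 0.5 * ||x - target||^2 with Gaussian noise."""
    def grad_fn(x, rng):
        return (x - target) + sigma * rng.standard_normal(x.size)
    return grad_fn

rng = np.random.default_rng(0)
target = np.linspace(-1.0, 1.0, 20)
x_final, max_residual = double_squeeze(
    make_noisy_quadratic(target, sigma=0.1),
    x0=np.zeros(20), n_workers=4, steps=3000, lr=0.02, k=5, rng=rng,
)
dist_init = float(np.linalg.norm(target))
dist_final = float(np.linalg.norm(x_final - target))
print(f"initial distance {dist_init:.3f}, final distance {dist_final:.3f}")
```

Only 5 of 20 coordinates cross the wire in each direction, yet the dropped mass is kept in the error vectors and resent later, which is why the iterates still approach the target in this toy setting.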

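Experimental Results

Research Questions

RQ1 Can error compensation be extended effectively to both the workers and the parameter server in the two-pass compression setting?
RQ2 Does parallel two-pass error-compensated SGD achieve linear speedup in the number of workers?
RQ3 For non-convex losses, how does DoubleSqueeze compare with SGD without error compensation and with other compressed SGD methods?
RQ4 Which compression operators (biased or unbiased) can be used within DoubleSqueeze while preserving convergence?
RQ5 What practical bandwidth savings and convergence behavior are observed on common models and datasets?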

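Key Findings

DoubleSqueeze converges with a linear speedup proportional to the number of workers n.
The method is more tolerant of compression bias and noise than SGD without error compensation, improving convergence in compressed settings.
With full compression on both sides, communication per iteration takes only n rounds, yielding significant bandwidth savings.
The theoretical results apply to non-convex losses and match SGD-like rates in the presence of compression.
Empirical results for ResNet-18 on CIFAR-10 show convergence similar to uncompressed SGD, with faster time per iteration under bandwidth constraints and better results than methods without compensation.
With 1-bit and top-k compression, training and evaluation performance remain competitive while a large speedup is maintained under restricted network conditions.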
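To make the 1-bit and top-k operators mentioned above concrete, here is a minimal sketch of both, with a rough per-message size estimate. The exact operator definitions used in the paper's experiments are not given in this review, so the sign-times-mean-magnitude form of 1-bit compression, the 32-bit float width, the index encoding for top-k, and the helper names (`one_bit`, `top_k`, `payload_bits`) are all assumptions made for illustration.

```python
import numpy as np

def one_bit(v):
    """Assumed 1-bit form: send only signs plus one scale (the mean magnitude)."""
    return np.abs(v).mean() * np.sign(v)

def top_k(v, k):
    """Keep the k largest-magnitude entries of v and zero the rest."""
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out

def payload_bits(d, scheme, k=None, float_bits=32):
    """Rough message size in bits for a length-d gradient (hypothetical estimate)."""
    if scheme == "dense":
        return d * float_bits
    if scheme == "one_bit":
        return d + float_bits                    # one sign bit per entry plus the scale
    if scheme == "top_k":
        index_bits = max(1, (d - 1).bit_length())  # bits needed to address one entry
        return k * (float_bits + index_bits)
    raise ValueError(f"unknown scheme: {scheme}")

v = np.array([1.0, -3.0, 2.0])
sign_msg = one_bit(v)    # every entry has magnitude mean(|v|) = 2
sparse_msg = top_k(v, 1) # only the -3 entry survives

d = 1000
for scheme, kwargs in [("dense", {}), ("one_bit", {}), ("top_k", {"k": 10})]:
    print(scheme, payload_bits(d, scheme, **kwargs), "bits")
```

Both operators are biased, which is the case where the error vectors matter most: whatever the compressor drops in one iteration is carried forward rather than lost.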

This review was generated by AI and checked by a human editor.