QUICK REVIEW

[Paper Review] QSGD: Communication-Optimal Stochastic Gradient Descent, with Applications to Training Neural Networks

Dan Alistarh, Demjan Grubic|arXiv (Cornell University)|Oct 7, 2016

Stochastic Gradient Optimization Techniques | Cited by 4

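Summary in Brief

QSGD is a communication-optimal variant of stochastic gradient descent that uses quantized gradient updates and provably converges, which makes it applicable to training deep neural networks. Its per-iteration communication cost can be sublinear in the model dimension, it preserves or slightly improves model accuracy, and on 16 GPUs it trains ResNet-152 on ImageNet up to 1.8x faster.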

ABSTRACT

Parallel implementations of stochastic gradient descent (SGD) have received significant research attention, thanks to excellent scalability properties of this algorithm, and to its efficiency in the context of training deep neural networks. A fundamental barrier for parallelizing large-scale SGD is the fact that the cost of communicating the gradient updates between nodes can be very large. Consequently, lossy compression heuristics have been proposed, by which nodes only communicate quantized gradients. Although effective in practice, these heuristics do not always provably converge, and it is not clear whether they are optimal. In this paper, we propose Quantized SGD (QSGD), a family of compression schemes which allow the compression of gradient updates at each node, while guaranteeing convergence under standard assumptions. QSGD allows the user to trade off compression and convergence time: it can communicate a sublinear number of bits per iteration in the model dimension, and can achieve asymptotically optimal communication cost. We complement our theoretical results with empirical data, showing that QSGD can significantly reduce communication cost, while being competitive with standard uncompressed techniques on a variety of real tasks. In particular, experiments show that gradient quantization applied to training of deep neural networks for image classification and automated speech recognition can lead to significant reductions in communication cost, and end-to-end training time. For instance, on 16 GPUs, we are able to train a ResNet-152 network on ImageNet 1.8x faster to full accuracy. Of note, we show that there exist generic parameter settings under which all known network architectures preserve or slightly improve their full accuracy when using quantization.

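Motivation and Objectives

Address the high communication cost of distributed stochastic gradient descent when training deep neural networks.
Develop compression schemes that guarantee convergence under standard assumptions.
Provide a tunable trade-off between communication efficiency and convergence speed.
Achieve asymptotically optimal communication cost in distributed training.
Verify empirically that quantization preserves or improves model accuracy across a range of architectures and tasks.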

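Proposed Method

QSGD introduces a family of gradient compression schemes in which each node quantizes its gradient update before communicating it.
It uses randomized quantization with a controlled number of bits per gradient component, which yields a communication cost that is sublinear in the model dimension.
It incorporates a compression operator that maps each gradient onto a finite set of quantized vectors, aiming to preserve the directional information that convergence depends on.
It establishes theoretical convergence guarantees under standard assumptions such as bounded gradients and Lipschitz continuity.
Users can adjust the number of bits per gradient component to balance communication cost against convergence speed.
It supports both symmetric and asymmetric quantization and provides theoretical bounds on the error introduced by quantization.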
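
The randomized quantization step described above can be illustrated with a minimal sketch. This review does not spell out the quantization formula, so the version below is an assumption: it scales each component by the gradient's L2 norm and rounds it stochastically to one of `s` uniform levels, chosen so that the result equals the original gradient in expectation. The names `qsgd_quantize`, `s`, and `rng` are hypothetical, and the bit-level encoding of the quantized vector is left out.

```python
import numpy as np

def qsgd_quantize(v, s, rng):
    """Stochastically round each |v_i| / ||v||_2 to one of the levels 0, 1/s, ..., 1.

    Illustrative sketch only: assumes L2-norm scaling and s uniform levels.
    """
    norm = np.linalg.norm(v)
    if norm == 0.0:
        return np.zeros_like(v)
    scaled = np.abs(v) / norm * s      # each entry lies in [0, s]
    lower = np.floor(scaled)           # index of the level just below
    prob_up = scaled - lower           # probability of rounding up to the next level
    levels = lower + (rng.random(v.shape) < prob_up)
    return norm * np.sign(v) * levels / s

rng = np.random.default_rng(0)
g = rng.standard_normal(8)             # stand-in for a gradient vector
q = qsgd_quantize(g, s=4, rng=rng)     # one quantized copy of g

# Averaging many independent quantizations should recover g (unbiasedness).
mean_q = np.mean([qsgd_quantize(g, 4, rng) for _ in range(20000)], axis=0)
print(np.max(np.abs(mean_q - g)))
```

A larger `s` means more levels, hence more bits per component and less quantization noise, which is the communication versus convergence trade-off the list above refers to.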

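Experimental Results

Research Questions

RQ1: Can gradient quantization, used to reduce communication cost in distributed SGD, guarantee convergence?
RQ2: What is the minimum number of bits per gradient component needed to maintain convergence and model accuracy?
RQ3: Can QSGD achieve asymptotically optimal communication cost in distributed training?
RQ4: Does quantization shorten end-to-end training time despite the reduced precision of the gradients?
RQ5: Under what conditions does gradient quantization preserve or improve model accuracy?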

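Key Findings

QSGD achieves a communication cost that is sublinear in the model dimension, enabling scalable distributed training.
On 16 GPUs, QSGD trains ResNet-152 on ImageNet to full accuracy 1.8x faster than standard uncompressed SGD.
Quantization did not degrade performance. In some settings, test accuracy was preserved or slightly improved across multiple architectures.
QSGD achieves asymptotically optimal communication cost; that is, in the large-model limit its cost matches the best achievable up to constant factors.
The experiments show marked reductions in end-to-end training time on both image classification and speech recognition tasks.
The QSGD framework works robustly across a range of deep learning models, including ResNet and models for automated speech recognition.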
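
One way to build intuition for the sublinear communication cost is to look at the coarsest setting of a norm-scaled stochastic quantizer. As an assumption for illustration, suppose each component is kept with probability |v_i| / ||v||_2 and zeroed otherwise. The expected number of surviving components is then ||v||_1 / ||v||_2, which is at most sqrt(n) by the Cauchy-Schwarz inequality. The sketch below only computes that quantity; the helper name `expected_nonzeros` is hypothetical, and the encoding and bounds used in the paper are not reproduced here.

```python
import numpy as np

def expected_nonzeros(v):
    """Expected count of nonzero entries when each |v_i| / ||v||_2 is rounded to 0 or 1."""
    norm = np.linalg.norm(v)
    if norm == 0.0:
        return 0.0
    # Component i survives with probability |v_i| / ||v||_2, so the expectation
    # is the ratio of the L1 norm to the L2 norm.
    return float(np.sum(np.abs(v)) / norm)

rng = np.random.default_rng(0)
for n in (100, 10_000, 1_000_000):
    v = rng.standard_normal(n)         # stand-in for a gradient of dimension n
    print(n, round(expected_nonzeros(v)), round(float(np.sqrt(n))))
```

Because only the surviving components and their positions need to be sent, the amount of data grows roughly with sqrt(n) rather than n in this setting.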


This review was generated by AI and checked by a human editor.