QUICK REVIEW

[Paper Review] High-Accuracy Low-Precision Training

Christopher De, Megan Leszczynski|arXiv (Cornell University)|Mar 9, 2018
Medical Imaging and Analysis | References: 20 | Citations: 74
One-line summary

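HALP combines SVRG with bit centering to train to high accuracy at a fixed low precision. On the CPU its convergence matches full-precision SVRG while running faster, with reported speedups of 3-4×, and it also performs well on validation for deep learning tasks.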

ABSTRACT

Low-precision computation is often used to lower the time and energy cost of machine learning, and recently hardware accelerators have been developed to support it. Still, it has been used primarily for inference - not training. Previous low-precision training algorithms suffered from a fundamental tradeoff: as the number of bits of precision is lowered, quantization noise is added to the model, which limits statistical accuracy. To address this issue, we describe a simple low-precision stochastic gradient descent variant called HALP. HALP converges at the same theoretical rate as full-precision algorithms despite the noise introduced by using low precision throughout execution. The key idea is to use SVRG to reduce gradient variance, and to combine this with a novel technique called bit centering to reduce quantization error. We show that on the CPU, HALP can run up to $4 \times$ faster than full-precision SVRG and can match its convergence trajectory. We implemented HALP in TensorQuant, and show that it exceeds the validation performance of plain low-precision SGD on two deep learning tasks.

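Motivation and objectives

  • Motivates low-precision training as a way to cut the time and energy cost of model training.
  • Develops an algorithm that, at a fixed bit width, maintains or reaches accuracy close to full precision.
  • Analyzes how to mitigate quantization noise and gradient variance in low-precision training.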

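Proposed method

  • Proposes LP-SVRG, a low-precision SVRG variant without bit centering, whose convergence is limited by quantization.
  • Introduces HALP, which applies bit centering: the low-precision representation is dynamically re-centered and re-scaled so that quantization noise shrinks as optimization progresses.
  • Proves that HALP retains the linear convergence of SVRG down to arbitrarily high accuracy while using a fixed-bit representation.
  • Provides a practical implementation for linear models, showing how to compute gradients and updates in low precision.
  • Implements and evaluates HALP in TensorQuant, comparing it with LP-SVRG and LP-SGD on deep learning and logistic regression tasks.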
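To make the bit-centering idea concrete, here is a minimal NumPy sketch for least-squares regression. It is an illustration under stated assumptions, not the authors' implementation: the function names `quantize` and `low_precision_svrg`, the synthetic data, the step size, and the choice to quantize only the stored model offset (gradients are computed in full precision here) are all assumptions made for this example. The scale rule, which sizes the grid from the full-gradient norm and a strong-convexity constant `mu`, is likewise a simplified reading of the re-centering and re-scaling step.

```python
import numpy as np

def quantize(x, delta, bits, rng):
    """Stochastically round x onto the grid delta * {-2^(bits-1), ..., 2^(bits-1) - 1}."""
    lo, hi = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    scaled = x / delta
    low = np.floor(scaled)
    # Round up with probability equal to the fractional part, so rounding is unbiased.
    rounded = low + (rng.random(x.shape) < scaled - low)
    return delta * np.clip(rounded, lo, hi)

def low_precision_svrg(A, y, alpha, bits, epochs, T, rng, mu=None, fixed_delta=None):
    """SVRG on least squares with a quantized model (hypothetical helper).

    Pass `mu` for HALP-style bit centering, or `fixed_delta` for an
    LP-SVRG-style grid that never moves or rescales.
    """
    n, d = A.shape
    w_tilde = np.zeros(d)
    for _ in range(epochs):
        g_full = A.T @ (A @ w_tilde - y) / n  # full gradient, full precision
        if fixed_delta is None:
            # Bit centering (assumed rule): center the grid on w_tilde and scale it so
            # the ball of radius ||g_full|| / mu, which contains the optimum of a
            # mu-strongly-convex objective, stays representable.
            center = w_tilde
            delta = np.linalg.norm(g_full) / (mu * (2 ** (bits - 1) - 1))
            if delta == 0.0:
                break  # gradient is exactly zero: already at the optimum
        else:
            center, delta = np.zeros(d), fixed_delta
        z = w_tilde - center  # low-precision offset from the grid center
        for _ in range(T):
            i = rng.integers(n)
            a = A[i]
            w = center + z
            # SVRG variance-reduced gradient for f_i(w) = 0.5 * (a @ w - y_i)^2
            g = a * (a @ w - y[i]) - a * (a @ w_tilde - y[i]) + g_full
            z = quantize(z - alpha * g, delta, bits, rng)
        w_tilde = center + z
    return w_tilde

# Synthetic, noise-free problem (assumed for illustration only).
rng = np.random.default_rng(0)
n, d = 200, 5
A = rng.standard_normal((n, d))
w_star = np.array([0.35, -1.25, 0.75, 2.05, -0.45])  # deliberately off the 0.1 grid
y = A @ w_star
mu = np.linalg.eigvalsh(A.T @ A / n)[0]  # smallest eigenvalue = strong-convexity constant

w_halp = low_precision_svrg(A, y, 0.02, 8, 15, 2 * n, rng, mu=mu)
w_lp = low_precision_svrg(A, y, 0.02, 8, 15, 2 * n, rng, fixed_delta=0.1)
err_halp = np.linalg.norm(w_halp - w_star)
err_lp = np.linalg.norm(w_lp - w_star)
print(f"bit-centered error: {err_halp:.2e}, fixed-grid error: {err_lp:.2e}")
```

Both runs use the same 8 bits. The fixed grid cannot represent `w_star`, so its error stalls at roughly the grid spacing, while the bit-centered grid shrinks with the gradient norm and keeps improving. That is the qualitative difference between LP-SVRG and HALP described in the list above.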
Figure 1: Linear regression on a synthetic dataset with 100 features and 1000 examples generated by scikit-learn’s make_regression generator [22]. The epoch length was set to $T=2000$, twice the number of examples, and the learning rates $\alpha$ and scale factors $\delta$ were chosen using grid

Experimental results

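Research questions

  • RQ1: On strongly convex problems, can a low-precision training algorithm converge at the same rate as full-precision SVRG?
  • RQ2: Does bit centering let HALP reach arbitrarily high accuracy with fixed-bit low-precision arithmetic?
  • RQ3: On real tasks, what is the practical throughput-accuracy tradeoff between HALP and standard low-precision SGD/SVRG?
  • RQ4: How do LP-SVRG and HALP perform in terms of training loss and validation accuracy on deep learning models and logistic regression?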

Key findings

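| Algorithm | Overall running time | # FP ops | # LP ops | # LP bits |
|---|---|---|---|---|
| SGD | $O(\kappa \log(1/\epsilon)/\epsilon)$ | $O(\kappa/\epsilon)$ | | |
| SVRG | $O((N+\kappa) \log^2(1/\epsilon))$ | $O((N+\kappa) \log(1/\epsilon))$ | | |
| LP-SVRG | $O((N+\kappa) \log^2(1/\epsilon))$ | $O(N \log(1/\epsilon))$ | $O(\kappa \log(1/\epsilon))$ | |
| HALP | $O(N \log^2(1/\epsilon) + \kappa \log(\kappa) \log(1/\epsilon))$ | $O(N \log(1/\epsilon))$ | $O(\kappa \log(1/\epsilon))$ | $2 \log(O(\kappa))$ |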
  • LP-SVRG converges linearly to a precision-limited neighborhood whose size is set by the quantization scale factor $\delta$; it matches SVRG until it hits this accuracy floor.
  • HALP achieves linear convergence down to arbitrarily high accuracy by using bit centering to shrink quantization noise as optimization proceeds.
  • On CPU, HALP runs up to 3× faster than plain SVRG on MNIST and up to 4× faster on a synthetic 10k-feature dataset, while matching or exceeding SVRG validation performance on deep models.
  • In deep learning experiments, 8-bit HALP closely matches full-precision SVRG training loss for CNNs and LSTMs, and often matches or improves validation metrics relative to LP-SVRG/LP-SGD.
  • On multi-class logistic regression tasks, HALP is more accurate than LP-SVRG and LP-SGD while achieving up to 4× faster iterations; per epoch, HALP stays within 25% of LP-SGD.
Figure 2: A diagram of the bit scaling operation in HALP. As the algorithm converges, we are able to bound the solution within a smaller and smaller ball. Periodically, we re-center the points that our low-precision model can represent so they are centered on this ball, and we re-scale the points so
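The re-centering and re-scaling shown in Figure 2 can be pictured in one dimension with a short sketch. The helper `representable_points` and the specific centers, scales, and target value are hypothetical and chosen only for illustration: with the same number of bits, a grid that is moved onto the current estimate and given a smaller spacing covers a smaller interval but lands much closer to a nearby solution.

```python
import numpy as np

def representable_points(center, delta, bits):
    """Values center + delta * k for the signed b-bit integers k (hypothetical helper)."""
    k = np.arange(-(2 ** (bits - 1)), 2 ** (bits - 1))
    return center + delta * k

target = 0.37  # assumed location of the solution

# Early in training: a wide 4-bit grid centered at the origin.
coarse = representable_points(center=0.0, delta=0.1, bits=4)
# Later: the same 4 bits, re-centered near the solution and re-scaled.
fine = representable_points(center=0.35, delta=0.01, bits=4)

coarse_gap = np.min(np.abs(coarse - target))
fine_gap = np.min(np.abs(fine - target))
print(len(coarse), len(fine), coarse_gap, fine_gap)
```

Both grids hold the same 16 points; only the center and spacing change, which is why the bit width can stay fixed while the reachable accuracy improves.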

This review was generated by AI and checked by a human editor.