QUICK REVIEW

[Paper Review] A Survey of Quantization Methods for Efficient Neural Network Inference

Amir Gholami, Sehoon Kim|arXiv (Cornell University)|Mar 25, 2021

Neural Networks and Applications | Cited by 36

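One-Line Summary

This survey reviews quantization methods for neural network inference, detailing uniform and non-uniform quantization, calibration strategies, granularity, fine-tuning methods, and hardware implications. It highlights the trade-offs among accuracy, efficiency, and deployability across hardware platforms.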

ABSTRACT

As soon as abstract mathematical computations were adapted to computation on digital computers, the problem of efficient representation, manipulation, and communication of the numerical values in those computations arose. Strongly related to the problem of numerical representation is the problem of quantization: in what manner should a set of continuous real-valued numbers be distributed over a fixed discrete set of numbers to minimize the number of bits required and also to maximize the accuracy of the attendant computations? This perennial problem of quantization is particularly relevant whenever memory and/or computational resources are severely restricted, and it has come to the forefront in recent years due to the remarkable performance of Neural Network models in computer vision, natural language processing, and related areas. Moving from floating-point representations to low-precision fixed integer values represented in four bits or less holds the potential to reduce the memory footprint and latency by a factor of 16x; and, in fact, reductions of 4x to 8x are often realized in practice in these applications. Thus, it is not surprising that quantization has emerged recently as an important and very active sub-area of research in the efficient implementation of computations associated with Neural Networks. In this article, we survey approaches to the problem of quantizing the numerical values in deep Neural Network computations, covering the advantages/disadvantages of current methods. With this survey and its organization, we hope to have presented a useful snapshot of the current research in quantization for Neural Networks and to have given an intelligent organization to ease the evaluation of future research in this area.

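Motivation and Objectives

Summarize the historical background and basic concepts of quantization as applied to neural networks.
Characterize the main quantization methods (uniform vs. non-uniform, symmetric vs. asymmetric) and their trade-offs.
Explain calibration of activations and weights, granularity, and dynamic vs. static approaches.
Discuss fine-tuning strategies with quantization (QAT vs. PTQ) and gradient handling.
Highlight hardware implications and practical considerations for edge deployment.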

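Methods

Define the quantization operator and the mapping from floating-point to low-precision values (e.g., Q(r) and dequantization).
Distinguish symmetric from asymmetric quantization, and full-range from restricted-range quantization, including how the zero point is handled.
Describe static and dynamic calibration of activations and their effect on accuracy and overhead.
Describe the layerwise, groupwise, channelwise, and sub-channelwise granularity options and their implications.
Summarize uniform vs. non-uniform quantization, including learnable and non-learnable quantizers and optimization-based approaches.
Present quantization-aware training (QAT), including quantization with the straight-through estimator (STE), non-STE methods, and learned clipping ranges (PACT, LSQ, LSQ+).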
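The quantization operator and the symmetric/asymmetric distinction listed above can be illustrated with a minimal sketch. It assumes the common uniform form Q(r) = round(r / S) - Z with dequantization S * (Q(r) + Z), and takes the clipping range directly from the minimum and maximum of the data. The function names (`make_quantizer`, `quantize`, `dequantize`) and the sample values are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
def make_quantizer(r_min, r_max, bits=8, symmetric=False):
    """Return (scale, zero_point, q_min, q_max) for a uniform quantizer.

    Assumes r_min < r_max and, in the symmetric case, a nonzero range.
    """
    if symmetric:
        # Symmetric, restricted range: clip to [-a, a], zero point fixed at 0,
        # integers in [-(2^(b-1) - 1), 2^(b-1) - 1].
        a = max(abs(r_min), abs(r_max))
        q_max = 2 ** (bits - 1) - 1
        return a / q_max, 0, -q_max, q_max
    # Asymmetric: clip to [r_min, r_max], integers in [0, 2^b - 1].
    scale = (r_max - r_min) / (2 ** bits - 1)
    zero_point = round(r_min / scale)  # chosen so that r_min maps to 0
    return scale, zero_point, 0, 2 ** bits - 1


def quantize(values, scale, zero_point, q_min, q_max):
    """Q(r) = round(r / S) - Z, clamped to the integer range."""
    return [min(max(round(v / scale) - zero_point, q_min), q_max) for v in values]


def dequantize(q_values, scale, zero_point):
    """Approximate recovery of the real values: S * (q + Z)."""
    return [scale * (q + zero_point) for q in q_values]


# Hypothetical activation values with a range skewed toward positive numbers.
activations = [-0.5, 0.0, 0.3, 1.5]
for symmetric in (False, True):
    s, z, lo, hi = make_quantizer(min(activations), max(activations), 8, symmetric)
    q = quantize(activations, s, z, lo, hi)
    print("symmetric" if symmetric else "asymmetric", q, dequantize(q, s, z))
```

With a skewed range like this one, the asymmetric variant uses all 256 integer levels, while the symmetric variant leaves part of the negative range unused in exchange for a zero point of 0, which simplifies the integer arithmetic.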

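Experimental Results

Research Questions

RQ1: What are the main quantization strategies for neural network inference, and what accuracy-efficiency trade-offs does each carry?
RQ2: How do calibration, granularity, and quantization type (uniform vs. non-uniform) affect performance on real models and hardware?
RQ3: What are effective fine-tuning strategies (QAT vs. PTQ) and gradient-handling approaches for quantized networks?
RQ4: How do hardware considerations affect practical quantization choices for edge devices?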

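Key Findings

Uniform quantization is the de facto standard because of its simplicity and hardware efficiency, although non-uniform quantization can offer accuracy gains in some cases.
Channelwise quantization improves weight resolution and accuracy; layerwise quantization can degrade model performance.
Dynamic calibration of activation ranges yields higher accuracy but incurs runtime overhead; static calibration is cheaper but generally less accurate.
QAT with STE is the dominant approach to training with quantization; alternative non-STE methods and learnable clipping ranges are also promising.
Non-uniform quantization can capture the value distribution better but is hard to deploy on general-purpose hardware; the paper emphasizes the practicality of uniform approaches for deployment.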
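The granularity finding above can be made concrete with a small sketch that compares reconstruction error when one scale is shared across a whole layer versus one scale per channel. It assumes symmetric 4-bit quantization and treats each row of a toy weight matrix as one output channel. The helper names and the weight values are hypothetical, picked so that the two channels have very different ranges; they are not results from the paper.

```python
def symmetric_roundtrip(values, scale, q_max):
    """Quantize to integers in [-q_max, q_max] with one scale, then dequantize."""
    return [scale * min(max(round(v / scale), -q_max), q_max) for v in values]


def mean_squared_error(original, restored):
    return sum((o - r) ** 2 for o, r in zip(original, restored)) / len(original)


def layerwise_vs_channelwise(weights, bits=4):
    """Return (layerwise_mse, channelwise_mse); rows are treated as channels.

    Assumes every row contains at least one nonzero weight.
    """
    q_max = 2 ** (bits - 1) - 1
    flat = [w for row in weights for w in row]

    # Layerwise: one scale shared by every channel in the layer.
    layer_scale = max(abs(w) for w in flat) / q_max
    layer_restored = symmetric_roundtrip(flat, layer_scale, q_max)

    # Channelwise: each row gets a scale fitted to its own range.
    channel_restored = []
    for row in weights:
        row_scale = max(abs(w) for w in row) / q_max
        channel_restored.extend(symmetric_roundtrip(row, row_scale, q_max))

    return (mean_squared_error(flat, layer_restored),
            mean_squared_error(flat, channel_restored))


# Illustrative weights: channel 0 has a much narrower range than channel 1.
weights = [
    [0.010, -0.020, 0.015, -0.005],
    [1.000, -2.000, 1.500, -0.500],
]
layer_mse, channel_mse = layerwise_vs_channelwise(weights)
print(f"layerwise MSE:   {layer_mse:.6f}")
print(f"channelwise MSE: {channel_mse:.6f}")
```

In this toy case the shared scale is dictated by the wide channel, so every weight in the narrow channel rounds to zero; a per-channel scale avoids that at the cost of storing one scale per channel.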


This review was generated by AI and checked by a human editor.