QUICK REVIEW

[Paper Review] Neural Network Quantization with AI Model Efficiency Toolkit (AIMET)

Sangeetha Siddegowda, Marios Fournarakis|arXiv (Cornell University)|Jan 20, 2022

Neural Networks and Applications|Cited by 24

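One-Line Summary

This paper introduces AIMET, Qualcomm's open-source toolkit for quantizing neural networks to 8-bit fixed point to enable low-latency, energy-efficient inference, and provides PTQ and QAT workflows along with practical guidance.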

ABSTRACT

While neural networks have advanced the frontiers in many machine learning applications, they often come at a high computational cost. Reducing the power and latency of neural network inference is vital to integrating modern networks into edge devices with strict power and compute requirements. Neural network quantization is one of the most effective ways of achieving these savings, but the additional noise it induces can lead to accuracy degradation. In this white paper, we present an overview of neural network quantization using AI Model Efficiency Toolkit (AIMET). AIMET is a library of state-of-the-art quantization and compression algorithms designed to ease the effort required for model optimization and thus drive the broader AI ecosystem towards low latency and energy-efficient inference. AIMET provides users with the ability to simulate as well as optimize PyTorch and TensorFlow models. Specifically for quantization, AIMET includes various post-training quantization (PTQ, cf. chapter 4) and quantization-aware training (QAT, cf. chapter 5) techniques that guarantee near floating-point accuracy for 8-bit fixed-point inference. We provide a practical guide to quantization via AIMET by covering PTQ and QAT workflows, code examples and practical tips that enable users to efficiently and effectively quantize models using AIMET and reap the benefits of low-bit integer inference.

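Motivation and Objectives

Motivates the need for quantization to reduce power consumption and latency on edge devices.
Presents AIMET as a library of state-of-the-art quantization and compression techniques.
Provides practical PTQ and QAT workflows and code examples for achieving near floating-point accuracy with 8-bit inference.
Explains the hardware background and practical considerations of quantization on fixed-point accelerators.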

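Proposed Method

Describes uniform affine (asymmetric) quantization and per-tensor versus per-channel granularity.
Describes the AIMET quantization simulation workflow for PyTorch/TensorFlow models and how encodings are computed.
Details the standard PTQ pipeline, including cross-layer equalization, range setting, bias correction, and AdaRound.
Outlines the QAT workflow and batch normalization (BN) folding, with associated results and considerations.
Discusses exporting encodings for target hardware and configuring simulated operations via a JSON configuration file.
Provides API-style code examples demonstrating quantization simulation, batch normalization folding, and encoding export.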
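
The uniform affine (asymmetric) quantization named in the first item above can be sketched in plain Python. This is a minimal illustration and not AIMET's API: the names `compute_encoding`, `quantize`, and `dequantize` are hypothetical, and the range is taken from a simple min/max of the data, which is only one possible range-setting choice.

```python
def compute_encoding(x_min, x_max, bitwidth=8):
    """Return (scale, zero_point) for an unsigned asymmetric integer grid."""
    # Extend the range to include 0.0 so that real zero maps to an exact integer.
    x_min = min(x_min, 0.0)
    x_max = max(x_max, 0.0)
    q_max = 2 ** bitwidth - 1
    value_range = x_max - x_min
    # Assumption: fall back to a unit scale when the range is degenerate.
    scale = value_range / q_max if value_range > 0 else 1.0
    zero_point = round(-x_min / scale)
    return scale, zero_point


def quantize(x, scale, zero_point, bitwidth=8):
    """Map a real value to an integer in [0, 2**bitwidth - 1]."""
    q_max = 2 ** bitwidth - 1
    # Python's round() rounds halves to the nearest even integer.
    q = round(x / scale) + zero_point
    return max(0, min(q_max, q))


def dequantize(q, scale, zero_point):
    """Map an integer grid point back to a real value."""
    return scale * (q - zero_point)


weights = [-1.0, -0.25, 0.0, 0.5, 1.5]
scale, zero_point = compute_encoding(min(weights), max(weights))
quantized = [quantize(w, scale, zero_point) for w in weights]
recovered = [dequantize(q, scale, zero_point) for q in quantized]
```

For values inside the encoded range, the round trip through `quantize` and `dequantize` differs from the input by at most half a quantization step (`scale / 2`), and zero is reproduced exactly.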

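Experimental Results

Research Questions

RQ1 Which quantization strategies (PTQ and QAT) can achieve near floating-point accuracy with 8-bit inference?
RQ2 How do different quantization schemes, such as symmetric vs. asymmetric and per-tensor vs. per-channel, affect accuracy and hardware efficiency?
RQ3 Which practical steps (CLE, range setting, AdaRound, bias correction) yield robust quantization on common models?
RQ4 How does AIMET simulate on-device quantization and export encodings for on-target runtimes?
RQ5 Which workflows and settings best reflect the specific hardware constraints of quantized inference?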
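
To make the per-tensor vs. per-channel question in RQ2 concrete, the sketch below compares the two granularities on a toy weight matrix whose two output channels have very different ranges. It is an illustration under stated assumptions, not the paper's experiment: the weights are made up, the helper names are hypothetical, and symmetric signed 8-bit quantization with a max-abs range is assumed.

```python
def symmetric_scale(values, bitwidth=8):
    """Scale for symmetric signed quantization based on the max absolute value."""
    max_abs = max(abs(v) for v in values)
    q_max = 2 ** (bitwidth - 1) - 1  # 127 for 8 bits
    return max_abs / q_max if max_abs > 0 else 1.0


def fake_quantize(values, scale, bitwidth=8):
    """Quantize to the signed integer grid and map straight back to real values."""
    q_max = 2 ** (bitwidth - 1) - 1
    q_min = -q_max - 1
    return [scale * max(q_min, min(q_max, round(v / scale))) for v in values]


def mean_squared_error(reference, approximation):
    """Mean squared error over two equally shaped lists of rows."""
    pairs = [
        (r, a)
        for ref_row, app_row in zip(reference, approximation)
        for r, a in zip(ref_row, app_row)
    ]
    return sum((r - a) ** 2 for r, a in pairs) / len(pairs)


# Hypothetical weights: one row per output channel.
weights = [
    [0.9, -1.2, 0.4, 1.0],        # wide-range channel
    [0.01, -0.02, 0.015, 0.005],  # narrow-range channel
]

# Per-tensor: a single scale shared by every channel.
flat = [w for channel in weights for w in channel]
per_tensor_scale = symmetric_scale(flat)
per_tensor = [fake_quantize(channel, per_tensor_scale) for channel in weights]

# Per-channel: each output channel gets its own scale.
per_channel = [fake_quantize(channel, symmetric_scale(channel)) for channel in weights]

per_tensor_mse = mean_squared_error(weights, per_tensor)
per_channel_mse = mean_squared_error(weights, per_channel)
```

In this toy case the wide-range channel dictates the shared per-tensor scale, so the narrow-range channel is represented by only a handful of grid points. Giving that channel its own scale lowers the overall error, at the cost of requiring hardware that supports per-channel scales.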

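Key Findings

AIMET enables PTQ and QAT workflows that can achieve near floating-point accuracy for 8-bit inference.
Cross-layer equalization and batch normalization folding improve quantization performance, most notably for depthwise/separable architectures.
Per-channel weight quantization can improve accuracy where the hardware supports it, whereas per-tensor quantization is widely supported.
AdaRound and bias correction are effective at improving accuracy and enable low-bit weight quantization within PTQ.
Quantization encodings can be exported and consumed by on-target runtimes, supporting deployment without re-quantization.
AIMET provides configurable, framework-agnostic quantization simulation integrated with PyTorch and TensorFlow pipelines.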
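
The batch normalization folding mentioned in the findings above can be sketched for a single linear layer followed by per-output batch normalization at inference time. This is a generic illustration of the standard folding identity, not AIMET's implementation; the function names, the toy parameters, and the `eps` default are assumptions.

```python
import math


def linear(x, weight, bias):
    """y = W x + b, with one weight row per output feature."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def batch_norm(y, gamma, beta, mean, var, eps=1e-5):
    """Inference-time batch normalization using fixed running statistics."""
    return [
        g * (yi - m) / math.sqrt(v + eps) + b
        for yi, g, b, m, v in zip(y, gamma, beta, mean, var)
    ]


def fold_batch_norm(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Absorb the batch norm parameters into the preceding layer's weight and bias."""
    folded_weight, folded_bias = [], []
    for row, b, g, bt, m, v in zip(weight, bias, gamma, beta, mean, var):
        s = g / math.sqrt(v + eps)  # per-output scaling factor
        folded_weight.append([w * s for w in row])
        folded_bias.append((b - m) * s + bt)
    return folded_weight, folded_bias


# Hypothetical parameters: 3 inputs, 2 outputs.
weight = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.75]]
bias = [0.1, -0.2]
gamma, beta = [1.2, 0.8], [0.05, -0.1]
mean, var = [0.3, -0.4], [0.9, 1.6]

x = [1.0, -2.0, 0.5]
reference = batch_norm(linear(x, weight, bias), gamma, beta, mean, var)
folded_weight, folded_bias = fold_batch_norm(weight, bias, gamma, beta, mean, var)
folded = linear(x, folded_weight, folded_bias)
```

Up to floating-point rounding, `folded` matches `reference`, so the separate normalization step disappears at inference and the quantizer sees the weights that are actually deployed.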

This review was generated by AI and checked by a human editor.