QUICK REVIEW

[Paper Review] Optimal Brain Compression: A Framework for Accurate Post-Training Quantization and Pruning

Elias Frantar, Sidak Pal Singh | arXiv (Cornell University) | Aug 24, 2022

Medical Imaging Techniques and Applications | Cited by 36

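One-line summary

Introduces the Optimal Brain Compressor (OBC), a unified and exact OBS-based framework for one-shot post-training pruning and quantization that improves accuracy under compression and supports joint pruning and quantization.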

ABSTRACT

We consider the problem of model compression for deep neural networks (DNNs) in the challenging one-shot/post-training setting, in which we are given an accurate trained model, and must compress it without any retraining, based only on a small amount of calibration input data. This problem has become popular in view of the emerging software and hardware support for executing models compressed via pruning and/or quantization with speedup, and well-performing solutions have been proposed independently for both compression approaches. In this paper, we introduce a new compression framework which covers both weight pruning and quantization in a unified setting, is time- and space-efficient, and considerably improves upon the practical performance of existing post-training methods. At the technical level, our approach is based on an exact and efficient realization of the classical Optimal Brain Surgeon (OBS) framework of [LeCun, Denker, and Solla, 1990] extended to also cover weight quantization at the scale of modern DNNs. From the practical perspective, our experimental results show that it can improve significantly upon the compression-accuracy trade-offs of existing post-training methods, and that it can enable the accurate compound application of both pruning and quantization in a post-training setting.

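Motivation and objectives

Address the challenge of compressing a trained DNN in the one-shot post-training setting, using only a small amount of calibration data.
Develop a unified framework that handles both pruning and quantization efficiently and accurately.
Provide an exact OBS-based method for modern networks with practical time and space efficiency.
Combine pruning and quantization to obtain large speedups with minimal accuracy loss.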

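Proposed method

Formulates layer-wise compression as minimizing the change in each layer's output subject to a compression constraint.
Applies the Optimal Brain Surgeon (OBS) framework to the squared-error layer-wise loss, yielding an exact greedy pruning algorithm.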
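Develops ExactOBS, which prunes one weight at a time using per-row Hessians and row/column-elimination updates of the inverse Hessian; each pruning step costs O(d_col^2) time, for O(d_row * d_col^3) total time and O(d_col^2) memory per layer.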
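Extends OBS to weight quantization as the Optimal Brain Quantizer (OBQ): the next weight to quantize is selected by its impact on the loss, and a closed-form update is applied to the remaining weights.
Unifies pruning and quantization as the Optimal Brain Compressor (OBC), with practical extensions to N:M and block sparsity and an optional group-update mode.
Provides an efficient, exact implementation and a public repository for reproducibility.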
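
The greedy row-wise pruning loop described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `exact_obs_prune_row`, the `damp` ridge term, and the random calibration data are all assumptions made for the example. It assumes the layer computes W @ X, so that every row shares the Hessian H = 2 X X^T.

```python
import numpy as np


def exact_obs_prune_row(w, H, k, damp=1e-8):
    """Greedily prune k weights from one weight row (illustrative sketch).

    w: (d_col,) weight row; H: (d_col, d_col) layer-wise Hessian 2 * X @ X.T.
    damp is a hypothetical ridge term, added only to keep H invertible.
    """
    w = np.asarray(w, dtype=np.float64).copy()
    d = w.shape[0]
    Hinv = np.linalg.inv(np.asarray(H, dtype=np.float64) + damp * np.eye(d))
    pruned = np.zeros(d, dtype=bool)
    for _ in range(k):
        diag = np.diag(Hinv).copy()
        diag[pruned] = 1.0          # placeholder; these entries are masked below
        scores = w ** 2 / diag      # proportional to the OBS loss increase
        scores[pruned] = np.inf
        p = int(np.argmin(scores))
        # Closed-form compensation of the remaining weights; drives w[p] to zero.
        w -= (w[p] / Hinv[p, p]) * Hinv[:, p]
        # Eliminate row/column p from the inverse Hessian in O(d_col^2).
        Hinv -= np.outer(Hinv[:, p], Hinv[p, :]) / Hinv[p, p]
        Hinv[p, :] = 0.0
        Hinv[:, p] = 0.0
        pruned[p] = True
        w[p] = 0.0
    return w, pruned


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((8, 64))   # calibration inputs: (d_col, n_samples)
    W = rng.standard_normal((4, 8))    # layer weights: (d_row, d_col)
    H = 2.0 * X @ X.T                  # shared by all rows of the layer
    W_hat = np.stack([exact_obs_prune_row(row, H, k=4)[0] for row in W])
    print("squared output error:", np.linalg.norm(W @ X - W_hat @ X) ** 2)
```

Because the layer-wise loss is quadratic, the weights that survive this loop coincide with the least-squares refit on the kept columns, which makes a convenient sanity check for the sketch.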

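Experimental results

Research questions

RQ1: Can a single exact OBS-based framework be applied effectively to both pruning and quantization in the post-training setting?
RQ2: Does layer-wise one-shot compression deliver competitive accuracy under practical FLOP/latency constraints without retraining?
RQ3: Can pruning and quantization be combined to achieve larger speedups with minimal accuracy loss on both GPUs and CPUs?
RQ4: How can second-order information be computed and updated efficiently enough to enable exact pruning and quantization at DNN scale?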
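
For reference, the second-order quantities that RQ4 refers to can be written out for a single weight row w with calibration inputs X. The block below is a sketch of the standard OBS formulation applied to the layer-wise squared error; the notation (w, X, H, p, and the subscript -p for "row and column p removed") is chosen here for illustration and is not taken from this review.

```latex
\begin{aligned}
\hat{W} &= \arg\min_{\hat{W}} \lVert W X - \hat{W} X \rVert_2^2
  \quad \text{subject to the compression constraint}, \\
H &= 2 X X^{\top}, \\
p &= \arg\min_{p} \frac{w_p^2}{[H^{-1}]_{pp}}, \qquad
\delta_p = -\frac{w_p}{[H^{-1}]_{pp}} \, H^{-1}_{:,p}, \\
H^{-1}_{-p} &= \Bigl( H^{-1} - \frac{1}{[H^{-1}]_{pp}} \, H^{-1}_{:,p} H^{-1}_{p,:} \Bigr)_{-p}.
\end{aligned}
```

The last line is what keeps each step at O(d_col^2): the inverse Hessian of the remaining weights is obtained by one rank-one correction instead of a fresh inversion. For quantization, the same expressions apply with w_p replaced by w_p - quant(w_p).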

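Key findings

ExactOBS computes the exact greedy pruning solution for layers at the scale of modern DNNs on a single GPU, at greatly reduced computational cost.
The OBS-based approach extends to quantization by iteratively quantizing one weight at a time, yielding the Optimal Brain Quantizer (OBQ), which is unified with pruning in the Optimal Brain Compressor (OBC).
OBC achieves state-of-the-art compression-accuracy trade-offs for post-training pruning and quantization across image classification, object detection, and language modeling tasks.
Compound compression (pruning plus quantization) delivers substantial speedups: for example, a 12x reduction in theoretical operations at a 2% accuracy loss in a GPU scenario, and a 4x speedup in actual runtime at a 1% loss in a CPU scenario.
In the post-training setting, layer-wise compression with calibration data approaches or surpasses the results of more expensive global optimization methods.
The framework supports non-uniform compression and practical sparsity patterns such as N:M and block sparsity, and adapts to per-layer constraints through a DP-based layer-wise solver.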
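
The one-weight-at-a-time quantization mentioned in the findings can be illustrated with a short NumPy sketch. It is a simplified reading of the idea, not the paper's code: the symmetric uniform grid, the names `quantize_to_grid` and `obq_quantize_row`, and the `damp` term are assumptions introduced for this example, and any further heuristics of the actual method are omitted.

```python
import numpy as np


def quantize_to_grid(x, scale, maxq):
    """Round to the nearest point of a symmetric uniform grid {-maxq..maxq} * scale."""
    return np.clip(np.round(x / scale), -maxq, maxq) * scale


def obq_quantize_row(w, H, scale, maxq, damp=1e-8):
    """Quantize one weight row greedily, one weight per step (illustrative sketch).

    H is the layer-wise Hessian 2 * X @ X.T; damp is a hypothetical ridge term.
    """
    w = np.asarray(w, dtype=np.float64).copy()
    d = w.shape[0]
    Hinv = np.linalg.inv(np.asarray(H, dtype=np.float64) + damp * np.eye(d))
    done = np.zeros(d, dtype=bool)
    for _ in range(d):
        q = quantize_to_grid(w, scale, maxq)
        diag = np.diag(Hinv).copy()
        diag[done] = 1.0                 # placeholder; masked out below
        scores = (q - w) ** 2 / diag     # loss impact of quantizing each weight now
        scores[done] = np.inf
        p = int(np.argmin(scores))
        # Closed-form update: spread the rounding error of w[p] over the other weights.
        w -= ((w[p] - q[p]) / Hinv[p, p]) * Hinv[:, p]
        # Remove weight p from the inverse Hessian so it stays fixed from here on.
        Hinv -= np.outer(Hinv[:, p], Hinv[p, :]) / Hinv[p, p]
        Hinv[p, :] = 0.0
        Hinv[:, p] = 0.0
        done[p] = True
        w[p] = q[p]
    return w


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((8, 64))     # calibration inputs: (d_col, n_samples)
    w = rng.standard_normal(8)
    w_q = obq_quantize_row(w, 2.0 * X @ X.T, scale=0.25, maxq=7)
    print("quantized levels:", np.round(w_q / 0.25).astype(int))
```

With an identity Hessian there is no interaction between weights, so the sketch reduces to plain round-to-nearest; the compensation step only matters when the calibration inputs are correlated.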


This review was generated by AI and checked by a human editor.