QUICK REVIEW

[Paper Review] Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding

Song Han, Huizi Mao | arXiv (Cornell University) | Oct 1, 2015

Advanced Neural Network Applications | References: 22 | Citations: 3,526

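One-line summary

Introduces a three-stage pipeline (pruning, trained quantization with weight sharing, and Huffman coding) that compresses deep networks without loss of accuracy, so that models fit in on-chip storage and run with better energy efficiency.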

ABSTRACT

Neural networks are both computationally intensive and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources. To address this limitation, we introduce "deep compression", a three stage pipeline: pruning, trained quantization and Huffman coding, that work together to reduce the storage requirement of neural networks by 35x to 49x without affecting their accuracy. Our method first prunes the network by learning only the important connections. Next, we quantize the weights to enforce weight sharing, finally, we apply Huffman coding. After the first two steps we retrain the network to fine tune the remaining connections and the quantized centroids. Pruning, reduces the number of connections by 9x to 13x; Quantization then reduces the number of bits that represent each connection from 32 to 5. On the ImageNet dataset, our method reduced the storage required by AlexNet by 35x, from 240MB to 6.9MB, without loss of accuracy. Our method reduced the size of VGG-16 by 49x from 552MB to 11.3MB, again with no loss of accuracy. This allows fitting the model into on-chip SRAM cache rather than off-chip DRAM memory. Our compression method also facilitates the use of complex neural networks in mobile applications where application size and download bandwidth are constrained. Benchmarked on CPU, GPU and mobile GPU, compressed network has 3x to 4x layerwise speedup and 3x to 7x better energy efficiency.

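Motivation and objectives

Reduce the storage and memory-bandwidth requirements of deep neural networks for mobile and embedded deployment.
Compress model parameters substantially while preserving the original accuracy.
Shrink the model until it fits in on-chip memory, so that on-chip SRAM cache can be used instead of off-chip DRAM.
Demonstrate the compression across multiple architectures (LeNet, AlexNet, VGG-16) on ImageNet and MNIST.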

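Proposed method

Prune the network by removing low-importance connections, then retrain the remaining weights.
Apply trained quantization: cluster the weights and store only a small codebook plus per-weight indices, which enforces weight sharing.
Retrain after quantization to fine-tune the shared weights.
Apply Huffman coding, which exploits the non-uniform distribution of weights and indices for additional compression.
Evaluate the compression on the MNIST and ImageNet benchmarks, reporting storage reduction and accuracy.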
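
The pruning and weight-sharing steps above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: magnitude thresholding as the pruning criterion, 1-D k-means with linearly spaced initial centroids, and the helper names (`magnitude_prune`, `share_weights`, `reconstruct`, `codebook_compression_rate`) are all assumptions made for illustration, and the retraining steps are omitted entirely. The compression-rate helper counts only the index bits and the codebook, ignoring any overhead for storing the sparse structure.

```python
import math

import numpy as np


def magnitude_prune(weights, sparsity):
    """Zero out (roughly) the smallest-magnitude fraction `sparsity` of weights."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return weights.copy(), np.ones(weights.shape, dtype=bool)
    # k-th smallest magnitude; everything at or below it is pruned.
    threshold = np.partition(flat, k - 1)[k - 1]
    mask = np.abs(weights) > threshold
    return weights * mask, mask


def share_weights(weights, mask, n_clusters, n_iter=20):
    """Cluster surviving weights with 1-D k-means; return (codebook, indices)."""
    values = weights[mask]
    # Assumption: centroids start evenly spaced over the weight range.
    centroids = np.linspace(values.min(), values.max(), n_clusters)
    for _ in range(n_iter):
        idx = np.argmin(np.abs(values[:, None] - centroids[None, :]), axis=1)
        for c in range(n_clusters):
            members = values[idx == c]
            if members.size:
                centroids[c] = members.mean()
    idx = np.argmin(np.abs(values[:, None] - centroids[None, :]), axis=1)
    return centroids, idx


def reconstruct(shape, mask, centroids, idx):
    """Rebuild a dense matrix from the mask, codebook, and per-weight indices."""
    out = np.zeros(shape, dtype=centroids.dtype)
    out[mask] = centroids[idx]
    return out


def codebook_compression_rate(n, k, b=32):
    """Ratio of n b-bit weights to n index entries plus a k-entry codebook."""
    index_bits = math.ceil(math.log2(k))
    return (n * b) / (n * index_bits + k * b)


# Toy layer (random data, purely hypothetical).
rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64)).astype(np.float32)
pruned, mask = magnitude_prune(w, sparsity=0.9)
codebook, idx = share_weights(pruned, mask, n_clusters=32)  # 32 clusters -> 5-bit indices
w_hat = reconstruct(w.shape, mask, codebook, idx)
```

In a real pipeline, each stage would be followed by retraining so that the surviving weights and the shared centroids can recover accuracy; the sketch only shows the data layout that makes the storage savings possible.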

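Experimental results

Research questions

RQ1 Can pruning remove redundant connections in large CNNs without hurting accuracy?
RQ2 How much storage can weight sharing via trained quantization save while maintaining performance?
RQ3 Does Huffman coding provide additional compression beyond pruning and quantization, and if so, how much?
RQ4 What is the practical impact of Deep Compression on storage, speed, and energy on real hardware?
RQ5 How do these techniques interact across architectures (LeNet, AlexNet, VGG-16) and datasets (MNIST, ImageNet)?

Key findings

Model storage is reduced by 35x to 49x across networks without loss of accuracy.
AlexNet shrinks from 240MB to 6.9MB (35x reduction); VGG-16 shrinks from 552MB to 11.3MB (49x reduction).
Pruning alone reduces the number of parameters by 9x to 13x; quantization reduces the bits per connection from 32 to as few as 5; Huffman coding adds a further 20% to 30% of compression.
Pruning and quantization are complementary: combined, they reach roughly 3% of the original size without loss of accuracy.
Compression makes on-chip SRAM storage possible, which reduces energy use and enables mobile deployment. Non-batched inference shows a 3x to 4x speedup and 3x to 7x better energy efficiency.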
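
To see why Huffman coding helps once the quantized indices are unevenly distributed, here is a small standard-library sketch. The index stream is a made-up toy example rather than data from the paper, and `huffman_code_lengths` is a hypothetical helper name; it computes only the code lengths, which is enough to compare against fixed-width storage.

```python
import heapq
from collections import Counter


def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} for a Huffman code over `symbols`."""
    counts = Counter(symbols)
    if len(counts) == 1:
        return {next(iter(counts)): 1}
    # Heap entries: (weight, unique tie-breaker, {symbol: depth so far}).
    heap = [(n, i, {s: 0}) for i, (s, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every contained symbol one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]


# Hypothetical skewed stream of 2-bit cluster indices (16 entries).
indices = [0] * 8 + [1] * 4 + [2] * 2 + [3] * 2
lengths = huffman_code_lengths(indices)
huffman_bits = sum(lengths[s] for s in indices)
fixed_bits = 2 * len(indices)
print(f"fixed-width: {fixed_bits} bits, Huffman: {huffman_bits} bits")
# prints "fixed-width: 32 bits, Huffman: 28 bits"
```

The more skewed the distribution, the larger the saving; a perfectly uniform index distribution would gain nothing from this step.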

This review was generated by AI and checked by a human editor.