QUICK REVIEW

[Paper Review] Learning N:M Fine-grained Structured Sparse Neural Networks From Scratch

Aojun Zhou, Yukun Ma|arXiv (Cornell University)|Feb 8, 2021

Advanced Neural Network Applications | References: 43 | Citations: 74

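One-sentence summary

This work uses SR-STE to train N:M fine-grained structured sparse networks from scratch, yielding hardware-friendly sparsity that reaches up to roughly 2x speed-up on Nvidia A100 GPUs while maintaining accuracy; it also introduces the SAD metric for analyzing how the sparse topology changes during training.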

ABSTRACT

Sparsity in Deep Neural Networks (DNNs) has been widely studied to compress and accelerate the models on resource-constrained environments. It can be generally categorized into unstructured fine-grained sparsity that zeroes out multiple individual weights distributed across the neural network, and structured coarse-grained sparsity which prunes blocks of sub-networks of a neural network. Fine-grained sparsity can achieve a high compression ratio but is not hardware friendly and hence receives limited speed gains. On the other hand, coarse-grained sparsity cannot concurrently achieve both apparent acceleration on modern GPUs and decent performance. In this paper, we are the first to study training from scratch an N:M fine-grained structured sparse network, which can maintain the advantages of both unstructured fine-grained sparsity and structured coarse-grained sparsity simultaneously on specifically designed GPUs. Specifically, a 2:4 sparse network could achieve 2x speed-up without performance drop on Nvidia A100 GPUs. Furthermore, we propose a novel and effective ingredient, sparse-refined straight-through estimator (SR-STE), to alleviate the negative influence of the approximated gradients computed by vanilla STE during optimization. We also define a metric, Sparse Architecture Divergence (SAD), to measure the sparse network's topology change during the training process. Finally, we justify SR-STE's advantages with SAD and demonstrate the effectiveness of SR-STE by performing comprehensive experiments on various tasks. Source codes and models are available at https://github.com/NM-sparsity/NM-sparsity.

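Motivation and objectives

Motivates combining the strengths of unstructured and structured sparsity to accelerate DNNs on GPUs.
Proposes a framework for training N:M sparse networks from scratch without a large performance drop.
Introduces SR-STE to mitigate gradient-induced perturbation of the sparse architecture during training.
Defines Sparse Architecture Divergence (SAD) to quantify topology change during training.
Demonstrates effectiveness across vision tasks and machine translation.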

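Proposed method

Defines N:M sparsity: in every group of M consecutive weights, at most N weights are nonzero.
Extends the straight-through estimator (STE) so that backpropagation works with online pruning during training.
Introduces Sparse Architecture Divergence (SAD) to measure topology change during training.
Proposes the sparse-refined STE (SR-STE), which adds a regularization term penalizing the pruned weights, to stabilize the architecture during training.
Evaluates on image classification, object detection, instance segmentation, optical flow, and machine translation, comparing against ASP, STE, and other sparsity methods.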
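The N:M constraint and the SR-STE idea described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names `nm_mask` and `sr_ste_step`, the flat-list weight layout, and the hyperparameters `lr` and `lam` are all assumptions made for the example, and the update rule reflects one reading of "a regularization term penalizing the pruned weights" (a decay applied only to currently masked-out weights).

```python
def nm_mask(weights, n, m):
    """Return a 0/1 mask keeping the n largest-magnitude entries in each group of m."""
    if len(weights) % m != 0:
        raise ValueError("number of weights must be a multiple of m")
    mask = []
    for start in range(0, len(weights), m):
        group = weights[start:start + m]
        # Indices of the n largest |w| in this group (ties go to the earlier index).
        keep = set(sorted(range(m), key=lambda i: -abs(group[i]))[:n])
        mask.extend(1 if i in keep else 0 for i in range(m))
    return mask


def sr_ste_step(weights, grad_wrt_sparse, n, m, lr, lam):
    """One SGD step on the dense weights under an SR-STE-style rule (assumed form).

    grad_wrt_sparse is the loss gradient evaluated at the masked weights; the
    straight-through estimator passes it to the dense weights unchanged.
    The extra term lam * (1 - mask) * w shrinks only the currently pruned weights.
    """
    mask = nm_mask(weights, n, m)
    return [
        w - lr * (g + lam * (1 - k) * w)
        for w, g, k in zip(weights, grad_wrt_sparse, mask)
    ]


w = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.1]
mask = nm_mask(w, 2, 4)
sparse_w = [wi if k else 0.0 for wi, k in zip(w, mask)]
print(mask)      # [1, 0, 0, 1, 0, 1, 1, 0]
print(sparse_w)  # [0.9, 0.0, 0.0, -0.7, 0.0, 0.3, -0.4, 0.0]
```

With a zero gradient, `sr_ste_step` leaves kept weights untouched and multiplies each pruned weight by `1 - lr * lam`, which makes it less likely that a pruned weight grows back and flips the mask on the next step.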

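Experimental results

Research questions

RQ1: Can N:M sparse networks be trained from scratch without sacrificing performance?
RQ2: Does SR-STE reduce the mismatch in weight gradients after pruning and stabilize the sparse architecture during training?
RQ3: How do different N:M patterns (e.g., 2:4, 4:8, 1:4, 2:8) affect accuracy and speed-up across tasks?
RQ4: Does the proposed method preserve the transferability of sparse models to downstream tasks?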

Key findings

Model	Method	Sparse Pattern	Top-1 Acc (%)	Params (M)	FLOPs (G)
ResNet50	Dense	-	77.3	25.6	4.09
ResNet50	SR-STE	2:4	77.0	13.8	2.15
ResNet50	SR-STE	4:8	77.4	13.8	2.15
ResNet50	SR-STE	1:4	75.3	7.93	1.17
ResNet50	SR-STE	2:8	76.2	7.93	1.17
ResNet50 x1.25	SR-STE	2:8	77.5	11.8	1.79
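
As a quick check on the table, the reductions in parameters and FLOPs can be compared with the nominal keep ratio N/M. The sketch below uses only the ResNet50 rows listed above; the variable and function names are chosen for the example. The ResNet50 x1.25 row is left out because the table gives no dense baseline for that width.

```python
DENSE_PARAMS_M = 25.6  # dense ResNet50 parameters (millions), from the table
DENSE_FLOPS_G = 4.09   # dense ResNet50 FLOPs (billions), from the table

# (pattern, N, M, params in millions, FLOPs in billions), from the table rows
rows = [
    ("2:4", 2, 4, 13.8, 2.15),
    ("4:8", 4, 8, 13.8, 2.15),
    ("1:4", 1, 4, 7.93, 1.17),
    ("2:8", 2, 8, 7.93, 1.17),
]


def ratios(params_m, flops_g):
    """Fraction of dense parameters and dense FLOPs that remain."""
    return params_m / DENSE_PARAMS_M, flops_g / DENSE_FLOPS_G


for pattern, n, m, params_m, flops_g in rows:
    p_ratio, f_ratio = ratios(params_m, flops_g)
    print(f"{pattern}: N/M={n / m:.2f} params={p_ratio:.2f} flops={f_ratio:.2f}")
```

The remaining fractions come out slightly above N/M (about 0.54 and 0.53 for the 50% patterns, about 0.31 and 0.29 for the 25% patterns); the table alone does not say which parts of the network account for the difference.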

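A 2:4 sparse network can achieve roughly 2x speed-up on Nvidia A100 GPUs while losing almost no accuracy relative to the dense baseline for ResNet-50 on ImageNet (77.0% vs. 77.3% Top-1).
4:8 sparsity (the same 50% sparsity) outperforms 2:4 for ResNet-50 on ImageNet at comparable FLOPs (77.4% vs. 77.0% Top-1).
SR-STE consistently outperforms the STE and ASP baselines in Top-1 accuracy on ImageNet across multiple patterns (e.g., 2:4, 4:8).
On COCO object detection, 2:8 sparsity gives mAP close to the dense baseline, and 4:8 can even exceed dense performance for Faster R-CNN with ResNet-50.
On optical flow (RAFT) and neural machine translation (Transformer), SR-STE matches dense-model performance while substantially reducing parameter count and FLOPs.
The SAD metric correlates with performance, and it decreases as SR-STE stabilizes the sparse architecture.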
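
To make the SAD finding concrete, here is a minimal sketch of how such a divergence could be computed between two training snapshots. It assumes SAD counts the mask entries that differ between the two snapshots, summed over layers; this review does not spell out the exact definition or any normalization, so treat the function `sad` and the toy masks as illustrative only.

```python
def sad(masks_a, masks_b):
    """Count mask entries that differ between two snapshots, summed over layers.

    Each argument is a list of per-layer 0/1 masks (1 = weight kept, 0 = pruned).
    """
    if len(masks_a) != len(masks_b):
        raise ValueError("snapshots must have the same number of layers")
    total = 0
    for layer_a, layer_b in zip(masks_a, masks_b):
        if len(layer_a) != len(layer_b):
            raise ValueError("layer masks must have the same length")
        total += sum(abs(a - b) for a, b in zip(layer_a, layer_b))
    return total


# Two toy snapshots of a two-layer network under 2:4 sparsity.
early = [[1, 0, 0, 1], [0, 1, 1, 0]]
late = [[1, 0, 1, 0], [0, 1, 1, 0]]
print(sad(early, late))  # 2: one kept weight and one pruned weight swapped in layer 0
```

Under this reading, a mask that stops changing yields a SAD of zero between consecutive snapshots, which matches the description of SAD decreasing as the sparse architecture stabilizes.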


This review was generated by AI and checked by a human editor.