QUICK REVIEW

[Paper Review] Learning N:M Fine-grained Structured Sparse Neural Networks From Scratch

Aojun Zhou, Yukun Ma|arXiv (Cornell University)|Feb 8, 2021

Advanced Neural Network Applications|43 references|Cited by 74

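ONE-SENTENCE SUMMARY

This work trains N:M fine-grained structured sparse networks from scratch using SR-STE, achieving hardware-friendly sparsity that yields up to roughly 2x speed-up on Nvidia A100 GPUs while maintaining accuracy, and introduces the SAD metric to analyze changes in the sparse topology.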

ABSTRACT

Sparsity in Deep Neural Networks (DNNs) has been widely studied to compress and accelerate the models on resource-constrained environments. It can be generally categorized into unstructured fine-grained sparsity that zeroes out multiple individual weights distributed across the neural network, and structured coarse-grained sparsity which prunes blocks of sub-networks of a neural network. Fine-grained sparsity can achieve a high compression ratio but is not hardware friendly and hence receives limited speed gains. On the other hand, coarse-grained sparsity cannot concurrently achieve both apparent acceleration on modern GPUs and decent performance. In this paper, we are the first to study training from scratch an N:M fine-grained structured sparse network, which can maintain the advantages of both unstructured fine-grained sparsity and structured coarse-grained sparsity simultaneously on specifically designed GPUs. Specifically, a 2:4 sparse network could achieve 2x speed-up without performance drop on Nvidia A100 GPUs. Furthermore, we propose a novel and effective ingredient, sparse-refined straight-through estimator (SR-STE), to alleviate the negative influence of the approximated gradients computed by vanilla STE during optimization. We also define a metric, Sparse Architecture Divergence (SAD), to measure the sparse network's topology change during the training process. Finally, we justify SR-STE's advantages with SAD and demonstrate the effectiveness of SR-STE by performing comprehensive experiments on various tasks. Source codes and models are available at https://github.com/NM-sparsity/NM-sparsity.

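RESEARCH MOTIVATION AND OBJECTIVES

Motivate combining the benefits of unstructured and structured sparsity to accelerate deep neural networks on GPUs.
Propose a framework for training N:M sparse networks from scratch without significant performance loss.
Introduce SR-STE to mitigate gradient-induced perturbations of the sparse architecture during training.
Define Sparse Architecture Divergence (SAD) to quantify topology change during training.
Demonstrate effectiveness across domains such as vision tasks and machine translation.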

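PROPOSED METHOD

Define N:M sparsity: in every group of M consecutive weights, at most N are nonzero.
Extend the Straight-through Estimator (STE) so that backpropagation works with on-the-fly pruning during training.
Introduce Sparse Architecture Divergence (SAD) to measure topology change during training.
Propose Sparse-refined STE (SR-STE), which adds a regularization term that penalizes the pruned weights in order to stabilize the architecture during training.
Evaluate on image classification, object detection, instance segmentation, optical flow, and machine translation; compare against ASP, STE, and other sparse methods.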
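
The N:M constraint and the SR-STE penalty described in the list above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `nm_mask` and `sr_ste_step`, the magnitude-based selection of which weights to keep, and the values of `lr` and `lam` are all hypothetical choices made for the example.

```python
from typing import List


def nm_mask(weights: List[float], n: int, m: int) -> List[int]:
    """Return a 0/1 mask that keeps the n largest-magnitude weights per group of m.

    Assumption: pruning is magnitude-based; ties are broken by position.
    """
    if len(weights) % m != 0:
        raise ValueError("number of weights must be a multiple of m")
    mask = [0] * len(weights)
    for start in range(0, len(weights), m):
        group = range(start, start + m)
        keep = sorted(group, key=lambda i: -abs(weights[i]))[:n]
        for i in keep:
            mask[i] = 1
    return mask


def sr_ste_step(
    weights: List[float],
    grads: List[float],
    n: int,
    m: int,
    lr: float = 0.1,    # placeholder learning rate
    lam: float = 0.01,  # placeholder strength of the penalty on pruned weights
) -> List[float]:
    """One SR-STE-style update of the dense weights (illustrative only).

    `grads` is assumed to be the gradient computed with the pruned weights;
    STE applies it unchanged to the dense weights. The extra term
    lam * (1 - mask) * w shrinks only the weights that are currently pruned.
    """
    mask = nm_mask(weights, n, m)
    return [
        w - lr * (g + lam * (1 - k) * w)
        for w, g, k in zip(weights, grads, mask)
    ]


dense = [0.1, -0.9, 0.5, 0.05, 1.0, 0.2, -0.3, 0.0]
mask = nm_mask(dense, 2, 4)
pruned = [w * k for w, k in zip(dense, mask)]
print(mask)    # 2:4 mask: two kept weights in each group of four
print(pruned)  # weights actually used in the forward pass
```

With `lam = 0` the update reduces to plain STE. The penalty term is what discourages pruned weights from growing back and flipping the mask from one step to the next.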

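EXPERIMENTAL RESULTS

Research Questions

RQ1: Can we train N:M sparse networks from scratch without sacrificing performance?
RQ2: Does SR-STE reduce the gradient mismatch on pruned weights and stabilize the sparse architecture during training?
RQ3: How do different N:M patterns (e.g., 2:4, 4:8, 1:4, 2:8) affect accuracy and speed-up across tasks?
RQ4: Does the proposed method preserve the transferability of sparse models to downstream tasks?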
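
To make the patterns named in RQ3 concrete, the short sketch below computes two purely combinatorial quantities for each one: the fraction of weights kept (N/M) and the number of distinct ways to choose N kept positions within a group of M. The helper names are hypothetical, and the numbers say nothing by themselves about accuracy or speed-up.

```python
from math import comb

# Patterns listed in RQ3, written as (N, M).
PATTERNS = [(2, 4), (4, 8), (1, 4), (2, 8)]


def density(n: int, m: int) -> float:
    """Fraction of weights kept when exactly n of every m are nonzero."""
    return n / m


def supports_per_group(n: int, m: int) -> int:
    """Number of ways to choose which n of the m positions stay nonzero."""
    return comb(m, n)


for n, m in PATTERNS:
    print(f"{n}:{m}  density={density(n, m):.2f}  "
          f"supports per group={supports_per_group(n, m)}")
```

2:4 and 4:8 share the same 50% density, but a group of eight admits 70 placements versus 6 for a group of four; likewise 2:8 admits 28 placements versus 4 for 1:4 at 25% density.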

Key Findings

Model	Method	Sparse Pattern	Top-1 Acc (%)	Params (M)	FLOPs (G)
ResNet50	Dense	-	77.3	25.6	4.09
ResNet50	SR-STE	2:4	77.0	13.8	2.15
ResNet50	SR-STE	4:8	77.4	13.8	2.15
ResNet50	SR-STE	1:4	75.3	7.93	1.17
ResNet50	SR-STE	2:8	76.2	7.93	1.17
ResNet50 x1.25	SR-STE	2:8	77.5	11.8	1.79
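
As a quick sanity check on the table, the sketch below compares each pattern's nominal density N/M with the fraction of parameters and FLOPs that actually remain relative to dense ResNet50. The figures are copied from the rows above and the helper names are hypothetical. The ResNet50 x1.25 row is left out because the table gives no dense counterpart for it.

```python
# Values copied from the table: (params in millions, FLOPs in billions).
DENSE = (25.6, 4.09)
SPARSE = {
    "2:4": (13.8, 2.15),
    "4:8": (13.8, 2.15),
    "1:4": (7.93, 1.17),
    "2:8": (7.93, 1.17),
}


def remaining_fraction(pattern: str) -> tuple:
    """Return (params, FLOPs) of the sparse model as a fraction of dense."""
    params, flops = SPARSE[pattern]
    return params / DENSE[0], flops / DENSE[1]


def nominal_density(pattern: str) -> float:
    """N/M for a pattern string such as '2:4'."""
    n, m = (int(x) for x in pattern.split(":"))
    return n / m


for pattern in SPARSE:
    p, f = remaining_fraction(pattern)
    print(f"{pattern}: nominal {nominal_density(pattern):.2f}, "
          f"params {p:.3f}, FLOPs {f:.3f}")
```

The remaining fractions come out slightly above N/M in every row. One plausible reading is that part of the network is left dense, but this summary does not say which parts, so treat that as an assumption.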

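A 2:4 sparse ResNet-50 on ImageNet can achieve roughly 2x speed-up over the dense baseline with negligible accuracy loss.
4:8 sparsity (the same 50% sparsity) outperforms 2:4 on ResNet-50 on ImageNet at similar FLOPs (77.4% vs. 77.0% Top-1).
SR-STE consistently improves Top-1 accuracy on ImageNet across multiple patterns (e.g., 2:4, 4:8), outperforming the STE and ASP baselines.
On COCO object detection using Faster R-CNN with ResNet-50, 2:8 sparsity reaches mAP close to the dense baseline, and 4:8 can even exceed the dense model.
On optical flow (RAFT) and neural machine translation (Transformer), SR-STE matches dense-model performance with significantly fewer parameters and FLOPs.
The SAD metric correlates with performance, and it decreases when SR-STE stabilizes the sparse architecture.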
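
The SAD finding can be illustrated with a small sketch. It assumes SAD is the number of binary mask entries that differ between two training checkpoints, summed over layers; this summary only states that SAD quantifies topology change, so the exact definition, the function name `sad`, and the layer names below are assumptions made for illustration.

```python
from typing import Dict, List


def sad(masks_a: Dict[str, List[int]], masks_b: Dict[str, List[int]]) -> int:
    """Count mask entries that differ between two checkpoints, over all layers.

    Assumed definition: L1 distance between the 0/1 masks of the two networks.
    """
    if masks_a.keys() != masks_b.keys():
        raise ValueError("both checkpoints must contain the same layers")
    total = 0
    for name in masks_a:
        a, b = masks_a[name], masks_b[name]
        if len(a) != len(b):
            raise ValueError(f"mask size mismatch in layer {name!r}")
        total += sum(x != y for x, y in zip(a, b))
    return total


# Hypothetical 2:4 masks for two layers at two training steps.
step_0 = {"conv1": [1, 1, 0, 0], "fc": [0, 1, 0, 1]}
step_1 = {"conv1": [1, 0, 1, 0], "fc": [0, 1, 0, 1]}
print(sad(step_0, step_1))  # one kept position moved in conv1, so two entries differ
```

Under this reading, a training run whose masks keep flipping accumulates a large SAD, while a stabilized sparse architecture yields values near zero between consecutive checkpoints.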

This review was generated by AI and reviewed by human editors.