QUICK REVIEW

[Paper Review] Global Sparse Momentum SGD for Pruning Very Deep Neural Networks

Xiaohan Ding, Guiguang Ding|arXiv (Cornell University)|Sep 27, 2019

Speech and Audio Processing | Cited by 125

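One-Sentence Summary

GSM globally selects a limited number of parameters to update actively during training while pushing the remaining parameters toward zero through momentum-based weight decay, achieving lossless pruning without retraining and automatically discovering per-layer sparsity.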

ABSTRACT

Deep Neural Network (DNN) is powerful but computationally expensive and memory intensive, thus impeding its practical usage on resource-constrained front-end devices. DNN pruning is an approach for deep model compression, which aims at eliminating some parameters with tolerable performance degradation. In this paper, we propose a novel momentum-SGD-based optimization method to reduce the network complexity by on-the-fly pruning. Concretely, given a global compression ratio, we categorize all the parameters into two parts at each training iteration which are updated using different rules. In this way, we gradually zero out the redundant parameters, as we update them using only the ordinary weight decay but no gradients derived from the objective function. As a departure from prior methods that require heavy human works to tune the layer-wise sparsity ratios, prune by solving complicated non-differentiable problems or finetune the model after pruning, our method is characterized by 1) global compression that automatically finds the appropriate per-layer sparsity ratios; 2) end-to-end training; 3) no need for a time-consuming re-training process after pruning; and 4) superior capability to find better winning tickets which have won the initialization lottery.

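Motivation and Objectives

Advance model compression so that networks can be deployed on resource-constrained devices without a large loss in accuracy.
Develop an end-to-end pruning method that directly controls the global compression ratio.
Eliminate the need for layer-wise hyperparameter tuning and for retraining after pruning.
Discover per-layer sparsity ratios automatically during training.
Show that GSM finds strong winning tickets in deep networks and achieves lossless pruning.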

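Proposed Method

Given a global compression ratio C, set Q = |Theta|/C and split each SGD update into an active part and a passive part.
At each iteration, compute a first-order Taylor importance metric for every parameter: T(x,y,w) = |(∂L/∂w) w|.
Apply activation selection: the top-Q parameters by importance are active and receive gradients, while the rest are updated passively by weight decay alone.
Use momentum SGD with a mask B^(k) to implement the passive updates, which also lets pruned connections be reactivated occasionally.
Allow implicit reactivation while many parameters shrink steadily toward zero, with no explicit fine-tuning.
After training, prune globally by keeping the Q parameters with the largest magnitude.
Demonstrate improved winning tickets by comparing the tickets found by GSM with magnitude-based tickets.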
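
The steps listed above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes all parameters are flattened into one vector, uses the common momentum form (momentum buffer accumulates weight decay plus masked gradient), and draws random gradients as a stand-in for a real loss. The names `gsm_step`, `global_magnitude_prune`, `lr`, `beta`, and `weight_decay`, along with their default values, are hypothetical.

```python
import numpy as np

def gsm_step(theta, grad, z, q, lr=0.1, beta=0.9, weight_decay=5e-4):
    """One GSM-style update on a flat parameter vector (illustrative sketch)."""
    # First-order Taylor importance |dL/dw * w| for every parameter.
    importance = np.abs(grad * theta)
    # Global selection: mask is 1 for the Q most important parameters, 0 elsewhere.
    mask = np.zeros_like(theta)
    mask[np.argpartition(importance, -q)[-q:]] = 1.0
    # The momentum buffer accumulates weight decay for all parameters,
    # but objective gradients only for the active (masked-in) ones.
    z = beta * z + weight_decay * theta + mask * grad
    theta = theta - lr * z
    return theta, z, mask

def global_magnitude_prune(theta, q):
    """Keep the Q largest-magnitude parameters and set the rest to zero."""
    pruned = np.zeros_like(theta)
    keep = np.argpartition(np.abs(theta), -q)[-q:]
    pruned[keep] = theta[keep]
    return pruned

# Toy usage with assumed sizes: 1000 parameters, compression ratio C = 10.
rng = np.random.default_rng(0)
theta = rng.normal(size=1000)
z = np.zeros_like(theta)
q = theta.size // 10  # Q = |Theta| / C
for _ in range(5):
    grad = rng.normal(size=theta.size)  # stand-in for dL/dtheta
    theta, z, mask = gsm_step(theta, grad, z, q)
sparse_theta = global_magnitude_prune(theta, q)
```

Because the mask is recomputed at every step, a parameter that was passive in one iteration can become active again in a later one if its importance rises, which is one way to read the reactivation behavior described in the list.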

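Experimental Results

Research Questions

RQ1: Can the global compression ratio be controlled directly in end-to-end training to reach high sparsity without losing accuracy?
RQ2: How does the momentum-based two-part update affect pruning speed, accuracy, and the per-layer sparsity distribution?
RQ3: Can GSM reactivate connections implicitly and avoid costly retraining after pruning?
RQ4: Are the winning tickets found by GSM more effective than those obtained by magnitude-based pruning?
RQ5: Can GSM effectively prune very deep networks (e.g., ResNet-50, DenseNet-40) and handle large-scale datasets (ImageNet)?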

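Key Findings

GSM reaches high compression ratios (e.g., up to 125x on LeNet-5, and 8–10x on CIFAR-10 with ResNet-56 and DenseNet-40) with little or no accuracy loss.
GSM discovers per-layer sparsity automatically, aligning pruning with layer sensitivity without manual hyperparameter tuning.
Momentum accelerates the zeroing-out of redundant parameters, giving faster convergence to a sparse state.
Reactivation during training helps recover from early pruning mistakes and preserves accuracy.
In several experiments (e.g., LeNet-5, LeNet-300), GSM finds stronger winning tickets than magnitude-based pruning.
Under similar conditions, GSM outperforms a prior method (L-OBS) on ResNet-50 pruning.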
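
To illustrate how a single global budget can yield different sparsity in each layer, the sketch below keeps the Q largest-magnitude weights across all layers and reports the fraction zeroed in each one. It is a toy example under stated assumptions: the layer names, sizes, and weight scales are invented, and `per_layer_sparsity` is a hypothetical helper, not something taken from the paper.

```python
import numpy as np

def per_layer_sparsity(layers, q):
    """Fraction of weights zeroed in each layer when only the
    global top-Q magnitudes are kept (ties at the threshold are kept)."""
    flat = np.concatenate([np.abs(w).ravel() for w in layers.values()])
    # Magnitude of the Q-th largest weight across all layers.
    threshold = np.partition(flat, -q)[-q]
    return {name: float(np.mean(np.abs(w) < threshold))
            for name, w in layers.items()}

# Two made-up layers: a small one with large weights, a big one with small weights.
rng = np.random.default_rng(0)
layers = {
    "conv1": rng.normal(scale=1.0, size=500),
    "fc1": rng.normal(scale=0.1, size=5000),
}
q = sum(w.size for w in layers.values()) // 10  # global compression ratio of 10
print(per_layer_sparsity(layers, q))
```

No per-layer ratio is set anywhere in this sketch; the differences between layers come only from where the single global threshold falls relative to each layer's weights.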


This review was generated by AI and reviewed by a human editor.