
[Paper Review] Rethinking the Value of Network Pruning

Zhuang Liu, Mingjie Sun|arXiv (Cornell University)|Oct 11, 2018
Anomaly Detection Techniques and Applications | Cited by 742
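One-Sentence Summary

This paper shows that, in structured pruning, training the pruned model from scratch often matches or exceeds fine-tuning with inherited weights, and that the pruned architecture itself is the key driver of efficiency, which suggests that pruning can serve as a form of architecture search.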

ABSTRACT

Network pruning is widely used for reducing the heavy inference cost of deep models in low-resource settings. A typical pruning algorithm is a three-stage pipeline, i.e., training (a large model), pruning and fine-tuning. During pruning, according to a certain criterion, redundant weights are pruned and important weights are kept to best preserve the accuracy. In this work, we make several surprising observations which contradict common beliefs. For all state-of-the-art structured pruning algorithms we examined, fine-tuning a pruned model only gives comparable or worse performance than training that model with randomly initialized weights. For pruning algorithms which assume a predefined target network architecture, one can get rid of the full pipeline and directly train the target network from scratch. Our observations are consistent for multiple network architectures, datasets, and tasks, which imply that: 1) training a large, over-parameterized model is often not necessary to obtain an efficient final model, 2) learned "important" weights of the large model are typically not useful for the small pruned model, 3) the pruned architecture itself, rather than a set of inherited "important" weights, is more crucial to the efficiency in the final model, which suggests that in some cases pruning can be useful as an architecture search paradigm. Our results suggest the need for more careful baseline evaluations in future research on structured pruning methods. We also compare with the "Lottery Ticket Hypothesis" (Frankle & Carbin 2019), and find that with optimal learning rate, the "winning ticket" initialization as used in Frankle & Carbin (2019) does not bring improvement over random initialization.

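Motivation and Objectives

  • Question whether a large, over-parameterized model must be trained before pruning.
  • Evaluate whether fine-tuning a pruned model with inherited weights outperforms training the same pruned model from scratch.
  • Distinguish the effects of pruning toward a predefined target architecture from pruning toward an automatically discovered one.
  • Assess whether pruning works mainly as architecture search rather than as weight selection.
  • Compare structured pruning with unstructured pruning, and relate the findings to the Lottery Ticket Hypothesis.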

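Proposed Method

  • Categorize pruning methods by whether the target architecture is predefined or automatically discovered.
  • Train pruned models from scratch (Scratch-E, Scratch-B) and compare them with fine-tuning from inherited weights.
  • Apply several pruning methods (L1-norm based filter pruning, ThiNet, regression-based feature reconstruction, Network Slimming, Sparse Structure Selection) as well as one unstructured magnitude-based pruning method.
  • Evaluate VGG, ResNet, and DenseNet variants on CIFAR-10, CIFAR-100, and ImageNet.
  • Analyze the parameter efficiency and sparsity patterns of the pruned architectures.
  • Compare with the Lottery Ticket Hypothesis and discuss the implications for architecture search.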
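
The difference between the structured and unstructured methods listed above can be illustrated with a minimal Python sketch. It assumes a layer is given as a list of flattened filters; the function names `l1_filter_prune` and `magnitude_prune`, the `keep_ratio` and `sparsity` parameters, and the toy weights are hypothetical and are not taken from the paper's code.

```python
from typing import List

Filter = List[float]  # one flattened convolution filter (hypothetical representation)


def l1_filter_prune(filters: List[Filter], keep_ratio: float) -> List[int]:
    """Structured pruning: return sorted indices of the filters with the largest L1 norms."""
    n_keep = max(1, int(len(filters) * keep_ratio))
    norms = [sum(abs(w) for w in f) for f in filters]
    ranked = sorted(range(len(filters)), key=lambda i: norms[i], reverse=True)
    return sorted(ranked[:n_keep])


def magnitude_prune(filters: List[Filter], sparsity: float) -> List[Filter]:
    """Unstructured pruning: zero the smallest-magnitude individual weights.

    Ties at the threshold are all zeroed, so slightly more than `sparsity`
    of the weights may be removed when magnitudes repeat.
    """
    flat = sorted(abs(w) for f in filters for w in f)
    n_prune = int(len(flat) * sparsity)
    if n_prune == 0:
        return [list(f) for f in filters]
    threshold = flat[n_prune - 1]
    return [[0.0 if abs(w) <= threshold else w for w in f] for f in filters]


# Toy layer with 4 filters of 3 weights each (illustrative values only).
layer = [
    [0.9, -0.05, 0.7],
    [0.3, -0.02, 0.01],
    [0.4, 0.03, -0.6],
    [0.1, -0.2, 0.15],
]

kept = l1_filter_prune(layer, keep_ratio=0.5)
print(kept)  # [0, 2]: whole filters survive, so the layer becomes narrower

sparse = magnitude_prune(layer, sparsity=0.5)
print(sparse[0])  # [0.9, 0.0, 0.7]: same shape, individual weights zeroed
```

On this toy layer the structured criterion keeps filters 0 and 2 and yields a smaller dense layer, whereas the unstructured criterion keeps the layer's shape and zeroes half of the individual weights.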

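Experimental Results

Research Questions

  • RQ1: With both predefined and automatically discovered pruning targets, do pruned models fine-tuned from inherited weights outperform the same pruned architectures trained from scratch?
  • RQ2: How much do final efficiency and accuracy depend on the pruned architecture itself rather than on the retained weights?
  • RQ3: Can pruning serve as an effective architecture search method that yields parameter-efficient architectures without pre-training a large model?
  • RQ4: How do structured and unstructured pruning differ in whether the pruned model can be trained from scratch on a large-scale dataset such as ImageNet?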

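Key Findings

  • For predefined structured pruning, models trained from scratch match or exceed the accuracy of their fine-tuned counterparts; Scratch-B often outperforms Scratch-E and sometimes outperforms fine-tuning even on ImageNet.
  • For automatic structured pruning, pruned models trained from scratch generally match or outperform fine-tuned models, and Scratch-B is often the stronger of the two scratch variants.
  • For unstructured pruning on ImageNet, training from scratch performs worse than fine-tuning, which highlights a difference from structured pruning.
  • Pruned architectures obtained with automatic pruning methods are more parameter-efficient than uniformly pruned architectures, which indicates their value as architecture search.
  • Design patterns from guided/pruned architectures can be transferred to other models and datasets, suggesting practical design principles that extend beyond any particular pruned model.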
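
Scratch-E and Scratch-B differ only in training length: Scratch-E trains the pruned model for the same number of epochs as the large model, while Scratch-B trains it for roughly the same computation budget, estimated from FLOPs. A rough sketch of that bookkeeping follows; the function name `scratch_schedules`, the `max_scale` cap of 2, the proportional scaling of learning-rate milestones, and the example numbers are assumptions made for illustration, not the authors' exact schedule.

```python
from typing import List, Tuple

Schedule = Tuple[int, List[int]]  # (total epochs, learning-rate decay milestones)


def scratch_schedules(
    base_epochs: int,
    base_milestones: List[int],
    full_flops: float,
    pruned_flops: float,
    max_scale: float = 2.0,  # assumed cap on how far training is extended
) -> Tuple[Schedule, Schedule]:
    """Return (Scratch-E schedule, Scratch-B schedule) for a pruned model."""
    if not 0 < pruned_flops <= full_flops:
        raise ValueError("pruned_flops must be in (0, full_flops]")
    # Scratch-B spends the FLOPs saved per epoch on extra epochs, up to the cap.
    scale = min(full_flops / pruned_flops, max_scale)
    scratch_e = (base_epochs, list(base_milestones))
    scratch_b = (
        round(base_epochs * scale),
        [round(m * scale) for m in base_milestones],
    )
    return scratch_e, scratch_b


# Hypothetical case: the pruned model needs 2.5 units of FLOPs versus 4.0 for the full model.
scratch_e, scratch_b = scratch_schedules(160, [80, 120], full_flops=4.0, pruned_flops=2.5)
print(scratch_e)  # (160, [80, 120])
print(scratch_b)  # (256, [128, 192])
```

With the cap in place, a model that saves more than half of the FLOPs is still trained for at most twice the baseline number of epochs, so Scratch-B never spends more computation than training the large model.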


This review was generated by AI and reviewed by human editors.