QUICK REVIEW

[Paper Review] Picking Winning Tickets Before Training by Preserving Gradient Flow

Chaoqi Wang, Guodong Zhang|arXiv (Cornell University)|Feb 18, 2020

Advanced Neural Network Applications|References: 37|Cited by: 148

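One-Sentence Summary

GraSP prunes neural networks at initialization by preserving gradient flow, removing up to 80% of the weights of a VGG-16 on ImageNet with only a 1.6% drop in top-1 accuracy.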

ABSTRACT

Overparameterization has been shown to benefit both the optimization and generalization of neural networks, but large networks are resource hungry at both training and test time. Network pruning can reduce test-time resource requirements, but is typically applied to trained networks and therefore cannot avoid the expensive training process. We aim to prune networks at initialization, thereby saving resources at training time as well. Specifically, we argue that efficient training requires preserving the gradient flow through the network. This leads to a simple but effective pruning criterion we term Gradient Signal Preservation (GraSP). We empirically investigate the effectiveness of the proposed method with extensive experiments on CIFAR-10, CIFAR-100, Tiny-ImageNet and ImageNet, using VGGNet and ResNet architectures. Our method can prune 80% of the weights of a VGG-16 network on ImageNet at initialization, with only a 1.6% drop in top-1 accuracy. Moreover, our method achieves significantly better performance than the baseline at extreme sparsity levels.

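Motivation and Objectives

Prune before training so that resources are saved during training, not only at test time.
Propose a pruning criterion based on gradient flow that accounts for dependencies between weights.
Demonstrate effectiveness on CIFAR-10, CIFAR-100, Tiny-ImageNet, and ImageNet with VGGNet and ResNet architectures.
Analyze how pruning affects training dynamics and relate the findings to the Neural Tangent Kernel (NTK).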

Proposed Method

Introduce Gradient Signal Preservation (GraSP) as the pruning criterion.
Compute a Hessian-gradient product to estimate how removing a weight affects gradient flow.
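Score the weights with S(-θ) = -θ ⊙ (H g), where H is the Hessian and g is the gradient of the loss at initialization, then prune the fraction p of weights with the highest scores (a higher score means that removing the weight reduces gradient flow less).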
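Train the resulting sparse network from its initialization and evaluate its performance.
Use insights from the NTK to connect the effect of pruning to optimization dynamics.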
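
The scoring and masking steps above can be sketched on a toy problem. The following is a minimal illustration, not the authors' implementation: it assumes a quadratic loss L(θ) = ½ θᵀAθ − bᵀθ (so the Hessian is simply A), approximates the Hessian-gradient product with a central difference of gradients in place of the automatic differentiation a deep-learning framework would normally supply, and assumes the convention that the fraction p of weights with the highest scores is removed. All function names, the matrix A, and the vectors b and theta0 are hypothetical.

```python
# Toy GraSP-style scoring on a quadratic loss
# L(theta) = 0.5 * theta^T A theta - b^T theta, whose Hessian is A.
# All names and values are illustrative assumptions, not the paper's code.

def grad(A, b, theta):
    """Gradient of the toy loss: A @ theta - b."""
    return [sum(a * t for a, t in zip(row, theta)) - b_i
            for row, b_i in zip(A, b)]

def hessian_vector_product(A, b, theta, v, eps=1e-3):
    """Approximate H @ v with a central difference of gradients.

    Assumption: finite differences stand in for automatic differentiation;
    the Hessian itself is never formed explicitly.
    """
    plus = grad(A, b, [t + eps * x for t, x in zip(theta, v)])
    minus = grad(A, b, [t - eps * x for t, x in zip(theta, v)])
    return [(p - m) / (2 * eps) for p, m in zip(plus, minus)]

def grasp_scores(A, b, theta):
    """Per-weight score S(-theta) = -theta * (H g), taken elementwise."""
    g = grad(A, b, theta)
    Hg = hessian_vector_product(A, b, theta, g)
    return [-t * h for t, h in zip(theta, Hg)]

def prune_mask(scores, p):
    """Return a 0/1 mask that removes the fraction p with the highest scores."""
    n_remove = int(p * len(scores))
    by_score = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    removed = set(by_score[:n_remove])
    return [0 if i in removed else 1 for i in range(len(scores))]

# Hypothetical 4-weight problem; the off-diagonal entries of A couple
# the first two weights, so their scores depend on each other.
A = [[2.0, 1.0, 0.0, 0.0],
     [1.0, 2.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0],
     [0.0, 0.0, 0.0, 3.0]]
b = [1.0, 0.0, 0.0, 1.0]
theta0 = [1.0, -1.0, 2.0, 0.5]

scores = grasp_scores(A, b, theta0)
mask = prune_mask(scores, p=0.5)
print(scores, mask)
```

Because the score multiplies each weight by an entry of H g rather than by its own gradient alone, a weight's score changes when the weights it is coupled to change, which is one way to read the "dependencies between weights" mentioned earlier.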

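Experimental Results

Research Questions

RQ1: Can a network be pruned effectively at initialization, without first training the full dense model?
RQ2: At high sparsity, does preserving gradient flow during pruning improve trainability and final accuracy?
RQ3: How does GraSP compare with SNIP and other baselines on modern architectures and datasets?
RQ4: What roles do initialization and batch size play in GraSP's performance?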

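Key Findings

GraSP can prune up to 80% of the weights of VGG-16 on ImageNet at initialization, with only a 1.6% drop in top-1 accuracy.
GraSP consistently outperforms SNIP at extreme sparsity levels across CIFAR-10/100, Tiny-ImageNet, and ImageNet.
GraSP preserves gradient flow better than random pruning, and it often approaches or exceeds late-reset lottery tickets and some DST baselines.
At high sparsity, networks pruned with GraSP show faster loss reduction during training and better-preserved gradient norms than networks pruned with SNIP.
GraSP is consistent with NTK-based predictions: it encourages preserving the high-variance directions of the output-space gradient, which supports efficient optimization.
GraSP is robust to different initializations and batch sizes, especially under common initializations such as Kaiming.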
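
The gradient-norm findings above rest on a simple diagnostic: apply a pruning mask at initialization and measure the norm of the gradient with respect to the weights that remain. A minimal sketch of that measurement follows, assuming a toy quadratic loss L(θ) = ½ θᵀAθ − bᵀθ. The function names, the problem data, and the two hand-picked masks are hypothetical; the printed numbers describe this toy only and say nothing about the paper's experiments.

```python
import math

def grad(A, b, theta):
    """Gradient of the toy loss 0.5 * theta^T A theta - b^T theta."""
    return [sum(a * t for a, t in zip(row, theta)) - b_i
            for row, b_i in zip(A, b)]

def masked_grad_norm(A, b, theta, mask):
    """L2 norm of the gradient w.r.t. the kept weights after pruning.

    Pruned weights are set to zero before the gradient is evaluated,
    and their own gradient entries are excluded from the norm.
    """
    pruned = [t * m for t, m in zip(theta, mask)]
    g = grad(A, b, pruned)
    return math.sqrt(sum((g_i * m) ** 2 for g_i, m in zip(g, mask)))

# Hypothetical 4-weight problem and two hand-picked 50% masks.
A = [[2.0, 1.0, 0.0, 0.0],
     [1.0, 2.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0],
     [0.0, 0.0, 0.0, 3.0]]
b = [1.0, 0.0, 0.0, 1.0]
theta0 = [1.0, -1.0, 2.0, 0.5]

dense_norm = masked_grad_norm(A, b, theta0, [1, 1, 1, 1])
norm_a = masked_grad_norm(A, b, theta0, [0, 1, 1, 0])
norm_b = masked_grad_norm(A, b, theta0, [1, 0, 0, 1])
print(dense_norm, norm_a, norm_b)
```

Two masks with the same sparsity can leave very different gradient norms, which is why a criterion that looks at gradient flow can separate candidates that a sparsity budget alone cannot.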


This review was generated by AI and reviewed by human editors.