QUICK REVIEW

[Paper Review] Parameter Efficient Training of Deep Convolutional Neural Networks by Dynamic Sparse Reparameterization

Hesham Mostafa, Xin Wang | arXiv (Cornell University) | Feb 15, 2019

Machine Learning and Data Classification | 123 citations

TL;DR

The paper introduces a novel dynamic sparse reparameterization method to train deep CNNs with a fixed parameter budget, outperforming static and dynamic baselines and matching or exceeding accuracy of post-training compression in experiments on CIFAR-10 and ImageNet.

ABSTRACT

Modern deep neural networks are typically highly overparameterized. Pruning techniques are able to remove a significant fraction of network parameters with little loss in accuracy. Recently, techniques based on dynamic reallocation of non-zero parameters have emerged, allowing direct training of sparse networks without having to pre-train a large dense model. Here we present a novel dynamic sparse reparameterization method that addresses the limitations of previous techniques such as high computational cost and the need for manual configuration of the number of free parameters allocated to each layer. We evaluate the performance of dynamic reallocation methods in training deep convolutional networks and show that our method outperforms previous static and dynamic reparameterization methods, yielding the best accuracy for a fixed parameter budget, on par with accuracies obtained by iteratively pruning a pre-trained dense model. We further investigated the mechanisms underlying the superior generalization performance of the resultant sparse networks. We found that neither the structure, nor the initialization of the non-zero parameters were sufficient to explain the superior performance. Rather, effective learning crucially depended on the continuous exploration of the sparse network structure space during training. Our work suggests that exploring structural degrees of freedom during training is more effective than adding extra parameters to the network.

Motivation & Objective

Make the case for parameter-efficient training of deep CNNs under a fixed parameter budget.
Develop a dynamic sparse reparameterization method that reallocates non-zero parameters during training.
Benchmark against static sparse, dynamic reparameterization, and compression baselines across CNNs and datasets.
Investigate mechanisms behind the generalization gains from dynamic structural exploration during training.

Proposed method

Represent networks with sparse parameter tensors where non-zeros are optimized via gradient descent and their locations are reallocated during training.
Use a two-phase cycle of magnitude-based pruning and random growth to move free parameters within and across layers.
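Keep the total number of non-zero parameters fixed by pruning against a single global threshold H, which is adapted during training so that each reallocation step prunes roughly a target number of weights.
Redistribute the freed parameters across layers with a heuristic that gives more of them to layers retaining more non-zero weights after pruning, on the premise that such layers have larger loss gradients.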
Compare dynamic sparse reparameterization against full dense, thin dense, static sparse, compressed sparse, DeepR, SET, and HashedNet baselines on CIFAR-10 and ImageNet.
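
The prune-and-grow cycle described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the names SparseLayer, reallocate, active_count, n_target, and delta are hypothetical, and the review does not spell out the exact update or allocation rules, so the doubling and halving of H, the tolerance delta, zero-initialized regrowth, and the allocation of growth in proportion to each layer's surviving weights are all assumptions of this sketch.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SparseLayer:
    weights: List[float]  # flattened parameter tensor
    mask: List[bool]      # True where a trainable parameter currently lives


def active_count(layers: List[SparseLayer]) -> int:
    """Total number of active (budgeted) parameters across all layers."""
    return sum(sum(layer.mask) for layer in layers)


def reallocate(layers: List[SparseLayer], H: float, n_target: int,
               delta: float = 0.1,
               rng: Optional[random.Random] = None) -> float:
    """One prune-and-grow step; returns the threshold for the next step."""
    rng = rng or random.Random(0)

    # Phase 1: prune active weights whose magnitude is below the global H.
    pruned_sets, survivors = [], []
    for layer in layers:
        pruned = [i for i, (w, m) in enumerate(zip(layer.weights, layer.mask))
                  if m and abs(w) < H]
        pruned_sets.append(pruned)
        survivors.append(sum(layer.mask) - len(pruned))
        for i in pruned:
            layer.weights[i] = 0.0
            layer.mask[i] = False
    total_pruned = sum(len(p) for p in pruned_sets)

    # Adapt H so later steps prune close to n_target (assumed update rule).
    if total_pruned < (1 - delta) * n_target:
        H *= 2
    elif total_pruned > (1 + delta) * n_target:
        H /= 2

    # Phase 2: grow as many parameters as were pruned, keeping the budget
    # fixed. Quotas are proportional to surviving weights (assumed rule).
    free = [[i for i, m in enumerate(layer.mask) if not m] for layer in layers]
    total_surv = sum(survivors)
    quotas = [min(len(free[l]), total_pruned * survivors[l] // total_surv)
              if total_surv else 0 for l in range(len(layers))]
    # Hand out what rounding or full layers left over, densest layers first.
    leftover = total_pruned - sum(quotas)
    for l in sorted(range(len(layers)), key=lambda l: survivors[l], reverse=True):
        extra = min(leftover, len(free[l]) - quotas[l])
        quotas[l] += extra
        leftover -= extra

    for layer, slots, quota in zip(layers, free, quotas):
        for i in rng.sample(slots, quota):  # random growth positions
            layer.mask[i] = True
            layer.weights[i] = 0.0          # new weights start at zero (assumption)
    return H


# Toy usage: two tiny "layers" with six active parameters in total.
layers = [
    SparseLayer([0.5, -0.02, 0.0, 0.0, 0.9, 0.01],
                [True, True, False, False, True, True]),
    SparseLayer([0.03, 0.0, -0.7, 0.0], [True, False, True, False]),
]
before = active_count(layers)
H_next = reallocate(layers, H=0.05, n_target=3)
print(before, active_count(layers), H_next)  # 6 6 0.05
```

In a real training loop a step like this would run periodically between gradient updates, so that newly grown weights have a chance to move away from zero before the next pruning pass. The mask is what distinguishes a freshly grown zero-valued parameter from an inactive position.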

Experimental results

Research questions

RQ1: Can deep CNNs be effectively trained with a fixed budget of parameters using dynamic sparse reparameterization?
RQ2: Does adaptive cross-layer reallocation of non-zero weights during training improve generalization over static sparsity or post-training pruning?
RQ3: Is the dynamic exploration of network structure during training necessary for achieving high generalization, beyond final sparse structure or initialization?
RQ4: What are the emergent sparsity patterns across layers and blocks when using dynamic sparse training?

Key findings

Dynamic sparse training yields better generalization than static reparameterization at the same parameter budget, and often matches or surpasses post-training compression baselines.
Final sparsity patterns show that larger parameter tensors tend to become sparser and deeper layers tend to be sparser.
The approach incurs negligible computational overhead relative to competing dynamic methods and can reallocate parameters across layers automatically.
The superior performance stems from ongoing structural exploration during training rather than solely the final sparse structure or initialization.
Training still converges when dynamic reallocation is stopped after some initial epochs, indicating that early structural exploration is crucial.

This review was created by AI and reviewed by human editors.