[Paper Review] Training Skinny Deep Neural Networks with Iterative Hard Thresholding Methods
This paper proposes an iterative hard thresholding (IHT) method to train Skinny Deep Neural Networks (SDNNs) with significantly fewer parameters while improving generalization and reducing model size. The approach alternates between hard thresholding to prune low-magnitude connections and fine-tuning the remaining weights, followed by reactivating and retraining all connections, achieving state-of-the-art performance on CIFAR-10, CIFAR-100, MNIST, and ImageNet with up to 4× parameter reduction.
Deep neural networks have achieved remarkable success in a wide range of practical problems. However, due to their inherently large parameter space, deep models are notoriously prone to overfitting and difficult to deploy on portable devices with limited memory. In this paper, we propose an iterative hard thresholding (IHT) approach to train Skinny Deep Neural Networks (SDNNs). An SDNN has far fewer parameters yet can achieve competitive or even better performance than its full CNN counterpart. More concretely, the IHT approach trains an SDNN through the following two alternating phases: (I) perform hard thresholding to drop connections with small activations and fine-tune the other significant filters; (II) re-activate the frozen connections and train the entire network to improve its overall discriminative capability. We verify the superiority of SDNNs in terms of efficiency and classification performance on four benchmark object recognition datasets, including CIFAR-10, CIFAR-100, MNIST and ImageNet. Experimental results clearly demonstrate that IHT can be applied for training SDNNs based on various CNN architectures such as NIN and AlexNet.
Motivation & Objective
- To address the dual challenges of overfitting and high memory/computation cost in deep neural networks.
- To develop a method that reduces model size without sacrificing performance, especially under high compression rates.
- To improve generalization capability of compressed networks through iterative pruning and retraining.
- To enable efficient deployment of deep models on memory-constrained devices like mobile phones.
Proposed method
- The method alternates between two phases. Each round begins with hard thresholding, which keeps only the top-k weights by magnitude and zeroes out the rest.
- In Phase I, the network is fine-tuned on the remaining active connections to recover performance after pruning.
- In Phase II, previously frozen connections are reactivated and the entire network is jointly trained to improve representation learning.
- The process iteratively applies these two phases to progressively refine the sparse network structure.
- Hard thresholding is applied per layer, preserving only the most significant filters based on weight magnitude.
- The approach is applied to various architectures including NIN and AlexNet, with explicit size constraints enforced during training.
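The two-phase loop described above can be sketched on a toy least-squares model. This is a minimal illustration in NumPy, not the paper's implementation: the function names (`hard_threshold`, `descend`, `train_iht`), the linear model, the learning rate, the step counts, and the choice to end on a Phase I pass so that the returned weights are sparse are all assumptions made for this example. In a real CNN the same masking would be applied per layer.

```python
import numpy as np


def hard_threshold(w, k):
    """Return a boolean mask that keeps the k largest-magnitude entries of w."""
    mask = np.zeros(w.shape, dtype=bool)
    if k > 0:
        keep = np.argpartition(np.abs(w).ravel(), -k)[-k:]
        mask.flat[keep] = True
    return mask


def descend(w, X, y, mask, steps, lr):
    """Gradient descent on mean squared error; only masked-in weights move."""
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)
        w = w - lr * grad * mask
    return w


def train_iht(X, y, k, rounds=3, steps=200, lr=0.1, seed=0):
    """Alternate Phase I (prune + fine-tune) and Phase II (train everything).

    Assumes rounds >= 1. Ending on Phase I is an assumption of this sketch,
    made so that the returned weights have at most k nonzero entries.
    """
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.1, size=X.shape[1])
    everything = np.ones(w.shape, dtype=bool)
    for r in range(rounds):
        # Phase I: zero out small weights, fine-tune the survivors.
        mask = hard_threshold(w, k)
        w = descend(w * mask, X, y, mask, steps, lr)
        if r < rounds - 1:
            # Phase II: reactivate the frozen weights and train jointly.
            w = descend(w, X, y, everything, steps, lr)
    return w, mask


# Toy data (hypothetical): 20 features, only the first 5 carry signal.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 20))
true_w = np.zeros(20)
true_w[:5] = [3.0, -2.0, 1.5, -1.0, 2.0]
y = X @ true_w

w, mask = train_iht(X, y, k=5)
print("nonzero weights:", np.count_nonzero(w), "of", w.size)
```

Multiplying the gradient by the mask is one simple way to keep pruned connections frozen during Phase I; passing an all-true mask in Phase II lets the zeroed weights move again.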
Experimental results
Research questions
- RQ1: Can iterative hard thresholding improve generalization in deep neural networks while reducing model size?
- RQ2: Does pruning via hard thresholding followed by retraining yield better performance than standard pruning or regularization?
- RQ3: Can SDNNs trained with IHT achieve state-of-the-art accuracy with significantly fewer parameters?
- RQ4: How does the method scale across datasets of varying complexity, such as MNIST, CIFAR-10/100, and ImageNet?
- RQ5: Does the IHT-based training strategy maintain or improve performance even at high compression ratios?
Key findings
- On CIFAR-10, SDNN-2× reduced error rate by 2.42% compared to NIN while using only half the parameters.
- On CIFAR-100, SDNN-2× achieved a 5.18% lower error rate than NIN with data augmentation and 3.19% without, despite a smaller model size.
- On MNIST, SDNN-2× achieved 0.19% error rate with only 0.18M parameters, outperforming NIN (0.47% error with 0.35M parameters).
- On ImageNet, SDNN-2× reduced top-5 error rate by 1.66% compared to the baseline AlexNet, while reducing parameters by 50%.
- On ImageNet, SDNN-4× (15M parameters) achieved a 0.81% lower error rate than the baseline AlexNet, outperforming prior pruning methods at the same size.
- The method consistently improved performance across all datasets and architectures, even at high compression ratios, demonstrating superior generalization.
This review was created by AI and reviewed by human editors.