[Paper Review] Multi-Scale Dense Networks for Resource Efficient Image Classification
MSDNet is a single CNN that combines multi-scale feature maps, dense connectivity, and intermediate classifiers to support anytime and budgeted batch image classification, reusing computation across classifiers while keeping accuracy high under test-time resource constraints.
In this paper we investigate image classification with computational resource limits at test time. Two such settings are: 1. anytime classification, where the network's prediction for a test example is progressively updated, facilitating the output of a prediction at any time; and 2. budgeted batch classification, where a fixed amount of computation is available to classify a set of examples that can be spent unevenly across "easier" and "harder" inputs. In contrast to most prior work, such as the popular Viola and Jones algorithm, our approach is based on convolutional neural networks. We train multiple classifiers with varying resource demands, which we adaptively apply during test time. To maximally re-use computation between the classifiers, we incorporate them as early-exits into a single deep convolutional neural network and inter-connect them with dense connectivity. To facilitate high quality classification early on, we use a two-dimensional multi-scale network architecture that maintains coarse and fine level features all-throughout the network. Experiments on three image-classification tasks demonstrate that our framework substantially improves the existing state-of-the-art in both settings.
Motivation & Objective
- Motivate resource-constrained image classification at test time (anytime and budgeted batch scenarios).
- Develop a single CNN architecture enabling adaptive computation without retraining for different budgets.
- Ensure early-exit classifiers reuse computation while maintaining high final accuracy through dense connectivity and multi-scale features.
Proposed method
- Introduce a cascade of intermediate classifiers connected via dense connections to reuse features across classifiers.
- Employ a two-dimensional, multi-scale architecture that maintains coarse and fine features throughout the network.
- Attach classifiers only to the coarsest scale and train with a weighted sum of cross-entropy losses.
- Use dense connectivity to prevent early exits from degrading later classifier performance.
- Adopt lazy evaluation and scale-aware network reduction to further reduce computation.
- Train the network end-to-end; at test time, per-classifier confidence thresholds chosen to fit the computational budget decide where each example exits.
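The training loss and the thresholded exit rule listed above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the per-exit `weights`, and the `thresholds` values are hypothetical, and confidence is taken to be the maximum softmax probability.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """Cross-entropy of one example given its true class index."""
    return -math.log(softmax(logits)[label])

def weighted_exit_loss(exit_logits, label, weights):
    """Weighted sum of the cross-entropy losses of all intermediate classifiers.

    exit_logits: one logit vector per classifier (hypothetical layout).
    weights: one non-negative weight per classifier (assumed, e.g. all 1.0).
    """
    return sum(w * cross_entropy(z, label) for w, z in zip(weights, exit_logits))

def early_exit_predict(exit_logits, thresholds):
    """Return (predicted_class, exit_index) for one example.

    exit_logits may be a generator, so later classifiers are only evaluated
    if earlier ones are not confident enough. thresholds holds one value per
    classifier except the last, which always returns a prediction.
    """
    result = None
    for k, logits in enumerate(exit_logits):
        probs = softmax(logits)
        confidence = max(probs)
        result = (probs.index(confidence), k)
        if k < len(thresholds) and confidence >= thresholds[k]:
            return result  # confident enough: stop computing here
    return result

if __name__ == "__main__":
    # Made-up logits for three classifiers on a 3-class problem.
    exits = [[0.1, 0.2, 0.0], [2.0, 0.0, 0.0], [5.0, 0.0, 0.0]]
    print(weighted_exit_loss(exits, label=0, weights=[1.0, 1.0, 1.0]))
    print(early_exit_predict(iter(exits), thresholds=[0.8, 0.8]))
```

Passing a generator as `exit_logits` keeps the evaluation lazy: once an exit fires, the deeper classifiers are never computed. Lower thresholds push more examples out early and reduce computation, while higher thresholds send more examples to the deeper classifiers.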
Experimental results
Research questions
- RQ1: Can a single CNN architecture support adaptive computation for anytime prediction and budgeted batch classification without retraining?
- RQ2: Do dense connectivity and multi-scale feature maps enable effective early exits without harming the final classifier?
- RQ3: How does MSDNet perform under strict computational budgets compared with state-of-the-art CNNs and ensembles?
- RQ4: What are the trade-offs between keeping multiple scales versus computational cost in resource-constrained inference?
Key findings
- MSDNet substantially outperforms ResNet and DenseNet ensembles at all budgets in anytime prediction on ImageNet and CIFAR-100.
- MSDNet achieves ~4–8% higher accuracy than baselines at budgets of 0.1×10^10–0.3×10^10 FLOPs on ImageNet.
- With an average budget of 1.7×10^9 FLOPs, MSDNet reaches ~75% top-1 accuracy on ImageNet, about 6% higher than a ResNet with the same budget.
- MSDNet uses 2–3× fewer FLOPs than DenseNets to achieve the same accuracy in budgeted batch classification on ImageNet.
- Early exits are effective when combined with dense connectivity and multi-scale features: the accuracy of the final classifier becomes largely independent of where an intermediate classifier is attached.
- MSDNet can match the performance of a deep ensemble at a fraction of the computation and supports precise budget control across easy and hard images.
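An "average budget" in the budgeted batch setting is a per-example expectation: easy images leave at cheap early classifiers, hard ones at expensive late classifiers. The sketch below shows that accounting under assumptions chosen purely for illustration: the cumulative FLOP costs are invented, and each classifier is assumed to stop a fixed share `q` of the examples that reach it. All function names are hypothetical.

```python
def exit_fractions(q, num_exits):
    """Share of examples leaving at each exit, normalized to sum to 1.

    Assumes every exit stops the same fraction q of the examples reaching it.
    """
    raw = [(1.0 - q) ** k * q for k in range(num_exits)]
    total = sum(raw)
    return [r / total for r in raw]

def average_cost(fractions, cumulative_flops):
    """Expected FLOPs per example given where examples exit.

    cumulative_flops[k] is the cost of running the network up to exit k.
    """
    return sum(f * c for f, c in zip(fractions, cumulative_flops))

def pick_exit_rate(cumulative_flops, budget_per_example, steps=1000):
    """Smallest q on a uniform grid whose expected cost fits the budget.

    A larger q sends more examples out early, so expected cost falls as q
    grows; the smallest feasible q therefore spends the most computation
    that the budget allows. Returns None if no grid value fits.
    """
    n = len(cumulative_flops)
    for i in range(1, steps):
        q = i / steps
        if average_cost(exit_fractions(q, n), cumulative_flops) <= budget_per_example:
            return q
    return None

if __name__ == "__main__":
    # Invented cumulative costs for a four-exit network, in FLOPs.
    costs = [0.5e9, 1.0e9, 2.0e9, 4.0e9]
    q = pick_exit_rate(costs, budget_per_example=1.7e9)
    print("exit rate:", q)
    if q is not None:
        print("exit shares:", exit_fractions(q, len(costs)))
```

Once the target share per exit is fixed, the confidence thresholds can be set on held-out data so that roughly that share of examples clears each threshold. That calibration step is not shown here.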
This review was created by AI and reviewed by human editors.