[Paper Review] NAS evaluation is frustratingly hard
The paper benchmarks 8 NAS methods across 5 datasets and introduces a relative-improvement metric over randomly sampled architectures, which separates search performance from the effects of training protocol and search-space design. It finds that many methods offer little improvement over the average-architecture baseline and that the training protocol often dominates final accuracy.
Neural Architecture Search (NAS) is an exciting new field which promises to be as much as a game-changer as Convolutional Neural Networks were in 2012. Despite many great works leading to substantial improvements on a variety of tasks, comparison between different methods is still very much an open issue. While most algorithms are tested on the same datasets, there is no shared experimental protocol followed by all. As such, and due to the under-use of ablation studies, there is a lack of clarity regarding why certain methods are more effective than others. Our first contribution is a benchmark of $8$ NAS methods on $5$ datasets. To overcome the hurdle of comparing methods with different search spaces, we propose using a method's relative improvement over the randomly sampled average architecture, which effectively removes advantages arising from expertly engineered search spaces or training protocols. Surprisingly, we find that many NAS techniques struggle to significantly beat the average architecture baseline. We perform further experiments with the commonly used DARTS search space in order to understand the contribution of each component in the NAS pipeline. These experiments highlight that: (i) the use of tricks in the evaluation protocol has a predominant impact on the reported performance of architectures; (ii) the cell-based search space has a very narrow accuracy range, such that the seed has a considerable impact on architecture rankings; (iii) the hand-designed macro-structure (cells) is more important than the searched micro-structure (operations); and (iv) the depth-gap is a real phenomenon, evidenced by the change in rankings between $8$ and $20$ cell architectures. To conclude, we suggest best practices, that we hope will prove useful for the community and help mitigate current NAS pitfalls. The code used is available at https://github.com/antoyang/NAS-Benchmark.
Motivation & Objective
- Assess whether NAS search strategies outperform randomly sampled architectures when controlling for training protocol and search space.
- Quantify the impact of training tricks and protocols on NAS performance.
- Investigate the contribution of search space, macro-structure, and seed on architecture ranking.
Proposed method
- Benchmark 8 NAS methods (DARTS, StacNAS, PDARTS, MANAS, CNAS, NSGANET, ENAS, NAO) on 5 datasets (CIFAR10, CIFAR100, SPORT8, MIT67, FLOWERS102).
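- Randomly sample 8 architectures from each method's search space and compare them with 8 architectures found by that method, all trained under the same protocol; compute the relative improvement $RI = 100 \times (Acc_m - Acc_r) / Acc_r$, where $Acc_m$ is the accuracy of the method's architectures and $Acc_r$ is the accuracy of the randomly sampled ones.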
- Use the average architecture from the respective search space as the baseline for RI.
- Analyze the effect of training protocols by comparing simple vs. augmented training approaches on CIFAR10 with DARTS space.
- Examine the DARTS search space with ablations on operations, macro-structure, seeds, and number of cells.
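The relative-improvement metric above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes Acc_m and Acc_r are taken as mean top-1 accuracies (in percent) over the 8 runs, the function name `relative_improvement` is hypothetical, and the accuracy values are made up for the example.

```python
from statistics import mean


def relative_improvement(method_accs, random_accs):
    """Return RI = 100 * (Acc_m - Acc_r) / Acc_r.

    Assumption: Acc_m and Acc_r are the mean accuracies (in percent) of the
    architectures found by the method and of the randomly sampled ones.
    """
    acc_m = mean(method_accs)
    acc_r = mean(random_accs)
    if acc_r == 0:
        raise ValueError("random-baseline accuracy must be non-zero")
    return 100.0 * (acc_m - acc_r) / acc_r


# Made-up accuracies for illustration only (8 runs each); not results from the paper.
found_by_method = [97.1, 96.9, 97.3, 97.0, 96.8, 97.2, 97.1, 97.0]
randomly_sampled = [96.7, 96.9, 96.5, 96.8, 96.6, 97.0, 96.7, 96.8]

ri = relative_improvement(found_by_method, randomly_sampled)
print(f"RI = {ri:.2f}%")
```

Because both groups are trained with the same protocol and drawn from the same search space, a value of RI near zero (or below it) means the search step added little beyond what the space and the training tricks already provide.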
Experimental results
Research questions
- RQ1: How much do NAS methods improve over randomly sampled architectures within the same search space and training protocol?
- RQ2: How much do training protocols influence final accuracy relative to architecture choice?
- RQ3: What is the effect of seed and depth (number of cells) on architecture rankings in NAS?
- RQ4: Does the hand-designed macro-structure (cells) matter more for NAS performance than the searched micro-structure (operations)?
- RQ5: Does the choice of search space limit the ability to find superior architectures across datasets?
Key findings
- Most NAS methods offer only small improvements over random sampling; some results are even below the average random architecture baseline.
- Training protocol differences can yield larger accuracy gains than architecture choices, with substantial improvements from tricks like Cutout, DropPath, AutoAugment, and longer training.
- Within the DARTS search space, randomly sampled architectures cluster in a very narrow accuracy range, so the random seed alone has a considerable impact on architecture rankings.
- The hand-designed macro-structure (cells) has more impact on final accuracy than the searched micro-structure (operations).
- The depth gap is a real phenomenon: architecture rankings change between 8-cell and 20-cell networks, so a ranking obtained at one depth does not reliably carry over to the other.
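One way to make "rankings change" concrete is to compute a rank correlation between the accuracies of the same architectures under two settings (two seeds, or 8 vs. 20 cells). The sketch below uses Kendall's tau (the simple tau-a variant, written in plain Python); this is an illustrative choice and not necessarily the statistic used in the paper. The function name `kendall_tau` and the accuracy values are hypothetical.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two score lists for the same architectures.

    +1 means identical rankings, -1 means fully reversed rankings.
    Tied pairs count as neither concordant nor discordant.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both settings must score the same architectures")
    n = len(scores_a)
    if n < 2:
        raise ValueError("need at least two architectures to compare rankings")
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        # Positive product: the pair is ordered the same way in both settings.
        sign = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) // 2)


# Made-up accuracies of five architectures at two depths; not results from the paper.
acc_8_cells = [96.4, 96.9, 96.6, 97.0, 96.5]
acc_20_cells = [97.2, 97.0, 97.3, 97.1, 96.9]

tau = kendall_tau(acc_8_cells, acc_20_cells)
print(f"Kendall tau between 8-cell and 20-cell rankings: {tau:.2f}")
```

A tau well below 1 indicates that the ordering of architectures is not preserved across the two settings. With accuracy gaps as narrow as those described above, the same check applied to two seeds of one setting shows how much of a ranking is noise.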
This review was created by AI and reviewed by human editors.