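[Paper Summary] NAS evaluation is frustratingly hard
The paper benchmarks 8 NAS methods on 5 datasets and introduces a relative-improvement metric over randomly sampled architectures, which separates search performance from the training protocol and search-space design. It finds that many methods improve little over the average-architecture baseline and that the training protocol often dominates final accuracy.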
Neural Architecture Search (NAS) is an exciting new field which promises to be as much as a game-changer as Convolutional Neural Networks were in 2012. Despite many great works leading to substantial improvements on a variety of tasks, comparison between different methods is still very much an open issue. While most algorithms are tested on the same datasets, there is no shared experimental protocol followed by all. As such, and due to the under-use of ablation studies, there is a lack of clarity regarding why certain methods are more effective than others. Our first contribution is a benchmark of $8$ NAS methods on $5$ datasets. To overcome the hurdle of comparing methods with different search spaces, we propose using a method's relative improvement over the randomly sampled average architecture, which effectively removes advantages arising from expertly engineered search spaces or training protocols. Surprisingly, we find that many NAS techniques struggle to significantly beat the average architecture baseline. We perform further experiments with the commonly used DARTS search space in order to understand the contribution of each component in the NAS pipeline. These experiments highlight that: (i) the use of tricks in the evaluation protocol has a predominant impact on the reported performance of architectures; (ii) the cell-based search space has a very narrow accuracy range, such that the seed has a considerable impact on architecture rankings; (iii) the hand-designed macro-structure (cells) is more important than the searched micro-structure (operations); and (iv) the depth-gap is a real phenomenon, evidenced by the change in rankings between $8$ and $20$ cell architectures. To conclude, we suggest best practices, that we hope will prove useful for the community and help mitigate current NAS pitfalls. The code used is available at https://github.com/antoyang/NAS-Benchmark.
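Motivation and Objectives
- Assess whether NAS search strategies outperform randomly sampled architectures when the training protocol and search space are controlled.
- Quantify the effect of training tricks and protocols on NAS performance.
- Study how the search space, macro-structure, and random seed contribute to architecture rankings.
Proposed Method
- Benchmark 8 NAS methods (DARTS, StacNAS, PDARTS, MANAS, CNAS, NSGANET, ENAS, NAO) on 5 datasets (CIFAR10, CIFAR100, SPORT8, MIT67, FLOWERS102).
- Under the same training protocol, randomly sample 8 architectures, compare them with the 8 architectures found by each method, and compute the relative improvement RI = 100 * (Acc_m - Acc_r) / Acc_r.
- Use the average architecture of each method's own search space as the baseline for RI.
- Analyze the effect of the training protocol by comparing simple training with augmented training in the DARTS search space on CIFAR10.
- Ablate the DARTS search space to examine the effects of operations, macro-structure, seed, and number of cells.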
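The relative improvement formula above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's code: it assumes Acc_m and Acc_r are the mean top-1 accuracies of the 8 searched and 8 randomly sampled architectures, and the function name `relative_improvement` and all accuracy values are hypothetical.

```python
from statistics import mean


def relative_improvement(method_accs, random_accs):
    """Return RI = 100 * (Acc_m - Acc_r) / Acc_r.

    Assumption: Acc_m and Acc_r are taken as the mean accuracy of the
    architectures found by the method and of the randomly sampled ones.
    """
    if not method_accs or not random_accs:
        raise ValueError("both accuracy lists must be non-empty")
    acc_m = mean(method_accs)
    acc_r = mean(random_accs)
    return 100.0 * (acc_m - acc_r) / acc_r


# Hypothetical accuracies (%), invented purely for illustration.
method_accs = [97.0, 97.2, 96.8, 97.1, 97.3, 96.9, 97.0, 97.1]
random_accs = [96.5, 96.7, 96.4, 96.6, 96.8, 96.3, 96.5, 96.6]

ri = relative_improvement(method_accs, random_accs)
print(f"RI = {ri:.2f}%")
```

Because both sets of architectures come from the same search space and are trained with the same protocol, a small RI indicates that the search strategy itself adds little beyond what the search space and training recipe already provide. A negative RI means the method did worse than random sampling.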
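Experimental Results
Research Questions
- RQ1: How much do NAS methods improve over randomly sampled architectures under the same search space and training protocol?
- RQ2: How does the training protocol affect final accuracy relative to the choice of architecture?
- RQ3: How do the seed and depth (number of cells) affect architecture rankings in NAS?
- RQ4: Does the macro-structure (cell-level wiring) affect NAS performance more than the micro-level operations?
- RQ5: Does the choice of search space limit the ability to find better architectures across datasets?
Key Findings
- Most NAS methods improve only slightly over random sampling; some results even fall below the average random-architecture baseline.
- Differences in training protocol often yield larger accuracy gains than the choice of architecture: tricks such as Cutout, DropPath, and AutoAugment, together with longer training, bring significant improvements.
- Within the DARTS search space, randomly sampled architectures cluster tightly in performance, and the seed and number of cells significantly affect rankings (not just the final architecture).
- The network's macro-structure matters more for final accuracy than the specific operations.
- The depth gap (8 vs. 20 cells) substantially changes architecture rankings, indicating instability in weight-sharing NAS settings.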
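One way to quantify how much a ranking changes between two settings (for example, two seeds, or 8 vs. 20 cells) is a rank correlation such as Kendall's tau. The sketch below is an illustrative assumption and not necessarily the statistic used in the paper. The function `kendall_tau` is a plain tau-a implementation, and the accuracy values are hypothetical.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall tau-a between two score lists for the same architectures.

    +1 means identical rankings, -1 means fully reversed rankings.
    Tied pairs count as neither concordant nor discordant.
    """
    n = len(scores_a)
    if n != len(scores_b) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        direction = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1
    n_pairs = n * (n - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical accuracies (%) of the same 5 architectures at two depths.
acc_8_cells = [96.1, 96.4, 96.0, 96.3, 96.2]
acc_20_cells = [97.0, 96.9, 97.1, 97.2, 96.8]

tau = kendall_tau(acc_8_cells, acc_20_cells)
print(f"Kendall tau between 8-cell and 20-cell rankings: {tau:.2f}")
```

A tau close to 1 would mean the ranking is preserved across the two settings, while a value near 0 would mean the ranking at one depth (or seed) says little about the ranking at the other.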
This summary was generated by AI and reviewed by human editors.