QUICK REVIEW

[Paper Review] NAS evaluation is frustratingly hard

Antoine Yang, Pedro M. Esperança|arXiv (Cornell University)|2019. 12. 28.

Molecular Biology Techniques and Applications | References: 30 | Citations: 110

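One-line summary

This paper benchmarks 8 NAS methods on 5 datasets and introduces a relative-improvement metric, measured against randomly sampled architectures, to separate search performance from the training protocol and search-space design. Many methods fail to improve substantially over the average-architecture baseline, and the training protocol often dominates final accuracy.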

ABSTRACT

Neural Architecture Search (NAS) is an exciting new field which promises to be as much as a game-changer as Convolutional Neural Networks were in 2012. Despite many great works leading to substantial improvements on a variety of tasks, comparison between different methods is still very much an open issue. While most algorithms are tested on the same datasets, there is no shared experimental protocol followed by all. As such, and due to the under-use of ablation studies, there is a lack of clarity regarding why certain methods are more effective than others. Our first contribution is a benchmark of $8$ NAS methods on $5$ datasets. To overcome the hurdle of comparing methods with different search spaces, we propose using a method's relative improvement over the randomly sampled average architecture, which effectively removes advantages arising from expertly engineered search spaces or training protocols. Surprisingly, we find that many NAS techniques struggle to significantly beat the average architecture baseline. We perform further experiments with the commonly used DARTS search space in order to understand the contribution of each component in the NAS pipeline. These experiments highlight that: (i) the use of tricks in the evaluation protocol has a predominant impact on the reported performance of architectures; (ii) the cell-based search space has a very narrow accuracy range, such that the seed has a considerable impact on architecture rankings; (iii) the hand-designed macro-structure (cells) is more important than the searched micro-structure (operations); and (iv) the depth-gap is a real phenomenon, evidenced by the change in rankings between $8$ and $20$ cell architectures. To conclude, we suggest best practices, that we hope will prove useful for the community and help mitigate current NAS pitfalls. The code used is available at https://github.com/antoyang/NAS-Benchmark.

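Motivation and objectives

Assess whether NAS search strategies outperform architectures sampled at random from the same search space and trained with the same protocol.
Quantify the effect of training tricks and protocols on NAS performance.
Examine how the search space, macro-structure, and seed contribute to architecture rankings.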

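Proposed method

Benchmark 8 NAS methods (DARTS, StacNAS, PDARTS, MANAS, CNAS, NSGANET, ENAS, NAO) on 5 datasets (CIFAR10, CIFAR100, SPORT8, MIT67, FLOWERS102).
Under the same training protocol, compare 8 architectures found by each method with 8 randomly sampled architectures, and compute the relative improvement RI = 100*(Acc_m - Acc_r)/Acc_r, where Acc_m is the top-1 accuracy obtained by the search method and Acc_r that obtained by random sampling.
Use the average architecture of each search space as the baseline for RI.
Analyze the effect of the training protocol by comparing a simple and an extended training procedure in the DARTS space on CIFAR10.
Run ablations in the DARTS search space that vary the operations, macro-structure, seed, and number of cells.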
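
The RI formula above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `relative_improvement` is hypothetical, averaging each group of 8 runs before taking the ratio is an assumption about how Acc_m and Acc_r are aggregated, and the accuracy values are made up purely to show the arithmetic.

```python
from statistics import mean


def relative_improvement(method_accs, random_accs):
    """Return RI = 100 * (Acc_m - Acc_r) / Acc_r.

    Acc_m and Acc_r are taken here as the mean top-1 accuracy (in %) of the
    searched and the randomly sampled architectures (an assumption).
    """
    acc_m = mean(method_accs)
    acc_r = mean(random_accs)
    if acc_r == 0:
        raise ValueError("random-sampling accuracy must be non-zero")
    return 100.0 * (acc_m - acc_r) / acc_r


# Made-up accuracies for 8 searched and 8 random architectures (illustration only).
searched = [97.0, 97.2, 96.8, 97.0, 97.1, 96.9, 97.0, 97.0]
sampled = [96.5] * 8

ri = relative_improvement(searched, sampled)
print(f"RI = {ri:.2f}%")
```

Because RI is a ratio against the random baseline of the same search space and protocol, a method that reports high absolute accuracy can still have an RI near zero if random architectures in that space do almost as well.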

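Experimental results

Research questions

RQ1: Within the same search space and training protocol, how much do NAS methods improve over randomly sampled architectures?
RQ2: How does the training protocol affect final accuracy compared with the choice of architecture?
RQ3: How do the seed and depth (number of cells) affect NAS architecture rankings?
RQ4: Do macro-structure decisions (how cells are wired together) affect NAS performance more than micro-level operations?
RQ5: Is the ability to find good architectures across datasets limited by the choice of search space?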

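Key findings

Most NAS methods offer only small improvements over random sampling, and some results may fall below the average random-architecture baseline.
Differences in training protocol can yield larger accuracy gains than the choice of architecture; tricks such as Cutout, DropPath, and AutoAugment, together with longer training, bring substantial improvements.
Architectures sampled at random from the DARTS space have tightly clustered accuracies, and the seed and number of cells substantially affect their rankings.
The network's macro-structure affects final accuracy more than the specific operations do.
The depth gap (8 vs. 20 cells) materially changes architecture rankings, suggesting instability in weight-sharing NAS setups.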
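
One way to quantify how much rankings shift between two settings (two seeds, or 8 vs. 20 cells) is a rank correlation such as Kendall's tau. The sketch below is an illustration under assumptions: the choice of Kendall's tau-a is ours and is not claimed to be the statistic used in the paper, the function name `kendall_tau` is hypothetical, and the accuracies are invented for the example.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two score lists for the same architectures.

    +1 means identical rankings, -1 means fully reversed rankings.
    Tied pairs count as neither concordant nor discordant.
    """
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long lists with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(len(scores_a)), 2):
        sign = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(scores_a) * (len(scores_a) - 1) // 2
    return (concordant - discordant) / n_pairs


# Invented accuracies of 4 architectures at two depths (illustration only).
acc_8_cells = [97.1, 97.0, 96.9, 96.8]
acc_20_cells = [97.0, 97.2, 96.7, 96.9]

tau = kendall_tau(acc_8_cells, acc_20_cells)
print(f"Kendall tau between 8-cell and 20-cell rankings: {tau:.2f}")
```

When accuracies are as tightly clustered as the findings describe, small per-run fluctuations are enough to swap neighbours in the ranking, which shows up as a tau well below 1.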


This review was generated by AI and reviewed by a human editor.