[Paper Review] Neural Architecture Optimization
NAO learns continuous embeddings of architectures via an encoder-predictor-decoder trio and optimizes architectures with gradient steps in embedding space, yielding competitive NAS results with reduced computation.
Automatic neural architecture design has shown its potential in discovering powerful neural network architectures. Existing methods, whether based on reinforcement learning or evolutionary algorithms (EA), conduct architecture search in a discrete space, which is highly inefficient. In this paper, we propose a simple and efficient method for automatic neural architecture design based on continuous optimization. We call this new approach neural architecture optimization (NAO). There are three key components in our proposed approach: (1) An encoder embeds/maps neural network architectures into a continuous space. (2) A predictor takes the continuous representation of a network as input and predicts its accuracy. (3) A decoder maps a continuous representation of a network back to its architecture. The performance predictor and the encoder enable us to perform gradient-based optimization in the continuous space to find the embedding of a new architecture with potentially better accuracy. Such a better embedding is then decoded to a network by the decoder. Experiments show that the architecture discovered by our method is very competitive for the image classification task on CIFAR-10 and the language modeling task on PTB, outperforming or on par with the best results of previous architecture search methods with a significant reduction in computational resources. Specifically, we obtain a 1.93% test set error rate for the CIFAR-10 image classification task and a test set perplexity of 56.0 for the PTB language modeling task. Furthermore, combined with the recently proposed weight sharing mechanism, we discover powerful architectures on CIFAR-10 (with error rate 2.93%) and on PTB (with test set perplexity 56.6), with very limited computational resources (less than 10 GPU hours) for both tasks.
Motivation & Objective
- Improve the search efficiency of automatic neural architecture design over discrete-space RL/EA methods.
- Propose a continuous-space NAS framework (NAO) to embed, predict, and decode architectures.
- Show that gradient-based optimization in embedding space can yield architectures with strong performance and transferable results.
Proposed method
- Encode neural architectures into a continuous embedding using a one-layer LSTM encoder.
- Predict architecture performance with a regression model trained on dev-set accuracy.
- Decode embeddings back to discrete architecture strings with an attention-based LSTM decoder.
- Optimize embeddings by gradient ascent on the predictor output to obtain new embeddings likely yielding better architectures.
- Train encoder, predictor, and decoder jointly with a multi-task objective combining prediction loss and architecture-reconstruction loss.
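The optimization step in the list above can be illustrated with a minimal sketch. It is not the paper's implementation: the learned predictor is replaced here by a toy sigmoid-of-linear-score function, the embedding is a single short vector rather than a sequence of LSTM hidden states, and the names `predict`, `predictor_gradient`, `ascend`, `weights`, `bias`, and `step_size` are all hypothetical. It only shows the update e <- e + step_size * df/de that moves an embedding toward higher predicted accuracy.

```python
import math


def predict(embedding, weights, bias):
    """Toy stand-in for the learned predictor: sigmoid of a linear score."""
    score = sum(w * e for w, e in zip(weights, embedding)) + bias
    return 1.0 / (1.0 + math.exp(-score))


def predictor_gradient(embedding, weights, bias):
    """Analytic gradient of the toy predictor with respect to the embedding."""
    p = predict(embedding, weights, bias)
    return [p * (1.0 - p) * w for w in weights]


def ascend(embedding, weights, bias, step_size=1.0, num_steps=1):
    """Gradient ascent on the predictor output: e <- e + step_size * df/de."""
    current = list(embedding)
    for _ in range(num_steps):
        grad = predictor_gradient(current, weights, bias)
        current = [e + step_size * g for e, g in zip(current, grad)]
    return current


# Hypothetical values, chosen only for illustration.
weights = [0.5, -1.0, 2.0]
bias = -0.2
start = [0.1, 0.4, -0.3]

improved = ascend(start, weights, bias, step_size=1.0, num_steps=5)
print("predicted score before:", predict(start, weights, bias))
print("predicted score after: ", predict(improved, weights, bias))
```

In the method as summarized here, the improved embedding would then be passed to the decoder to recover a discrete architecture, which is trained and evaluated to check whether the predicted gain is real.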
Experimental results
Research questions
- RQ1: Can continuous embeddings of discrete architectures enable efficient gradient-based optimization for NAS?
- RQ2: How well can an encoder-predictor-decoder trio predict and improve architecture performance across CIFAR-10, PTB, and transfer tasks?
- RQ3: Does NAO produce architectures competitive with or superior to prior NAS methods while reducing computational resources?
- RQ4: Is the discovered architecture transferable to other datasets (CIFAR-100, ImageNet, WikiText-2)?
Key findings
- NAO discovers architectures achieving 1.93% test error on CIFAR-10 (with cutout) and 56.0 perplexity on PTB, competitive with or better than prior NAS methods.
- With weight sharing, NAO reaches 2.93% error on CIFAR-10 and 56.6 perplexity on PTB using under 10 GPU-hours.
- Transferring NAO-found architectures to CIFAR-100 and ImageNet yields strong results (CIFAR-100: 14.75% error; ImageNet: 25.7% top-1 error).
- NAO+weight sharing can find competitive architectures with fewer evaluated models (e.g., 1,000 vs. 20,000 in the paper's comparison tables).
- The performance predictor, built on the encoder's embeddings, achieves >78% pairwise accuracy with ~500 training architectures; the decoder almost exactly recovers architectures (average Hamming distance < 0.5 tokens).
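The two quality metrics in the last finding can be made concrete with a short sketch. Pairwise accuracy is taken here to mean the fraction of architecture pairs whose predicted ordering matches their true ordering, and Hamming distance counts the positions where the decoded token sequence differs from the original. The tie handling (pairs with equal true scores are skipped), the function names, and the sample scores and tokens are assumptions made for illustration, not details taken from the paper.

```python
from itertools import combinations


def pairwise_accuracy(predicted, actual):
    """Fraction of pairs ranked in the same order by predicted and actual scores.

    Pairs whose actual scores are tied are skipped (an assumption of this sketch).
    """
    pairs = [
        (i, j)
        for i, j in combinations(range(len(actual)), 2)
        if actual[i] != actual[j]
    ]
    if not pairs:
        raise ValueError("need at least one pair with distinct actual scores")
    correct = sum(
        1
        for i, j in pairs
        if (predicted[i] - predicted[j]) * (actual[i] - actual[j]) > 0
    )
    return correct / len(pairs)


def hamming_distance(tokens_a, tokens_b):
    """Number of positions at which two equal-length token sequences differ."""
    if len(tokens_a) != len(tokens_b):
        raise ValueError("sequences must have the same length")
    return sum(1 for a, b in zip(tokens_a, tokens_b) if a != b)


# Hypothetical dev-set accuracies and predictor outputs for four architectures.
actual_scores = [0.90, 0.92, 0.95, 0.91]
predicted_scores = [0.50, 0.70, 0.60, 0.55]
accuracy = pairwise_accuracy(predicted_scores, actual_scores)

# Hypothetical architecture strings before encoding and after decoding.
original = ["node1", "conv3x3", "node2", "sep5x5"]
decoded = ["node1", "conv3x3", "node2", "maxpool"]
distance = hamming_distance(original, decoded)

print(accuracy, distance)
```

A ranking metric suits this setting because the gradient step only needs the predictor to order architectures correctly, not to predict their absolute accuracy.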
This review was created by AI and reviewed by human editors.