[Paper Review] Learning to Learn without Gradient Descent by Gradient Descent
The paper trains recurrent neural network optimizers on synthetic functions to perform fast, transferable black-box optimization, rivaling and sometimes surpassing Bayesian optimization methods in various settings, including hyper-parameter tuning and control tasks.
We learn recurrent neural network optimizers trained on simple synthetic functions by gradient descent. We show that these learned optimizers exhibit a remarkable degree of transfer in that they can be used to efficiently optimize a broad range of derivative-free black-box functions, including Gaussian process bandits, simple control objectives, global optimization benchmarks and hyper-parameter tuning tasks. Up to the training horizon, the learned optimizers learn to trade-off exploration and exploitation, and compare favourably with heavily engineered Bayesian optimization packages for hyper-parameter tuning.
Motivation & Objective
- Make the case for fast, general-purpose black-box optimizers as an alternative to Bayesian optimization.
- Develop meta-learned optimizers that learn exploration-exploitation trade-offs.
- Demonstrate transfer of learned optimizers to derivative-free problems across domains.
- Show computational gains over standard Bayesian optimization (BO) packages within the training horizon.
Proposed method
- Model a black-box optimizer as an RNN with shared parameters that updates its hidden state and proposes the next query point.
- Train the RNN by backpropagating through time using a loss that sums objective values over a finite horizon (L_sum).
- Experiment with losses that encourage exploration, such as expected improvement (EI) and observed improvement (OI).
- Sample the training functions from Gaussian process priors, so that the training objectives are differentiable and supply gradient signals to the optimizer.
- Extend the framework to parallel evaluations by augmenting inputs with a feedback flag and simulating out-of-order completions.
- Compare learned optimizers to Spearmint, TPE, and SMAC, and evaluate on transfer tasks including GP bandits, control, and hyper-parameter tuning.
- Use LSTMs and differentiable neural computers (DNCs) as the optimizer architecture and assess their speed at test time.
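The query loop and two of the losses listed above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation: the vanilla tanh cell, the helper names (`init_params`, `rollout`, `loss_sum`, `loss_oi`), the zero initial query, and the choice to start the observed-improvement sum at the second step are all assumptions. The weights are random and untrained; the paper trains them by backpropagation through time on sampled functions, which would need an automatic differentiation framework and is not shown here.

```python
import numpy as np


def init_params(dim, hidden, rng):
    """Random weights for a vanilla RNN cell (hypothetical and untrained)."""
    scale = 0.5
    w_h = scale * rng.standard_normal((hidden, hidden))
    w_in = scale * rng.standard_normal((hidden, dim + 1))
    w_out = scale * rng.standard_normal((dim, hidden))
    return w_h, w_in, w_out


def rollout(f, params, dim, horizon):
    """Unroll the optimizer: each step reads the previous query and its
    observed value, updates the hidden state, and proposes the next query."""
    w_h, w_in, w_out = params
    h = np.zeros(w_h.shape[0])
    x = np.zeros(dim)  # assumed dummy initial query
    y = 0.0            # assumed dummy initial observation
    xs, ys = [], []
    for _ in range(horizon):
        inp = np.concatenate([x, [y]])
        h = np.tanh(w_h @ h + w_in @ inp)  # same weights at every step
        x = np.tanh(w_out @ h)             # keeps queries in [-1, 1]^dim
        y = float(f(x))                    # black-box evaluation
        xs.append(x)
        ys.append(y)
    return np.array(xs), np.array(ys)


def loss_sum(ys):
    """Summed loss: total of the objective values seen over the horizon."""
    return float(np.sum(ys))


def loss_oi(ys):
    """Observed improvement: sum of min(y_t - best value before t, 0).
    Only steps that beat the running minimum contribute (negatively)."""
    ys = np.asarray(ys, dtype=float)
    best_before = np.minimum.accumulate(ys)[:-1]
    return float(np.sum(np.minimum(ys[1:] - best_before, 0.0)))


# Usage: roll out the untrained optimizer on a toy quadratic.
rng = np.random.default_rng(0)
params = init_params(dim=2, hidden=8, rng=rng)
xs, ys = rollout(lambda x: np.sum((x - 0.3) ** 2), params, dim=2, horizon=5)
print(loss_sum(ys), loss_oi(ys))
```

Both losses are written for minimization. The summed loss rewards low values at every step, while the observed-improvement loss only credits steps that improve on the best value seen so far, which leaves more room for exploratory queries.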
Experimental results
Research questions
- RQ1: Can a learned RNN-based optimizer, trained on simple synthetic functions, effectively optimize a wide range of black-box functions?
- RQ2: Do learned optimizers transfer to derivative-free optimization domains beyond their training distribution?
- RQ3: How do different meta-learning losses (sum, EI, OI) influence exploration-exploitation balance and performance?
- RQ4: What are the computational advantages of learned optimizers relative to established Bayesian optimization packages?
- RQ5: Can parallel evaluation be integrated into the learned optimization framework without performance loss?
Key findings
- Learned RNN optimizers transfer to GP bandits, control objectives, global optimization benchmarks, and ML hyper-parameter tuning.
- DNC-based optimizers trained with EI or OI losses outperform DNCs trained directly on function observations (L_sum) and are competitive with, and often faster than, Spearmint, SMAC, and TPE within a 100-step horizon.
- At test time, the learned optimizers run orders of magnitude faster than traditional BO methods, with reported speedups of up to roughly 10^4×.
- In higher-dimensional input spaces, learned optimizers outperform baseline BO methods within the training horizon.
- Parallel proposal schemes maintain performance while offering substantial speedups in hyper-parameter tuning scenarios.
- The approach achieves competitive results on standard benchmarks and simple control problems, often matching engineered optimizers.
This review was created by AI and reviewed by human editors.