QUICK REVIEW

[Paper Review] Learned Optimizers that Scale and Generalize

Olga Wichrowska, Niru Maheswaranathan | arXiv (Cornell University) | Mar 14, 2017
Advanced Neural Network Applications | 25 references | 115 citations
TL;DR

The paper presents a hierarchical RNN-based learned optimizer that generalizes to new tasks and scales to larger problems. It performs competitively with ADAM/RMSProp and can train ImageNet-scale models for the first few thousand steps.

ABSTRACT

Learning to learn has emerged as an important direction for achieving artificial intelligence. Two of the primary barriers to its adoption are an inability to scale to larger problems and a limited ability to generalize to new tasks. We introduce a learned gradient descent optimizer that generalizes well to new tasks, and which has significantly reduced memory and computation overhead. We achieve this by introducing a novel hierarchical RNN architecture, with minimal per-parameter overhead, augmented with additional architectural features that mirror the known structure of optimization tasks. We also develop a meta-training ensemble of small, diverse optimization tasks capturing common properties of loss landscapes. The optimizer learns to outperform RMSProp/ADAM on problems in this corpus. More importantly, it performs comparably or better when applied to small convolutional neural networks, despite seeing no neural networks in its meta-training set. Finally, it generalizes to train Inception V3 and ResNet V2 architectures on the ImageNet dataset for thousands of steps, optimization problems that are of a vastly different scale than those it was trained on. We release an open source implementation of the meta-training algorithm.

Motivation & Objective

  • Demonstrate that a learned gradient-descent optimizer can generalize to unseen tasks and architectures.
  • Reduce memory and computation overhead to enable scaling to larger problems.
  • Incorporate optimization-inspired features (attention, multi-timescale momentum, dynamic input scaling) into a learnable update rule.
  • Develop a diverse meta-training ensemble that captures common loss landscape properties.
  • Show that the optimizer can train larger models (ImageNet-scale) in early training steps.

Proposed method

  • Introduce a hierarchical RNN optimizer with per-parameter (Parameter RNN), tensor-level (Tensor RNN), and global (Global RNN) components.
  • Incorporate optimization-motivated features: attention-based extrapolation, multi-timescale momentum, dynamic input scaling, and decomposition of each update into a direction and a step length.
  • Feed the RNNs gradient-derived inputs, including scaled gradients, momentum measures, and relative learning-rate signals.
  • Output per-parameter and per-tensor updates, plus adjustments to the log learning rates, via learned affine readouts.
  • Meta-train the optimizer on a curated ensemble of small, diverse optimization tasks, with the number of training steps per task drawn from a heavy-tailed distribution.
  • Use a meta-objective based on the average logarithm of the loss over the training run, which encourages precise convergence and learning-rate adaptation.
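
Two of the ideas above can be illustrated with a minimal NumPy sketch: momentum tracked at several timescales and rescaled by a running gradient magnitude, and a meta-objective that averages the log of the loss. This is an illustration under stated assumptions, not the paper's implementation. The function names, the decay values, and the `eps` constants are hypothetical, and the paper's exact normalization may differ.

```python
import numpy as np

def multi_timescale_inputs(grads, decays=(0.5, 0.9, 0.99), eps=1e-8):
    """Scaled momentum at several timescales (illustrative sketch).

    grads:  array of shape (num_steps, num_params), one gradient per step.
    decays: assumed decay rates, one per timescale (hypothetical values).
    Returns an array of shape (num_timescales, num_params).
    """
    grads = np.asarray(grads, dtype=float)
    betas = np.asarray(decays, dtype=float)[:, None]       # (S, 1)
    avg_grad = np.zeros((len(decays), grads.shape[1]))     # running mean of g
    avg_sq = np.zeros_like(avg_grad)                       # running mean of g^2
    for g in grads:
        avg_grad = betas * avg_grad + (1.0 - betas) * g
        avg_sq = betas * avg_sq + (1.0 - betas) * g ** 2
    # Dividing by the running RMS makes the inputs insensitive to the
    # overall gradient scale, similar in spirit to RMSProp/ADAM.
    return avg_grad / np.sqrt(avg_sq + eps)

def average_log_loss(losses, eps=1e-10):
    """Mean of log(loss) along a training trajectory (assumed form)."""
    losses = np.asarray(losses, dtype=float)
    return float(np.mean(np.log(losses + eps)))

# Example: three steps of gradients for two parameters.
example_grads = np.array([[1.0, -2.0], [0.5, -1.0], [2.0, -4.0]])
features = multi_timescale_inputs(example_grads)
objective = average_log_loss([1.0, 0.1, 0.01])
```

Because the momentum is divided by a running root-mean-square gradient, multiplying every gradient by a constant leaves `features` essentially unchanged, which is the property dynamic input scaling aims for. Averaging the log of the loss rewards continued progress at small loss values more than averaging the raw loss would.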

Experimental results

Research questions

  • RQ1: Can a learned optimizer generalize to neural architectures and problem classes not seen during meta-training?
  • RQ2: How can memory and compute overhead be reduced to enable scaling to larger optimization problems?
  • RQ3: Do optimization-informed architectural features help a learned optimizer generalize across tasks?
  • RQ4: Does meta-training on a diverse set of small tasks yield robust performance on larger networks and datasets (e.g., ImageNet)?

Key findings

  • The hierarchical RNN optimizer outperforms RMSProp/ADAM on problems drawn from the meta-training corpus.
  • It generalizes to small ConvNets and fully connected nets not seen in meta-training, with comparable or better performance.
  • It can stabilize training for Inception V3 and ResNet V2 in early steps on ImageNet, though progress may slow later in training.
  • Memory and compute overhead scale favorably when keeping the Parameter RNN small, enabling larger-scale use cases.
  • Performance is robust to initial learning-rate choices, and ablations show the importance of key features (attention, multi-timescale momentum, scaling, relative learning rates).
  • Wall-clock time for the learned optimizer approaches that of standard optimizers as minibatch size increases.

This review was created by AI and reviewed by human editors.