QUICK REVIEW

[Paper Review] Gradient-based Hyperparameter Optimization through Reversible Learning

Dougal Maclaurin, David Duvenaud, Ryan P. Adams | arXiv (Cornell University) | Feb 11, 2015

Machine Learning and Data Classification | 31 references | 403 citations

TL;DR

This paper introduces a method for computing exact gradients of cross-validation loss with respect to hyperparameters by exactly reversing the dynamics of stochastic gradient descent with momentum. Because only a small amount of auxiliary information has to be stored, memory usage drops by a factor of about 200 (at momentum 0.9) relative to storing the full training trajectory. This makes it practical to optimize thousands of hyperparameters by gradient descent, including learning rate schedules, initialization distributions, and regularization schemes, as demonstrated on neural network training.

ABSTRACT

Tuning hyperparameters of learning algorithms is hard because gradients are usually unavailable. We compute exact gradients of cross-validation performance with respect to all hyperparameters by chaining derivatives backwards through the entire training procedure. These gradients allow us to optimize thousands of hyperparameters, including step-size and momentum schedules, weight initialization distributions, richly parameterized regularization schemes, and neural network architectures. We compute hyperparameter gradients by exactly reversing the dynamics of stochastic gradient descent with momentum.

Motivation & Objective

Address the challenge of hyperparameter optimization in machine learning, where gradients are typically unavailable due to the inner training loop.
Overcome the memory bottleneck of reverse-mode differentiation through training, which ordinarily requires storing the entire training trajectory.
Enable efficient, exact gradient computation through stochastic gradient descent with momentum using reversible learning dynamics.
Facilitate the automatic tuning of complex, high-dimensional hyperparameter spaces, including learning rate schedules, initialization distributions, and regularization schemes.
Provide a scalable framework for hyperparameter optimization that supports rich, structured hyperparameterization of models and training procedures.

Proposed method

Propose a reversible learning framework that exactly reverses the steps of stochastic gradient descent with momentum by storing only a small number of auxiliary variables.
Carry out the updates in finite-precision (fixed-point) arithmetic and record the low-order bits lost at each momentum-decay multiplication in a compact buffer, so that every training step can be undone exactly without storing intermediate parameter states.
Reduce storage requirements by a factor of about 200, compared to storing the full trajectory for standard reverse-mode differentiation, when momentum is 0.9.
Chain gradients backward through the entire training procedure using the reversed dynamics to compute exact hypergradients with respect to all continuous hyperparameters.
Apply the method to compute gradients of validation loss with respect to hyperparameters such as learning rate schedules, weight initialization distributions, and per-input regularization.
Leverage the exact reversibility of the training dynamics to avoid checkpointing and reduce memory footprint while maintaining computational accuracy.
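
The reversal idea in the list above can be sketched on a scalar toy problem. This is a minimal illustration under stated assumptions, not the authors' implementation: the names (SCALE, rational_mul, forward, reverse), the quadratic loss, and the fixed-point resolution are all hypothetical choices made for the example. Weights and velocity are held as integers, and the momentum decay by 9/10 is done as a rational multiplication that pushes the discarded remainder into an integer buffer, so the reverse pass can pull it back out.

```python
# Sketch only: a scalar toy problem, not the authors' implementation.
SCALE = 2 ** 20        # fixed-point resolution (assumed for illustration)
NUM, DEN = 9, 10       # momentum decay gamma = NUM / DEN = 0.9


def grad(w):
    """Gradient of the toy training loss 0.5 * (w - 3)^2."""
    return w - 3.0


def rational_mul(v, buf, n, d):
    """Multiply integer v by n/d; bits lost by the division are kept in buf."""
    buf = buf * d + v % d      # push the remainder of v / d onto the buffer
    v = v // d
    v = v * n + buf % n        # pull a base-n digit off the buffer into v
    buf = buf // n
    return v, buf


def forward(w, v, buf, steps, lr):
    for _ in range(steps):
        g = grad(w / SCALE)
        v, buf = rational_mul(v, buf, NUM, DEN)       # v <- gamma * v
        v -= round((1 - NUM / DEN) * g * SCALE)       # v <- v - (1 - gamma) * g
        w += round(lr * v)                            # w <- w + lr * v
    return w, v, buf


def reverse(w, v, buf, steps, lr):
    for _ in range(steps):
        w -= round(lr * v)                            # undo the weight update
        g = grad(w / SCALE)                           # same gradient as forward
        v += round((1 - NUM / DEN) * g * SCALE)       # undo the gradient term
        v, buf = rational_mul(v, buf, DEN, NUM)       # undo the decay exactly
    return w, v, buf


start = (round(0.5 * SCALE), 0, 0)                    # (weights, velocity, buffer)
end = forward(*start, steps=100, lr=0.1)
recovered = reverse(*end, steps=100, lr=0.1)
print(recovered == start)
```

Running the reverse pass on the final state recovers the starting weights, velocity, and empty buffer exactly. In this sketch the buffer grows by roughly log2(10/9) ≈ 0.15 bits per step, which is where a saving of about 200 comes from when compared with keeping a 32-bit value per parameter per step (32 / 0.15 ≈ 210).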

Experimental results

Research questions

RQ1: Can exact gradients of validation loss with respect to hyperparameters be computed efficiently despite the presence of an inner training loop?
RQ2: To what extent can the memory cost of hypergradient computation be reduced by exploiting the reversibility of stochastic gradient descent with momentum?
RQ3: Can this method scale to optimize thousands of hyperparameters simultaneously, including complex, structured schedules and initialization schemes?
RQ4: How do the optimized hyperparameters compare to standard heuristics in the literature, and what insights do they provide into learning dynamics?
RQ5: Is it feasible to use this method for end-to-end hyperparameter optimization across diverse model architectures and training procedures?

Key findings

The method computes exact hypergradients through stochastic gradient descent with momentum by reversing the training dynamics step by step, which eliminates the need to store the entire training trajectory.
Memory usage is reduced by a factor of up to 200 compared to standard reverse-mode differentiation when momentum is set to 0.9, making large-scale hyperparameter optimization feasible.
The method successfully optimizes thousands of hyperparameters simultaneously, including fine-grained learning rate schedules, per-layer weight initialization distributions, and per-pixel data preprocessing schemes.
Optimized learning rate schedules and initialization procedures revealed non-intuitive patterns that deviate from standard heuristics, offering new insights into effective training dynamics.
The approach enables automatic, gradient-based tuning of model architecture, regularization, and training procedures.
The framework is generalizable to other momentum-based optimization methods such as RMSprop and Adam, suggesting broader applicability beyond the specific case studied.
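
As a rough illustration of how a hypergradient falls out of the reversed dynamics, the sketch below differentiates a validation loss with respect to a single shared learning rate on a scalar quadratic. Everything named here (train, hypergradient, the toy losses and constants) is an assumption made for the example. It also uses ordinary floating point, so the reversal is only approximately exact; that is tolerable for a 20-step run but is not the exact-arithmetic scheme the paper relies on.

```python
# Sketch only: scalar toy losses and plain floats, not the paper's exact scheme.
GAMMA = 0.9                          # momentum decay
A, C_TRAIN, C_VAL = 2.0, 1.0, 1.5    # toy curvature and loss minima (assumed)


def train_grad(w):
    """Gradient of the training loss 0.5 * A * (w - C_TRAIN)^2."""
    return A * (w - C_TRAIN)


def train_hess(w):
    """Second derivative of the training loss (constant for a quadratic)."""
    return A


def val_loss(w):
    return 0.5 * (w - C_VAL) ** 2


def val_grad(w):
    return w - C_VAL


def train(lr, steps, w=0.0, v=0.0):
    """SGD with momentum; returns the final weight and velocity."""
    for _ in range(steps):
        v = GAMMA * v - (1 - GAMMA) * train_grad(w)
        w = w + lr * v
    return w, v


def hypergradient(lr, steps):
    """d(val_loss)/d(lr), walking training backwards from the final state."""
    w, v = train(lr, steps)
    d_w, d_v, d_lr = val_grad(w), 0.0, 0.0
    for _ in range(steps):
        d_lr += d_w * v                                 # lr entered via w += lr * v
        d_v += lr * d_w
        w = w - lr * v                                  # undo the weight update
        v = (v + (1 - GAMMA) * train_grad(w)) / GAMMA   # undo the velocity update
        d_w -= (1 - GAMMA) * d_v * train_hess(w)        # chain through the gradient
        d_v *= GAMMA
    return d_lr


print(hypergradient(lr=0.1, steps=20))
```

A quick sanity check is to compare the result against a central finite difference of val_loss(train(lr, steps)[0]) in lr; the two should agree closely for short runs like this one.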


This review was created by AI and reviewed by human editors.