QUICK REVIEW

[Paper Review] Exploring loss function topology with cyclical learning rates

Leslie N. Smith, Nicholay Topin | arXiv (Cornell University) | Feb 14, 2017

Machine Learning and Algorithms | 4 references | 17 citations

TL;DR

This paper investigates the topology of neural network loss functions using cyclical learning rates (CLR) and linear network interpolation. It reports counterintuitive behaviors, including simultaneous increases in test loss and test accuracy, and super-convergence, in which networks trained with large learning rates reach higher test accuracy in fewer iterations than standard training. The authors show that CLR reaches distinct minima across cycles and that interpolating between these solutions improves generalization, acting as a form of regularization.

ABSTRACT

We present observations and discussion of previously unreported phenomena discovered while training residual networks. The goal of this work is to better understand the nature of neural networks through the examination of these new empirical results. These behaviors were identified through the application of Cyclical Learning Rates (CLR) and linear network interpolation. Among these behaviors are counterintuitive increases and decreases in training loss and instances of rapid training. For example, we demonstrate how CLR can produce greater testing accuracy than traditional training despite using large learning rates. Files to replicate these results are available at https://github.com/lnsmith54/exploring-loss

Motivation & Objective

To investigate previously unreported training dynamics in deep neural networks using cyclical learning rates and learning rate range tests.
To understand the underlying structure of neural network loss functions by observing how training behaviors change with varying learning rates.
To explore whether distinct minima are found across cycles of CLR and whether interpolation between them improves model generalization.
To assess the robustness of neural network architectures based on the range of learning rates yielding high test accuracy.
To evaluate the potential of weight interpolation between solutions as a regularization technique.

Proposed method

Applying cyclical learning rates (CLR) with a triangular policy, where the learning rate rises linearly from a minimum to a maximum value over a fixed number of iterations (the stepsize) and then falls back to the minimum over the same number of iterations.
Conducting learning rate range tests by linearly increasing the learning rate from a small initial value to a large one throughout training to map the network's convergence behavior across a wide LR spectrum.
Using linear network interpolation to compare trained network weights by computing weighted averages of two sets of trained weights: net_new = α*net_1 + (1−α)*net_2 for varying α values.
Measuring training and test loss and accuracy during interpolation to detect whether solutions correspond to the same or different minima.
Analyzing training trajectories and loss function behavior under CLR to identify anomalies such as loss increases concurrent with accuracy gains.
Comparing standard training with fixed learning rates to CLR training to evaluate convergence speed and final model performance.
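
The two learning rate schedules described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names, the base_lr and max_lr values, and the range test endpoints are placeholders chosen for the example. Only the stepsize of 10,000 iterations comes from this review.

```python
import math


def triangular_lr(iteration, base_lr, max_lr, stepsize):
    """Triangular CLR: rise linearly from base_lr to max_lr over `stepsize`
    iterations, then fall back to base_lr over the next `stepsize`."""
    cycle = math.floor(1 + iteration / (2 * stepsize))
    # x is 1 at the start and end of a cycle and 0 at its midpoint.
    x = abs(iteration / stepsize - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)


def range_test_lr(iteration, total_iterations, start_lr, end_lr):
    """LR range test: a single linear ramp from start_lr to end_lr."""
    frac = iteration / total_iterations
    return start_lr + (end_lr - start_lr) * frac


# Illustrative values only (base_lr and max_lr are assumptions).
for it in (0, 5000, 10000, 15000, 20000):
    print(it, triangular_lr(it, base_lr=0.1, max_lr=1.0, stepsize=10000))
```

With a stepsize of 10,000, one full cycle spans 20,000 iterations: the rate peaks at iteration 10,000 and returns to its minimum at iteration 20,000.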

Experimental results

Research questions

RQ1: How does the behavior of training loss and accuracy change when using cyclical learning rates, particularly when the learning rate crosses critical thresholds?
RQ2: Can distinct minima be identified across cycles of cyclical learning rate training, and do they correspond to different generalization capabilities?
RQ3: Why do test loss and test accuracy sometimes increase simultaneously, defying conventional expectations of inverse correlation?
RQ4: To what extent can interpolation between solutions from different cycles improve model generalization and reduce test loss?
RQ5: Does the range of learning rates yielding high test accuracy correlate with architectural robustness in deep networks?

Key findings

Cyclical learning rates with a stepsize of 10,000 iterations enabled super-convergence, achieving 93% test accuracy on CIFAR-10 with ResNet-56 in just 20,000 iterations. This surpassed standard training with an initial learning rate of 0.35, which reached only 91% accuracy.
A sharp four-order-of-magnitude increase in training loss occurred at a learning rate of approximately 0.255 during CLR training, yet training convergence resumed at higher learning rates, indicating complex loss function topology.
Simultaneous increases in both test loss and test accuracy were observed during multiple cycles of CLR, defying the typical inverse relationship and suggesting non-monotonic behavior in the loss landscape.
Interpolation between solutions from different CLR cycles revealed a central minimum in test loss, indicating that averaging weights from distinct minima improves generalization and acts as a form of regularization.
The loss function topology revealed by learning rate range tests showed a broad range of learning rates (0.25 to 1.0) yielding consistently high test accuracy, suggesting that architectures with such ranges may be more robust to hyperparameter choice.
Solutions found at the end of each CLR cycle were distinct, as confirmed by interpolation showing a 'peak' in loss between them, indicating they correspond to different minima in the loss landscape.
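
The interpolation test behind these findings, net_new = α*net_1 + (1−α)*net_2, can be sketched as follows. This is a hedged toy example, not the paper's experiment: the weight dictionaries, the helper names interpolate and sweep, and the double-well toy_loss are all hypothetical stand-ins for real trained networks and a real evaluation loop.

```python
import numpy as np


def interpolate(net_1, net_2, alpha):
    """Return alpha*net_1 + (1 - alpha)*net_2, parameter by parameter."""
    return {name: alpha * net_1[name] + (1.0 - alpha) * net_2[name]
            for name in net_1}


def sweep(net_1, net_2, loss_fn, alphas):
    """Evaluate loss_fn on the interpolated weights for each alpha."""
    return [(float(a), float(loss_fn(interpolate(net_1, net_2, a))))
            for a in alphas]


def toy_loss(net):
    """Toy double-well loss with minima at w = -1 and w = +1 (assumption:
    stands in for evaluating a real network on a dataset)."""
    w = net["w"]
    return float(np.sum((w ** 2 - 1.0) ** 2))


# Two hypothetical "solutions" sitting in different minima of the toy loss.
net_1 = {"w": np.array([1.0])}
net_2 = {"w": np.array([-1.0])}

curve = sweep(net_1, net_2, toy_loss, np.linspace(0.0, 1.0, 5))
for alpha, loss in curve:
    print(f"alpha={alpha:.2f}  loss={loss:.4f}")
```

In this toy setting the loss is lowest at both endpoints and highest at α = 0.5, the kind of peak that marks two solutions as belonging to different minima. A curve that instead dips in the middle would correspond to the central minimum in test loss described above.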

This review was created by AI and reviewed by human editors.