QUICK REVIEW

[Paper Review] Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates

Leslie N. Smith, Nicholay Topin|arXiv (Cornell University)|Aug 23, 2017

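Stochastic Gradient Optimization Techniques | Cited by 106

One-Sentence Summary

This paper introduces super-convergence: neural networks can be trained much faster under a cyclical learning rate (CLR) schedule with very large learning rates, while other forms of regularization are reduced. It provides empirical evidence across multiple datasets and architectures, and gives a method inspired by Hessian-free optimization for estimating the optimal learning rate.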

ABSTRACT

In this paper, we describe a phenomenon, which we named "super-convergence", where neural networks can be trained an order of magnitude faster than with standard training methods. The existence of super-convergence is relevant to understanding why deep networks generalize well. One of the key elements of super-convergence is training with one learning rate cycle and a large maximum learning rate. A primary insight that allows super-convergence training is that large learning rates regularize the training, hence requiring a reduction of all other forms of regularization in order to preserve an optimal regularization balance. We also derive a simplification of the Hessian Free optimization method to compute an estimate of the optimal learning rate. Experiments demonstrate super-convergence for Cifar-10/100, MNIST and Imagenet datasets, and resnet, wide-resnet, densenet, and inception architectures. In addition, we show that super-convergence provides a greater boost in performance relative to standard training when the amount of labeled training data is limited. The architectures and code to replicate the figures in this paper are available at github.com/lnsmith54/super-convergence. See http://www.fast.ai/2018/04/30/dawnbench-fastai/ for an application of super-convergence to win the DAWNBench challenge (see https://dawn.cs.stanford.edu/benchmark/).

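Research Motivation and Objectives

Demonstrate that large learning rates can train neural networks significantly faster than standard learning rate schedules.
Show that large learning rates act as a regularizer, so other forms of regularization must be reduced to keep the balance.
Provide a practical way to estimate and use the optimal learning rate through a simplified Hessian-free method.
Validate super-convergence across several datasets (Cifar-10/100, MNIST, ImageNet) and architectures (ResNet, Wide-ResNet, DenseNet, Inception).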

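Proposed Method

Use a cyclical learning rate (CLR) with the one-cycle policy (1cycle): the learning rate ramps from a low value up to a large maximum, then decays over the remaining iterations.
Apply the LR range test to identify a peak learning rate suitable for CLR.
Simplify Hessian-free optimization to obtain a proxy for adaptive per-weight learning rates, and show that large learning rates are associated with wide, flat minima.
Balance regularization when using large learning rates by reducing other forms of regularization, such as weight decay.
Compare super-convergence against standard piecewise-constant learning rate schedules on multiple datasets and architectures.
Report how performance differs as the amount of available data and the batch size vary.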
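
The one-cycle policy in the first item can be sketched as a plain function of the iteration index. This is a minimal illustration, not the authors' code: the function name `one_cycle_lr`, the triangular shape, the short annealing tail, and the default values (`min_lr`, `final_lr`, `cycle_fraction`) are all assumptions chosen for readability. Only the idea of one cycle with a large maximum learning rate comes from the paper.

```python
def one_cycle_lr(step, total_steps, min_lr=0.1, max_lr=3.0,
                 final_lr=0.001, cycle_fraction=0.9):
    """Learning rate at `step` for one triangular cycle plus a decay tail.

    All parameter names and defaults are illustrative assumptions.
    """
    if not 0 <= step < total_steps:
        raise ValueError("step must satisfy 0 <= step < total_steps")
    if not 0.0 < cycle_fraction <= 1.0:
        raise ValueError("cycle_fraction must be in (0, 1]")

    # The cycle is split into two halves of equal length.
    half = max(1, round(total_steps * cycle_fraction) // 2)
    cycle_end = 2 * half

    if step <= half:
        # First half: ramp linearly from min_lr up to max_lr.
        return min_lr + (max_lr - min_lr) * step / half
    if step <= cycle_end:
        # Second half: ramp linearly back down to min_lr.
        return max_lr - (max_lr - min_lr) * (step - half) / half

    # Tail: anneal linearly from min_lr to final_lr over the remaining steps.
    tail = total_steps - 1 - cycle_end
    return min_lr + (final_lr - min_lr) * (step - cycle_end) / tail


# Example: a 1,000-iteration run.
schedule = [one_cycle_lr(s, 1000) for s in range(1000)]
```

In a real training loop the returned value would be written into the optimizer's learning rate before each update. Deep learning frameworks usually ship their own one-cycle schedulers, which may differ from this sketch in shape and defaults.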

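Experimental Results

Research Questions

RQ1: Can training be accelerated with very large learning rates and a cyclical schedule without sacrificing final accuracy?
RQ2: When using large learning rates, how should other regularization be adjusted to maintain the optimal regularization balance?
RQ3: Do large learning rates lead to wide, flat minima, and how can a suitable learning rate be estimated in practice?
RQ4: Is super-convergence observable across diverse datasets (CIFAR, MNIST, ImageNet) and architectures (ResNet, DenseNet, Inception, Wide-ResNet)?
RQ5: How does data availability (a limited number of labeled samples) affect the gains from super-convergence?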
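
For the practical side of RQ3, the LR range test can be illustrated on a toy problem: run one short training pass while the learning rate grows, record the loss, and note where it blows up. The sketch below is an assumption-laden stand-in, not the paper's experiment. It uses a one-dimensional quadratic loss, Gaussian noise in place of mini-batch noise, and a crude divergence check; the name `lr_range_test` and every parameter are hypothetical.

```python
import random


def lr_range_test(lr_min=0.01, lr_max=4.0, num_steps=400,
                  curvature=1.0, noise_std=1e-3, seed=0):
    """Ramp the LR linearly during one SGD run on f(w) = 0.5 * curvature * w**2.

    Returns the (lr, loss) history up to the step where the loss blows up.
    """
    rng = random.Random(seed)
    w = 1.0
    initial_loss = 0.5 * curvature * w * w
    history = []
    for step in range(num_steps):
        lr = lr_min + (lr_max - lr_min) * step / (num_steps - 1)
        # Noisy gradient: a stand-in for a mini-batch gradient estimate.
        grad = curvature * w + rng.gauss(0.0, noise_std)
        w -= lr * grad
        loss = 0.5 * curvature * w * w
        history.append((lr, loss))
        if loss > 4.0 * initial_loss:
            # Crude divergence check: stop once the loss is far above its start.
            break
    return history


history = lr_range_test()
diverged_at_lr = history[-1][0]
```

On this toy loss, a constant learning rate is stable only below 2 / curvature, and the ramp overshoots that limit before the loss visibly diverges. That is one reason a peak learning rate is normally picked somewhat below the point where the curve breaks down, rather than at it.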

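Key Findings

Compared with standard training, super-convergence reaches a higher final test accuracy in fewer iterations (e.g., ResNet-56 on CIFAR-10: 92.4% in 10,000 iterations versus 91.2% in 80,000 iterations).
Training with the one-cycle learning rate (maximum lr of 3 or higher) yields better final accuracy than conventional schedules, even when the total number of iterations differs (e.g., CIFAR-10/ResNet-56: 92.1% at 6,000 iterations).
Large learning rates regularize training, so other regularization (such as weight decay) must be reduced to balance the overall effect.
For some architectures, the simplified Hessian-free method gives estimated learning rates in the range 2 to 6 during training, indicating that training with large learning rates finds wide, flat minima.
When labeled data is limited, the gains from super-convergence are more pronounced: training time drops while accuracy improves or holds.
On ImageNet, lowering weight decay makes it possible to use large learning rates with the 1cycle policy and reach higher top-1 accuracy (e.g., ResNet-50: 67.6% with WD from 3e-6 to 1e-5; Inception-ResNet-v2: 74.0% with WD 3e-6).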
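
To make the learning rate estimate in the findings more concrete, here is a small sketch of a secant-style (finite-difference) estimate. This review does not reproduce the paper's derivation, so the formula is an assumption: under plain SGD with learning rate `lr`, the gradient difference between two consecutive steps can be rewritten in terms of three consecutive weight values, which gives `lr * (theta_1 - theta_0) / (2 * theta_1 - theta_0 - theta_2)`. The function name and the scalar setting are hypothetical simplifications.

```python
def estimate_lr(theta_0, theta_1, theta_2, lr):
    """Secant-style learning rate estimate from three consecutive SGD iterates.

    Assumes plain SGD: theta_{k+1} = theta_k - lr * grad(theta_k), so that
    grad(theta_1) - grad(theta_0) = (2*theta_1 - theta_0 - theta_2) / lr.
    """
    denom = 2.0 * theta_1 - theta_0 - theta_2
    if denom == 0.0:
        raise ZeroDivisionError("iterates carry no curvature information")
    return lr * (theta_1 - theta_0) / denom


# Toy check on f(theta) = 0.5 * a * theta**2, whose ideal step size is 1 / a.
a = 0.25
lr = 0.1
theta_0 = 1.0
theta_1 = theta_0 - lr * a * theta_0  # one SGD step
theta_2 = theta_1 - lr * a * theta_1  # a second SGD step
estimated = estimate_lr(theta_0, theta_1, theta_2, lr)
```

On this quadratic the estimate recovers 1 / a = 4 up to rounding error, because the secant approximation of the curvature is exact. In a deep network the same quantity would be computed per weight on noisy iterates and would need averaging, so it is better read as a rough indicator than as a precise setting.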

This review was generated by AI and reviewed by a human editor.