QUICK REVIEW

[Paper Review] SGDR: Stochastic Gradient Descent with Warm Restarts

Ilya Loshchilov, Frank Hutter|arXiv (Cornell University)|Aug 13, 2016

Domain Adaptation and Few-Shot Learning|References: 28|Cited by: 1,741

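One-Sentence Summary

Introduces stochastic gradient descent (SGD) with cosine-annealed warm restarts to improve the training speed and generalization of deep networks, achieving state-of-the-art results on CIFAR-10/100 and enabling effective snapshot ensembles.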

ABSTRACT

Restart techniques are common in gradient-free optimization to deal with multimodal functions. Partial warm restarts are also gaining popularity in gradient-based optimization to improve the rate of convergence in accelerated gradient schemes to deal with ill-conditioned functions. In this paper, we propose a simple warm restart technique for stochastic gradient descent to improve its anytime performance when training deep neural networks. We empirically study its performance on the CIFAR-10 and CIFAR-100 datasets, where we demonstrate new state-of-the-art results at 3.14% and 16.21%, respectively. We also demonstrate its advantages on a dataset of EEG recordings and on a downsampled version of the ImageNet dataset. Our source code is available at https://github.com/loshchil/SGDR

Research Motivation and Objectives

Identify and address the slow anytime performance of SGD in deep neural network training.
Propose a simple warm restart mechanism with cosine annealing to improve convergence speed.
Demonstrate improvements on CIFAR-10/100 and show benefits on EEG data and a downsampled ImageNet.
Explore ensemble gains from snapshots taken during SGDR trajectories.
Highlight potential for enabling faster architecture exploration and training efficiency.

Proposed Method

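Simulate warm restarts in SGD by restarting at predefined intervals: the learning rate is reset to a higher value while the model weights are kept, not reinitialized.
Within each restart cycle, vary the learning rate over T_i epochs using cosine annealing from its maximum to its minimum: eta_t = eta_min^i + 0.5*(eta_max^i - eta_min^i)*(1 + cos(T_cur/T_i * pi)), where T_cur is the number of epochs since the last restart.
Grow T_i by a multiplier T_mult after each restart to improve anytime performance and reach a good test error faster.
Use a single SGDR run, or a small number of runs, keeping the same eta_max and eta_min across restarts to reduce hyperparameter tuning.
Optionally ensemble SGDR snapshots taken just before each restart to form a more accurate ensemble model.
Compare SGDR with standard learning rate schedules and reproduce baseline results on WRN architectures.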
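
The schedule described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `sgdr_learning_rate`, the parameter names, and the default values (eta_min=0.0, eta_max=0.05, t_0=10, t_mult=2) are assumptions chosen for the example. It assumes the first cycle lasts t_0 epochs, that each later cycle is t_mult times longer than the previous one, and that eta_max and eta_min stay fixed across restarts.

```python
import math


def sgdr_learning_rate(epoch, eta_min=0.0, eta_max=0.05, t_0=10, t_mult=2):
    """Return the cosine-annealed learning rate at a given epoch.

    All names and defaults here are illustrative assumptions.
    `epoch` may be fractional if the rate is updated once per batch.
    """
    t_i = t_0      # length of the current cycle, in epochs
    t_cur = epoch  # epochs elapsed since the last restart
    # Walk through completed cycles; each restart multiplies the cycle length.
    while t_cur >= t_i:
        t_cur -= t_i
        t_i *= t_mult
    # eta_t = eta_min + 0.5 * (eta_max - eta_min) * (1 + cos(pi * T_cur / T_i))
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t_cur / t_i))


if __name__ == "__main__":
    # With t_0=10 and t_mult=2, restarts fall at epochs 10, 30, 70, ...
    for e in (0, 5, 9, 10, 20, 29, 30):
        print(e, round(sgdr_learning_rate(e), 5))
```

Under these assumptions the rate jumps back to eta_max at every restart and decays toward eta_min just before the next one, which is also where a snapshot would be taken for ensembling.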

Experimental Results

Research Questions

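RQ1: Does SGDR improve training efficiency over standard SGD schedules, measured as the time needed to reach a target test error?
RQ2: Do cosine-annealed warm restarts with a growing T_i yield faster convergence and better generalization?
RQ3: Do snapshot ensembles drawn from SGDR trajectories provide significant gains over single-run models or ensembles of independent runs in most settings?
RQ4: Do SGDR's gains transfer to domains beyond CIFAR, such as EEG data, and to a downsampled ImageNet configuration?
RQ5: Which practical hyperparameters (initial learning rate, T_i, T_mult) balance speed and accuracy?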

Key Findings

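Compared with the default schedule, SGDR reaches competitive test errors faster on CIFAR-10 (around 4%) and CIFAR-100 (around 20%).
Ensembles formed from SGDR snapshots reach the reported state-of-the-art results (e.g., test error of 3.14% on CIFAR-10 and 16.21% on CIFAR-100 with N=16 runs and M=3 snapshots).
SGDR lets a wider network (WRN-28-20) reach better accuracy within the same or a shorter budget than narrower networks trained with the standard schedule.
Ensemble members drawn from SGDR snapshots are diverse and useful, and in many settings outperform equivalent ensembles built from independent runs.
Preliminary experiments indicate that SGDR improves performance on an EEG dataset and on downsampled ImageNet, suggesting broader applicability.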
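
To make the snapshot-ensemble findings concrete, the sketch below combines the outputs of several snapshots for a single input. It assumes the ensemble averages per-class probabilities across snapshots and predicts the class with the highest mean; the function name `ensemble_predict` and the probability values are hypothetical and are not taken from the paper's experiments.

```python
def ensemble_predict(snapshot_probs):
    """Average class probabilities over snapshots and return (label, mean_probs).

    `snapshot_probs` is a list with one probability vector per snapshot,
    all for the same input. Names and data are illustrative assumptions.
    """
    n_models = len(snapshot_probs)
    n_classes = len(snapshot_probs[0])
    # Mean probability of each class across all snapshots.
    mean_probs = [
        sum(probs[c] for probs in snapshot_probs) / n_models
        for c in range(n_classes)
    ]
    # Predicted label is the class with the highest averaged probability.
    label = max(range(n_classes), key=lambda c: mean_probs[c])
    return label, mean_probs


if __name__ == "__main__":
    # Made-up outputs of M=3 snapshots for one two-class input.
    probs = [[0.6, 0.4], [0.3, 0.7], [0.2, 0.8]]
    print(ensemble_predict(probs))
```

In this toy input the first snapshot alone would predict class 0, while the averaged ensemble predicts class 1, which illustrates how diverse members can correct one another.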


This review was generated by AI and reviewed by human editors.