QUICK REVIEW

[Paper Review] A disciplined approach to neural network hyper-parameters: Part 1 -- learning rate, batch size, momentum, and weight decay

Leslie N. Smith|arXiv (Cornell University)|Mar 26, 2018

Advanced Neural Network Applications | References: 18 | Cited by: 822

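One-Sentence Summary

This paper proposes practical and efficient methods for setting the learning rate, batch size, momentum, and weight decay by analyzing the validation/test loss, using cyclical learning rates and momentum, and balancing regularization to speed up training while improving performance.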

ABSTRACT

Although deep learning has produced dazzling successes for applications of image, speech, and video processing in the past few years, most trainings are with suboptimal hyper-parameters, requiring unnecessarily long training times. Setting the hyper-parameters remains a black art that requires years of experience to acquire. This report proposes several efficient ways to set the hyper-parameters that significantly reduce training time and improves performance. Specifically, this report shows how to examine the training validation/test loss function for subtle clues of underfitting and overfitting and suggests guidelines for moving toward the optimal balance point. Then it discusses how to increase/decrease the learning rate/momentum to speed up training. Our experiments show that it is crucial to balance every manner of regularization for each dataset and architecture. Weight decay is used as a sample regularizer to show how its optimal value is tightly coupled with the learning rates and momentums. Files to help replicate the results reported here are available.

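Research Motivation and Goals

Reduce training time and improve performance through disciplined hyper-parameter tuning.
Diagnose underfitting and overfitting early in training from the training and validation/test losses.
Show how the learning rate, momentum, batch size, and weight decay depend on one another, and how to balance them.
Introduce cyclical learning rates (CLR), cyclical momentum (CM), and the 1cycle policy to accelerate convergence.
Provide practitioners with practical guidelines and reproducible resources.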

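Proposed Methods

Analyze the training and validation/test losses early in training to guide hyper-parameter adjustments.
Run the cyclical learning rate (CLR) range test to determine suitable learning rate bounds.
Use the 1cycle learning rate policy to converge quickly at large learning rates.
Study cyclical momentum (CM) and its interaction with CLR to stabilize training.
Evaluate weight decay and its balance with the learning rate and CM across datasets and architectures.
Provide practitioners with reproducible files and practical guidelines.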
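
To make the 1cycle policy and cyclical momentum more concrete, the following is a minimal sketch of a schedule in which the learning rate rises and falls linearly over one cycle while momentum moves in the opposite direction, followed by a short final phase where the learning rate decays well below its starting value. The function name `one_cycle`, its parameters, and every default value (learning rate bounds, momentum bounds, the 10% final phase, the divisor of 100) are illustrative assumptions, not settings taken from the paper.

```python
def one_cycle(step, total_steps, lr_min=0.1, lr_max=1.0,
              mom_min=0.85, mom_max=0.95,
              final_frac=0.1, lr_final_div=100.0):
    """Return (learning_rate, momentum) for `step` under a simplified 1cycle schedule.

    All defaults are hypothetical values chosen for illustration.
    """
    if not 0 <= step < total_steps:
        raise ValueError("step must satisfy 0 <= step < total_steps")
    final_steps = round(total_steps * final_frac)   # length of the closing decay phase
    half = (total_steps - final_steps) // 2          # length of each half of the cycle
    if half == 0:
        raise ValueError("total_steps is too small for a full cycle")

    if step < half:
        # First half: learning rate ramps up, momentum ramps down.
        t = step / half
        lr = lr_min + t * (lr_max - lr_min)
        mom = mom_max - t * (mom_max - mom_min)
    elif step < 2 * half:
        # Second half: learning rate ramps down, momentum ramps back up.
        t = (step - half) / half
        lr = lr_max - t * (lr_max - lr_min)
        mom = mom_min + t * (mom_max - mom_min)
    else:
        # Final phase: learning rate decays below lr_min, momentum stays high.
        remaining = total_steps - 2 * half
        t = (step - 2 * half) / remaining
        lr_end = lr_min / lr_final_div
        lr = lr_min - t * (lr_min - lr_end)
        mom = mom_max
    return lr, mom


if __name__ == "__main__":
    for s in (0, 45, 90, 99):
        print(s, one_cycle(s, 100))
```

In a training loop, the returned pair would be written into the optimizer before each update; how that is done depends on the framework in use.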

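Experimental Results

Research Questions

RQ1: How can practitioners efficiently determine the best learning rate, batch size, momentum, and weight decay without an exhaustive grid search?
RQ2: Which early indicators in the validation/test loss reveal underfitting or overfitting during training?
RQ3: How do cyclical learning rates and cyclical momentum interact to affect convergence speed and stability?
RQ4: Across architectures and datasets, what role does weight decay play in balancing regularization against the other hyper-parameters?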

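Key Findings

The validation/test loss carries information about convergence and generalization that is not always visible in the training loss or in accuracy.
The LR range test helps identify the largest usable learning rate and good learning rate bounds for CLR.
When other forms of regularization are reduced to keep the total regularization balanced, large learning rates enable faster training (super-convergence).
Batch size is tied to the learning rate and to hardware constraints; at roughly constant execution time, larger batch sizes can improve final accuracy, with diminishing returns beyond a certain point.
Cyclical momentum combined with CLR is generally more robust and reaches better final performance than constant momentum, especially for deeper networks such as ResNet-56.
Weight decay should be balanced against the learning rate and momentum; its optimal value depends on the dataset and architecture and benefits from being explored jointly with CLR/CM.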
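
The idea behind the LR range test can be sketched on a toy problem: train briefly while increasing the learning rate, record the loss at each step, and stop once the loss blows up. The example below is an assumption-laden illustration rather than the paper's procedure: it uses plain gradient descent on a hypothetical two-dimensional quadratic, grows the learning rate exponentially for convenience, and uses a made-up divergence threshold. The names `lr_range_test`, `toy_loss`, and `toy_grad` are hypothetical.

```python
import math


def lr_range_test(loss_fn, grad_fn, w0, lr_start=1e-3, lr_end=10.0,
                  num_steps=100, diverge_factor=4.0):
    """Run gradient descent with a growing learning rate.

    Returns a list of (learning_rate, loss_before_update) pairs, stopping
    early once the loss exceeds `diverge_factor` times the best loss seen.
    """
    w = list(w0)
    history = []
    best = math.inf
    for k in range(num_steps):
        # Exponential growth from lr_start to lr_end (an illustrative choice).
        lr = lr_start * (lr_end / lr_start) ** (k / (num_steps - 1))
        loss = loss_fn(w)
        if not math.isfinite(loss) or loss > diverge_factor * best:
            break  # training has diverged; larger rates are unusable
        history.append((lr, loss))
        best = min(best, loss)
        grad = grad_fn(w)
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return history


# Hypothetical toy objective: a quadratic with curvatures 1 and 10, for which
# plain gradient descent becomes unstable once the learning rate exceeds 2/10.
CURVATURES = (1.0, 10.0)


def toy_loss(w):
    return 0.5 * sum(c * wi * wi for c, wi in zip(CURVATURES, w))


def toy_grad(w):
    return [c * wi for c, wi in zip(CURVATURES, w)]


if __name__ == "__main__":
    hist = lr_range_test(toy_loss, toy_grad, [1.0, 1.0])
    best_lr, best_loss = min(hist, key=lambda pair: pair[1])
    print(f"steps run: {len(hist)}, lr at lowest loss: {best_lr:.3f}")
```

With a real network, the same loop would record the validation/test loss, and the learning rate where that loss stops improving would serve as a candidate upper bound for CLR or the 1cycle policy.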


This review was generated by AI and checked by a human editor.