QUICK REVIEW

[Paper Review] Why Does Stagewise Training Accelerate Convergence of Testing Error Over SGD

Tianbao Yang, Yan Yan|arXiv (Cornell University)|Dec 10, 2018

Stochastic Gradient Optimization Techniques|Cited by 3

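One-Sentence Summary

This paper explains why stagewise training accelerates convergence in neural network optimization by proposing and analyzing a stagewise regularized training algorithm, which decreases the learning rate geometrically from stage to stage and applies explicit regularization within each stage. For empirical risks that satisfy the Polyak-Łojasiewicz condition, with convex or weakly convex loss functions, the algorithm converges faster than vanilla SGD in both training error and testing error, and its testing error bounds are independent of dimensionality and parameter norm.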

ABSTRACT

Stagewise training strategy is commonly used for learning neural networks, which uses a stochastic algorithm (e.g., SGD) starting with a relatively large step size (aka learning rate) and geometrically decreasing the step size after a number of iterations. It has been observed that the stagewise SGD has much faster convergence than the vanilla SGD with a polynomial decaying step size in terms of both training error and testing error. But how to explain this phenomenon has been largely ignored by existing studies. This paper provides some theoretical evidence for explaining this faster convergence. In particular, we consider the stagewise training strategy for minimizing empirical risk that satisfies the Polyak-Łojasiewicz condition, which has been observed/proved for neural networks and also holds for a broad family of convex functions. For convex loss functions and nice-behaviored non-convex loss functions that are close to a convex function (namely weakly convex functions), we establish faster convergence of stagewise training than the vanilla SGD under the same condition on both training error and testing error. Indeed, the proposed algorithm has additional favorable features that come with theoretical guarantee for the considered non-convex optimization problems, including using explicit algorithmic regularization at each stage, using stagewise averaged solution for restarting, and returning the last stagewise averaged solution as the final solution. To differentiate from commonly used stagewise SGD, we refer to our algorithm as stagewise regularized training algorithm. Of independent interest, the proved testing error bounds for a family of non-convex loss functions are dimensionality and norm independent.
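The two step-size schedules contrasted in the abstract can be illustrated with a small Python sketch. The function names, the base step size, the stage length, and the decay factor are all hypothetical choices made for illustration; the abstract does not specify them.

```python
def polynomial_step(t, eta0=0.1, power=1.0):
    """Polynomially decaying step size: eta0 / t**power, for t >= 1."""
    return eta0 / (t ** power)


def stagewise_step(t, eta0=0.1, stage_len=100, decay=0.5):
    """Stagewise step size: constant within a stage, multiplied by
    `decay` each time a new stage of `stage_len` iterations begins."""
    stage = (t - 1) // stage_len  # stage index, starting at 0
    return eta0 * decay ** stage


# Print both schedules at a few iterations to compare them side by side.
for t in (1, 50, 101, 150, 301):
    print(t, polynomial_step(t), stagewise_step(t))
```

In this sketch the stagewise schedule holds a comparatively large step size for the whole first stage, while the polynomial schedule starts shrinking from the first iteration.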

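Research Motivation and Objectives

Explain the empirical observation that stagewise training accelerates convergence of testing error compared with vanilla SGD using a polynomially decaying learning rate.
Provide theoretical justification for the faster convergence of stagewise training when minimizing an empirical risk that satisfies the Polyak-Łojasiewicz condition.
Establish testing error bounds, independent of dimensionality and parameter norm, for a family of non-convex loss functions.
Propose and analyze a stagewise regularized training algorithm that combines explicit regularization at each stage, restarting from the stagewise averaged solution, and returning the last stagewise averaged solution as the final output.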
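For reference, the two conditions named above are commonly stated as follows. This is a sketch using standard textbook forms; the symbols F_S (empirical risk), f (loss), \mu, and \rho, as well as the exact normalization of the constants, are assumptions and may differ from the paper's notation.

```latex
% Polyak-Lojasiewicz (PL) condition on the empirical risk F_S, with constant \mu > 0
2\mu \bigl( F_S(w) - \min_{v} F_S(v) \bigr) \le \| \nabla F_S(w) \|^2
\quad \text{for all } w

% \rho-weak convexity of a loss f, with \rho \ge 0
w \mapsto f(w) + \frac{\rho}{2} \| w \|^2 \ \text{is convex}
```

The PL condition bounds the suboptimality gap by the squared gradient norm without requiring convexity, while weak convexity means the function becomes convex once a quadratic term is added.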

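Proposed Method

The algorithm runs stochastic gradient descent in stages, with a learning rate that decreases geometrically from one stage to the next.
Explicit algorithmic regularization is applied at each stage, with the aim of stabilizing optimization and improving generalization.
The iterates within each stage are averaged, and the average serves as the starting point of the next stage.
The final output is the last stagewise averaged solution, for which the error bounds are proved.
The theoretical analysis rests on the Polyak-Łojasiewicz condition, which has been observed or proved for neural networks and also holds for a broad family of convex functions.
The analysis yields generalization (testing) error bounds that are independent of data dimensionality and parameter norm for a family of non-convex loss functions.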
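The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the quadratic proximal form of the regularizer, the doubling of the stage length, every hyperparameter name and default, and the toy least-squares problem are hypothetical choices not given in this summary.

```python
import numpy as np


def stagewise_regularized_sgd(grad_fn, w0, n_samples, *, n_stages=5, eta0=0.05,
                              iters0=200, gamma=10.0, decay=0.5, seed=0):
    """Hypothetical sketch of SGD run in stages with explicit regularization.

    grad_fn(w, i) returns a stochastic gradient of the loss on sample i.
    """
    rng = np.random.default_rng(seed)
    w_ref = np.asarray(w0, dtype=float)  # reference point of the current stage
    eta, iters = eta0, iters0
    for _ in range(n_stages):
        w = w_ref.copy()
        w_sum = np.zeros_like(w)
        for _ in range(iters):
            i = rng.integers(n_samples)
            # Assumed regularizer: ||w - w_ref||^2 / (2 * gamma), added to the loss.
            g = grad_fn(w, i) + (w - w_ref) / gamma
            w = w - eta * g
            w_sum += w
        w_ref = w_sum / iters        # restart from the stagewise averaged solution
        eta *= decay                 # geometric decrease of the learning rate
        iters = int(iters / decay)   # assumption: longer stages as eta shrinks
    return w_ref                     # last stagewise averaged solution


# Toy usage on noiseless least squares (illustrative data only).
data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(200, 5))
w_true = np.arange(1.0, 6.0)
y = X @ w_true


def grad_fn(w, i):
    # Gradient of 0.5 * (x_i . w - y_i)^2 with respect to w.
    return (X[i] @ w - y[i]) * X[i]


w_hat = stagewise_regularized_sgd(grad_fn, np.zeros(5), n_samples=200)
print("relative error:", np.linalg.norm(w_hat - w_true) / np.linalg.norm(w_true))
```

Averaging inside each stage and restarting from that average mirrors the method description above; the regularizer keeps each stage's iterates close to the previous stage's output.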

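Experimental Results

Research Questions

RQ1: Why does stagewise training accelerate convergence of testing error compared with vanilla SGD using a polynomially decaying learning rate?
RQ2: Under what conditions does stagewise training converge faster in both training error and testing error?
RQ3: Can generalization error bounds that are independent of data dimensionality and parameter norm be derived for non-convex loss functions?
RQ4: What role does explicit regularization at each stage play in improving convergence and generalization?
RQ5: How does using averaged solutions across stages contribute to better testing error?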

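Key Findings

Under the Polyak-Łojasiewicz condition, the stagewise regularized training algorithm converges faster than vanilla SGD in both training error and testing error.
The proposed method yields testing error bounds that are independent of dimensionality and parameter norm for a family of non-convex loss functions.
Explicit regularization at each stage supports stable convergence and better generalization.
Using the stagewise averaged solution as the restart point improves performance and is part of what secures the theoretical guarantees.
The final solution, taken as the last stagewise averaged iterate, comes with provable error bounds.
The theoretical analysis shows that the faster convergence observed for stagewise training can be explained mathematically by the optimization dynamics under the Polyak-Łojasiewicz condition.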
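The link between training error and testing error in these findings can be made concrete with a standard error decomposition. This is a sketch, not quoted from the paper; F (population risk), F_S (empirical risk on a sample S), and w_* (a minimizer of F) are assumed notation.

```latex
\mathbb{E}\bigl[ F(w) - F(w_*) \bigr]
\le \underbrace{\mathbb{E}\bigl[ F(w) - F_S(w) \bigr]}_{\text{generalization error}}
  + \underbrace{\mathbb{E}\bigl[ F_S(w) - \min_{v} F_S(v) \bigr]}_{\text{optimization (training) error}}
```

The inequality uses \mathbb{E}[\min_v F_S(v)] \le \mathbb{E}[F_S(w_*)] = F(w_*), so a testing error bound follows from controlling the generalization and optimization terms separately.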

This review was generated by AI and reviewed by a human editor.