QUICK REVIEW

[Paper Review] The Step Decay Schedule: A Near Optimal, Geometrically Decaying Learning Rate Procedure For Least Squares

Rong Ge, Sham M. Kakade|arXiv (Cornell University)|Apr 29, 2019
Stochastic Gradient Optimization Techniques|References: 61|Cited by: 66
ONE-SENTENCE SUMMARY

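This paper shows that the final iterate of SGD with polynomially decaying step sizes is suboptimal for streaming least squares problems, and introduces a step decay (geometric) schedule that comes within logarithmic factors of the minimax rate in the known-horizon setting.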

ABSTRACT

Minimax optimal convergence rates for classes of stochastic convex optimization problems are well characterized, where the majority of results utilize iterate averaged stochastic gradient descent (SGD) with polynomially decaying step sizes. In contrast, SGD's final iterate behavior has received much less attention despite its widespread use in practice. Motivated by this observation, this work provides a detailed study of the following question: what rate is achievable using the final iterate of SGD for the streaming least squares regression problem with and without strong convexity? First, this work shows that even if the time horizon T (i.e. the number of iterations SGD is run for) is known in advance, SGD's final iterate behavior with any polynomially decaying learning rate scheme is highly sub-optimal compared to the minimax rate (by a condition number factor in the strongly convex case and a factor of $\sqrt{T}$ in the non-strongly convex case). In contrast, this paper shows that Step Decay schedules, which cut the learning rate by a constant factor every constant number of epochs (i.e., the learning rate decays geometrically) offer significant improvements over any polynomially decaying step sizes. In particular, the final iterate behavior with a step decay schedule is off the minimax rate by only $\log$ factors (in the condition number for strongly convex case, and in T for the non-strongly convex case). Finally, in stark contrast to the known horizon case, this paper shows that the anytime (i.e. the limiting) behavior of SGD's final iterate is poor (in that it queries iterates with highly sub-optimal function value infinitely often, i.e. in a limsup sense) irrespective of the stepsizes employed. These results demonstrate the subtlety in establishing optimal learning rate schemes (for the final iterate) for stochastic gradient procedures in fixed time horizon settings.

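RESEARCH MOTIVATION AND OBJECTIVES

  • Characterize the final-iterate behavior of SGD on streaming least squares problems, with and without strong convexity.
  • Show that polynomially decaying step sizes are suboptimal for the final iterate.
  • Propose and analyze a geometrically decaying step decay schedule that comes close to the minimax rate.
  • Contrast the known-horizon results with the anytime (limiting) behavior of SGD.
  • Provide empirical validation on synthetic least squares and on a residual network trained on CIFAR-10.
  • Discuss the practical implications for hyperparameter tuning when the horizon is known.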

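PROPOSED METHOD

  • Formalize SGD with a stochastic gradient oracle for the least squares problem, under assumptions on the noise and on the fourth moment of the covariates.
  • Define the step sizes: polynomial decay $\eta_t \sim a/(b + t^{\alpha})$ and the geometrically decaying step decay schedule (Algorithm 1).
  • Derive lower bounds showing that polynomial decay is suboptimal for the final iterate in both the strongly convex and non-strongly convex cases.
  • Prove upper bounds showing that step decay attains a near-minimax rate, with an excess risk bound that differs from the minimax rate by only a log(T) factor.
  • Report experiments on CIFAR-10 with a ResNet-44, comparing decay schemes and discussing the effect of suffix averaging.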
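
The two step-size families named above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's Algorithm 1: the halving factor, the choice of roughly log2(T) equal-length phases, and the function names `polynomial_decay` and `step_decay` are all assumptions made for the example.

```python
import math

def polynomial_decay(t, a=1.0, b=1.0, alpha=0.5):
    """Polynomially decaying step size: eta_t = a / (b + t**alpha)."""
    return a / (b + t ** alpha)

def step_decay(t, eta0, T, factor=2.0):
    """Step decay: constant rate within a phase, divided by `factor` between phases.

    Assumption for this sketch: about log2(T) phases of equal length,
    with the first phase running at eta0.
    """
    num_phases = max(1, int(math.log2(T)))
    phase_len = math.ceil(T / num_phases)
    phase = t // phase_len  # 0-indexed phase containing step t
    return eta0 / factor ** phase

# Print both schedules at a few points of a horizon of T = 1024 steps.
T = 1024
for t in (0, 256, 512, 1023):
    print(t, polynomial_decay(t, a=0.1), step_decay(t, eta0=0.1, T=T))
```

Note that `step_decay` needs only the initial rate and the horizon `T`, whereas `polynomial_decay` exposes three constants to tune.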

EXPERIMENTAL RESULTS

RESEARCH QUESTIONS

  • RQ1: Can SGD’s final iterate match minimax rates for streaming least squares under fixed horizon T?
  • RQ2: Do polynomially decaying stepsizes yield suboptimal final-iterate performance compared to step-decay schedules?
  • RQ3: How close to minimax rates are achieved by step-decay schedules in strongly and non-strongly convex least squares?
  • RQ4: What is the difference between known-horizon and anytime behavior of SGD’s final iterate?
  • RQ5: Do empirical results on real-world networks support the theoretical benefits of step-decay schedules?

KEY FINDINGS

  • Polynomially decaying stepsizes yield suboptimal final-iterate rates, with gaps scaling with the condition number $\kappa$ (strongly convex) or with $\sqrt{T}/\log T$ (non-strongly convex).
  • Step decay schedules achieve near-minimax rates, with final-iterate excess risk off by only a $\log T$ factor in both strongly and non-strongly convex least squares under a known horizon.
  • In the strongly convex case, lower bounds show that the final iterate under any polynomial decay incurs a factor-$\kappa$ suboptimality; in the non-strongly convex (smooth) case, a $\sqrt{T}/\log T$ gap is shown.
  • The step decay scheme requires only the initial learning rate and the end time $T$ for implementation; a refinement can reduce the log factors to $\log \kappa$ in the strongly convex case.
  • Anytime (limiting) behavior of SGD’s final iterate remains poor regardless of stepsize scheme, with limsup suboptimality bounded away from minimax rates.
  • Empirical results on CIFAR-10 with a ResNet-44 show continuous step-decay (exponential) often outperforms polynomial decays; suffix averaging can harm generalization in non-convex settings.
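
To get a feel for the final-iterate comparison on synthetic data, a small streaming least squares simulation can be set up as below. It is a hedged sketch rather than the paper's experimental protocol: the Gaussian covariates with identity covariance, the dimension, the noise level, the constants `eta0 = 0.05` and `T = 2000`, and the helper name `sgd_final_iterate_risk` are all assumptions chosen for illustration.

```python
import math
import numpy as np

def sgd_final_iterate_risk(schedule, T, d=10, noise_std=1.0, seed=0):
    """Run streaming SGD on synthetic least squares; return the final iterate's excess risk."""
    rng = np.random.default_rng(seed)
    w_star = rng.standard_normal(d)  # hypothetical ground-truth weights
    w = np.zeros(d)
    for t in range(T):
        x = rng.standard_normal(d)  # fresh sample every step (streaming setting)
        y = x @ w_star + noise_std * rng.standard_normal()
        grad = (x @ w - y) * x  # gradient of 0.5 * (x.w - y)^2
        w -= schedule(t) * grad
    # Covariates have identity covariance, so the excess risk is 0.5 * ||w - w*||^2.
    return 0.5 * float(np.sum((w - w_star) ** 2))

T, eta0 = 2000, 0.05
phase_len = math.ceil(T / max(1, int(math.log2(T))))

def step(t):
    # Step decay (assumed form): halve the rate at the end of each phase.
    return eta0 / 2.0 ** (t // phase_len)

def poly(t):
    # Polynomial decay a / (b + t**alpha) with a = eta0, b = 1, alpha = 0.5.
    return eta0 / (1.0 + math.sqrt(t))

risk_step = sgd_final_iterate_risk(step, T)
risk_poly = sgd_final_iterate_risk(poly, T)
print(f"step decay: {risk_step:.4f}  polynomial decay: {risk_poly:.4f}")
```

The numbers from a single run depend on the seed and on the constants above, so a meaningful comparison would average over many seeds and tune each schedule's constants separately.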

This review was generated by AI and reviewed by human editors.