QUICK REVIEW

[Paper Review] Why Does Stagewise Training Accelerate Convergence of Testing Error Over SGD

Tianbao Yang, Yan Yan|arXiv (Cornell University)|Dec 10, 2018

Stochastic Gradient Optimization Techniques | Cited by 3

One-sentence summary

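This paper explains why stagewise training accelerates convergence in neural network optimization by proposing a stagewise regularized training algorithm that decreases the step size geometrically from stage to stage and applies explicit regularization at each stage. When the empirical risk satisfies the Polyak-Łojasiewicz (PL) condition and the loss is convex or weakly convex, the algorithm is shown to converge faster than vanilla SGD in both training error and testing error, and its testing error bounds are independent of dimensionality and norm.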

ABSTRACT

Stagewise training strategy is commonly used for learning neural networks, which uses a stochastic algorithm (e.g., SGD) starting with a relatively large step size (aka learning rate) and geometrically decreasing the step size after a number of iterations. It has been observed that the stagewise SGD has much faster convergence than the vanilla SGD with a polynomial decaying step size in terms of both training error and testing error. But how to explain this phenomenon has been largely ignored by existing studies. This paper provides some theoretical evidence for explaining this faster convergence. In particular, we consider the stagewise training strategy for minimizing empirical risk that satisfies the Polyak-Łojasiewicz condition, which has been observed/proved for neural networks and also holds for a broad family of convex functions. For convex loss functions and nice-behaviored non-convex loss functions that are close to a convex function (namely weakly convex functions), we establish faster convergence of stagewise training than the vanilla SGD under the same condition on both training error and testing error. Indeed, the proposed algorithm has additional favorable features that come with theoretical guarantee for the considered non-convex optimization problems, including using explicit algorithmic regularization at each stage, using stagewise averaged solution for restarting, and returning the last stagewise averaged solution as the final solution. To differentiate from commonly used stagewise SGD, we refer to our algorithm as stagewise regularized training algorithm. Of independent interest, the proved testing error bounds for a family of non-convex loss functions are dimensionality and norm independent.

Motivation and objectives

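Explain the empirical observation that stagewise training converges faster in testing error than vanilla SGD with a polynomially decaying step size.
Give theoretical support for the faster convergence of stagewise training when minimizing an empirical risk that satisfies the Polyak-Łojasiewicz condition.
Establish testing error bounds that are independent of dimensionality and norm for a family of non-convex loss functions.
Introduce and analyze a stagewise regularized training algorithm that uses explicit regularization at each stage, restarts each stage from a stagewise averaged solution, and returns the last stagewise averaged solution as the final solution.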
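
For reference, the two conditions named above are usually written as follows. This is the standard textbook form, stated here as an assumption about notation: the symbols F_S (empirical risk on a sample S), f (a loss function), \mu, and \rho are not defined in this review, and the paper's exact constants may differ.

```latex
% Polyak-Lojasiewicz (PL) condition with parameter \mu > 0 on the empirical risk F_S:
\[
  2\mu \Bigl( F_S(w) - \min_{v} F_S(v) \Bigr) \;\le\; \bigl\lVert \nabla F_S(w) \bigr\rVert^{2}
  \qquad \text{for all } w .
\]

% \rho-weak convexity of a loss f, with \rho \ge 0:
\[
  w \;\mapsto\; f(w) + \frac{\rho}{2} \lVert w \rVert^{2} \quad \text{is convex.}
\]
```

The PL condition does not require convexity: it only says that a small gradient implies a small gap to the minimum value. With \rho = 0, weak convexity reduces to ordinary convexity.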

Proposed method

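The algorithm runs stochastic gradient descent in stages, decreasing the step size geometrically from one stage to the next.
At each stage it applies explicit algorithmic regularization, intended to stabilize optimization and improve generalization.
Within each stage the iterates are averaged, and the average is used as the starting point of the next stage.
The final output is the last stagewise averaged solution, which comes with a theoretical guarantee.
The theoretical analysis relies on the Polyak-Łojasiewicz condition, which has been observed or proved for neural networks and also holds for a broad family of convex functions.
The derived testing error bounds for a family of non-convex loss functions are independent of dimensionality and parameter norm.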
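
The steps listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation. The quadratic proximal term used as the per-stage regularizer, the halving of the step size, the doubling of the stage length, the toy least-squares problem, and all names (`stagewise_regularized_sgd`, `make_least_squares`, `gamma`, `eta0`, `iters0`) are hypothetical choices made for this example.

```python
import numpy as np


def stagewise_regularized_sgd(grad_fn, w0, n_samples, num_stages=5,
                              eta0=0.05, iters0=100, gamma=10.0, seed=0):
    """Sketch of stagewise SGD with per-stage regularization and averaging.

    Assumptions for illustration only: the regularizer is
    (1 / (2 * gamma)) * ||w - w_ref||^2, the step size is halved and the
    stage length doubled after every stage.
    """
    rng = np.random.default_rng(seed)
    w_ref = np.asarray(w0, dtype=float).copy()  # restart point of the stage
    eta, iters = eta0, iters0
    for _ in range(num_stages):
        w = w_ref.copy()
        w_sum = np.zeros_like(w)
        for _ in range(iters):
            i = rng.integers(n_samples)  # sample one training example
            # Stochastic gradient of the loss plus gradient of the proximal term.
            g = grad_fn(w, i) + (w - w_ref) / gamma
            w = w - eta * g
            w_sum += w
        w_ref = w_sum / iters  # stagewise averaged solution, used for restarting
        eta *= 0.5             # geometric decrease of the step size
        iters *= 2             # longer stages as the step size shrinks
    return w_ref               # last stagewise averaged solution


def make_least_squares(n=200, d=5, seed=1):
    """Toy noiseless least-squares problem (hypothetical test bed)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    y = X @ rng.normal(size=d)

    def grad_fn(w, i):
        # Gradient of 0.5 * (x_i . w - y_i)^2 with respect to w.
        return (X[i] @ w - y[i]) * X[i]

    def risk(w):
        return 0.5 * np.mean((X @ w - y) ** 2)

    return grad_fn, risk, n, d


grad_fn, risk, n, d = make_least_squares()
w_init = np.zeros(d)
w_out = stagewise_regularized_sgd(grad_fn, w_init, n_samples=n)
print("initial empirical risk:", risk(w_init))
print("final empirical risk:  ", risk(w_out))
```

The proximal term keeps each stage close to its restart point, which is one common way to make a weakly convex stage objective easier to handle. Whether it matches the paper's regularizer exactly is an assumption of this sketch.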

Results

Research questions

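RQ1: Why does stagewise training converge faster in testing error than vanilla SGD with a polynomially decaying step size?
RQ2: Under what conditions does stagewise training achieve faster convergence in both training error and testing error?
RQ3: Can testing error bounds that are independent of dimensionality and parameter norm be derived for non-convex loss functions?
RQ4: What role does explicit regularization at each stage play in convergence and generalization?
RQ5: Why does restarting each stage from an averaged solution lead to better testing error?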

Key findings

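Under the Polyak-Łojasiewicz condition, the stagewise regularized training algorithm converges faster than vanilla SGD in both training error and testing error.
The proposed method provides testing error bounds that are independent of dimensionality and norm for a family of non-convex loss functions.
Explicit regularization at each stage contributes to generalization and stable convergence.
Restarting each stage from the stagewise averaged solution is one of the features covered by the theoretical guarantee.
The final solution, taken as the last stagewise averaged solution, comes with provable error bounds.
The theoretical analysis indicates that the faster convergence observed empirically for stagewise training can be explained by the optimization dynamics under the Polyak-Łojasiewicz condition.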
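
A standard way to relate the training error and testing error claims above is the following decomposition of the expected excess risk. It is a generic textbook inequality, not a statement quoted from the paper. The symbols are assumptions of this note: F is the population risk, F_S the empirical risk on a sample S, and w_S the solution returned by the algorithm when trained on S.

```latex
\[
  \mathbb{E}\bigl[ F(w_S) \bigr] - \min_{w} F(w)
  \;\le\;
  \underbrace{\mathbb{E}\bigl[ F(w_S) - F_S(w_S) \bigr]}_{\text{generalization error}}
  \;+\;
  \underbrace{\mathbb{E}\bigl[ F_S(w_S) - \min_{w} F_S(w) \bigr]}_{\text{optimization error}}
\]
```

The inequality holds because the expected minimum of the empirical risk is at most the minimum of the population risk, that is, \mathbb{E}[\min_w F_S(w)] \le \min_w F(w). Read this way, faster testing error convergence requires controlling both terms, not only the optimization error.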


This review was generated by AI and checked by a human editor.