QUICK REVIEW

[Paper Review] The Power of Interpolation: Understanding the Effectiveness of SGD in Modern Over-parametrized Learning

Siyuan Ma, Raef Bassily|arXiv (Cornell University)|Dec 18, 2017

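Stochastic Gradient Optimization Techniques | References: 20 | Citations: 38

One-line summary

This paper explains why stochastic gradient descent (SGD) with small mini-batches converges quickly when an over-parametrized model interpolates the training data. It identifies a critical mini-batch size $m^*$: for $m \leq m^*$, one SGD iteration with mini-batch size $m$ is nearly equivalent to $m$ iterations with mini-batch size 1 (linear scaling), while for $m > m^*$ it is nearly equivalent to one full gradient descent iteration (saturation). Because $m^*$ is nearly independent of the data size $n$, SGD in the interpolation regime achieves an $O(n)$ acceleration over full gradient descent per unit of computation.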

ABSTRACT

In this paper we aim to formally explain the phenomenon of fast convergence of SGD observed in modern machine learning. The key observation is that most modern learning architectures are over-parametrized and are trained to interpolate the data by driving the empirical loss (classification and regression) close to zero. While it is still unclear why these interpolated solutions perform well on test data, we show that these regimes allow for fast convergence of SGD, comparable in number of iterations to full gradient descent. For convex loss functions we obtain an exponential convergence bound for mini-batch SGD parallel to that for full gradient descent. We show that there is a critical batch size $m^*$ such that: (a) SGD iteration with mini-batch size $m\leq m^*$ is nearly equivalent to $m$ iterations of mini-batch size $1$ (linear scaling regime). (b) SGD iteration with mini-batch $m> m^*$ is nearly equivalent to a full gradient descent iteration (saturation regime). Moreover, for the quadratic loss, we derive explicit expressions for the optimal mini-batch and step size and explicitly characterize the two regimes above. The critical mini-batch size can be viewed as the limit for effective mini-batch parallelization. It is also nearly independent of the data size, implying $O(n)$ acceleration over GD per unit of computation. We give experimental evidence on real data which closely follows our theoretical analyses. Finally, we show how our results fit in the recent developments in training deep neural networks and discuss connections to adaptive rates for SGD and variance reduction.

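Motivation and objectives

Explain why small mini-batch SGD succeeds in practice in modern over-parametrized learning, where models interpolate the training data.
Analyze the convergence rate of mini-batch SGD in the interpolation regime, where the training error is driven close to zero.
Identify the critical mini-batch size $m^*$ that marks the transition between linear scaling and saturation in SGD efficiency.
Give theoretical bounds on convergence rate and computational efficiency, showing that SGD is comparable to full gradient descent in number of iterations.
Connect the theoretical findings to practical techniques (e.g., the linear scaling rule in deep learning).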

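Proposed method

Analyzes convex loss functions under the interpolation assumption, i.e., that the optimal solution attains zero training error.
Derives an exponential convergence bound for mini-batch SGD and shows how it depends on the mini-batch size $m$ and the step size.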
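Defines the critical mini-batch size for the quadratic loss as $m^* = \frac{\beta}{\lambda_1} + 1$, where $\beta$ bounds the squared norm of the data points and $\lambda_1$ is the largest eigenvalue of the Hessian, and separates two regimes (linear scaling: $m \leq m^*$; saturation: $m > m^*$).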
Characterizes convergence rates in the over-parametrized setting via spectral analysis of the Hessian, and discusses connections to variance reduction.
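Derives explicit expressions for the optimal mini-batch size and step size in the case of the quadratic loss.
Validates the theoretical predictions through experiments on the MNIST, TIMIT, and HINT-S datasets using kernel methods and deep learning.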
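
The setting analyzed above can be illustrated with a minimal NumPy sketch; this is not the authors' code. It assumes a synthetic over-parametrized least-squares problem (more features than samples, noiseless labels) so that interpolation holds, samples mini-batches with replacement, and uses the step size $\eta(m) = m / (\beta + (m-1)\lambda_1)$ with $\beta = \max_i \|x_i\|^2$ and $\lambda_1$ the largest eigenvalue of $H = X^\top X / n$. The function names, the problem sizes, and this step-size choice are assumptions made for illustration.

```python
import numpy as np


def make_interpolation_problem(n=50, d=200, seed=0):
    """Synthetic least-squares data with d > n and noiseless labels,
    so some w fits every training point exactly (assumed setup)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d)) / np.sqrt(d)
    w_true = rng.standard_normal(d)
    y = X @ w_true
    return X, y


def empirical_loss(X, y, w):
    """Quadratic loss L(w) = (1 / 2n) * sum_i (x_i . w - y_i)^2."""
    residual = X @ w - y
    return 0.5 * np.mean(residual ** 2)


def minibatch_sgd(X, y, m, eta, steps, seed=0):
    """Constant-step mini-batch SGD, sampling indices with replacement."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    losses = [empirical_loss(X, y, w)]
    for _ in range(steps):
        idx = rng.integers(0, n, size=m)
        Xb, yb = X[idx], y[idx]
        grad = Xb.T @ (Xb @ w - yb) / m  # mini-batch gradient of L
        w -= eta * grad
        losses.append(empirical_loss(X, y, w))
    return w, np.array(losses)


X, y = make_interpolation_problem()
n = X.shape[0]
beta = np.max(np.sum(X ** 2, axis=1))        # max_i ||x_i||^2
lam1 = np.linalg.eigvalsh(X.T @ X / n)[-1]   # largest eigenvalue of H
m_star = beta / lam1 + 1.0                   # assumed critical batch size

m = 8
eta = m / (beta + (m - 1) * lam1)            # assumed step size, see lead-in
w, losses = minibatch_sgd(X, y, m=m, eta=eta, steps=1000)
print(f"m* ~ {m_star:.1f}, loss {losses[0]:.3e} -> {losses[-1]:.3e}")
```

On this synthetic problem the training loss should fall by several orders of magnitude with a constant step size, which is the qualitative behavior an exponential convergence bound describes. The exact numbers depend on the seed and the problem sizes.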

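Experimental results

Research questions

RQ1: Why does small mini-batch SGD outperform full gradient descent in practice, despite its slower theoretical convergence rate?
RQ2: Why do over-parametrization and data interpolation enable fast SGD convergence?
RQ3: What is the critical mini-batch size $m^*$ that marks the transition between linear scaling and saturation in SGD efficiency?
RQ4: How does changing the mini-batch size affect the computational efficiency of SGD in the interpolation regime?
RQ5: Can the linear scaling rule widely used in deep learning be theoretically justified in the interpolation setting?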

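Main findings

For convex loss functions in the interpolation regime, mini-batch SGD achieves exponential convergence, comparable in number of iterations to full gradient descent.
There is a critical mini-batch size $m^*$ such that, for $m \leq m^*$, one SGD iteration with mini-batch size $m$ is nearly equivalent to $m$ iterations with mini-batch size 1 (linear scaling regime).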
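For $m > m^*$, the benefit of increasing the mini-batch size diminishes: one SGD iteration is nearly equivalent to a full gradient descent iteration, so per-iteration progress saturates while the cost per iteration keeps growing (saturation regime).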
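The critical mini-batch size $m^*$ is nearly independent of the data size $n$. Since one SGD iteration at a batch size near $m^*$ costs about $m^*$ gradient evaluations yet makes progress comparable to a full gradient descent iteration costing $n$, this implies an $O(n)$ acceleration over full gradient descent per unit of computation.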
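For the quadratic loss, explicit expressions for the optimal mini-batch size and step size are derived, and the two regimes are characterized explicitly.
Experiments on MNIST, TIMIT, and HINT-S show training-error profiles that closely match the theoretical predictions, with similar relative efficiency across different kernels and data distributions.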
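
The two regimes in the findings above can be made concrete with a small numerical sketch. It assumes the quadratic-loss critical size $m^* = \beta/\lambda_1 + 1$ and the step size $\eta(m) = m / (\beta + (m-1)\lambda_1)$; both are stated here as working assumptions rather than quoted from the paper. The values of `beta` and `lam1` are hypothetical, and the step-size ratio $\eta(m)/\eta(1)$ is used only as a rough proxy for per-iteration speedup over batch size 1.

```python
def critical_batch_size(beta, lam1):
    """Assumed critical mini-batch size m* = beta / lam1 + 1."""
    return beta / lam1 + 1.0


def per_iteration_speedup(m, beta, lam1):
    """Step-size ratio eta(m) / eta(1) under the assumed
    eta(m) = m / (beta + (m - 1) * lam1)."""
    return m * beta / (beta + (m - 1) * lam1)


def per_computation_efficiency(m, beta, lam1):
    """Per-iteration speedup divided by the m gradient evaluations it costs."""
    return per_iteration_speedup(m, beta, lam1) / m


beta, lam1 = 1.0, 0.05  # hypothetical values with beta > lam1
m_star = critical_batch_size(beta, lam1)

for m in (1, 2, 4, 8, 16, 32, 128, 1024):
    s = per_iteration_speedup(m, beta, lam1)
    e = per_computation_efficiency(m, beta, lam1)
    regime = "linear scaling" if m <= m_star else "saturation"
    print(f"m={m:5d}  speedup={s:7.2f}  efficiency={e:5.2f}  ({regime})")
```

Under these assumptions the speedup is close to $m$ for small batches, equals $m^*/2$ at $m = m^*$, and never exceeds $\beta/\lambda_1$ however large the batch. That matches the qualitative picture of near-linear scaling followed by saturation, and efficiency per gradient evaluation falls steadily as $m$ grows.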

This review was generated by AI and checked by a human editor.