QUICK REVIEW

[Paper Review] Regularization Matters: Generalization and Optimization of Neural Nets v.s. their Induced Kernel

Colin Wei, Jason D. Lee|arXiv (Cornell University)|Oct 12, 2018

Stochastic Gradient Optimization Techniques|References: 78|Citations: 46

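One-line summary

This paper shows that explicit L2 regularization can improve the generalization of neural nets: on a constructed distribution, the regularized net learns with O(d) samples, whereas the NTK-based kernel requires Ω(d^2) samples. It also proves that, for regularized infinite-width two-layer nets, noisy gradient descent converges to a global minimum in polynomial time.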

ABSTRACT

Recent works have shown that on sufficiently over-parametrized neural nets, gradient descent with relatively large initialization optimizes a prediction function in the RKHS of the Neural Tangent Kernel (NTK). This analysis leads to global convergence results but does not work when there is a standard $\ell_2$ regularizer, which is useful to have in practice. We show that sample efficiency can indeed depend on the presence of the regularizer: we construct a simple distribution in $d$ dimensions which the optimal regularized neural net learns with $O(d)$ samples but the NTK requires $\Omega(d^2)$ samples to learn. To prove this, we establish two analysis tools: i) for multi-layer feedforward ReLU nets, we show that the global minimizer of a weakly-regularized cross-entropy loss is the max normalized margin solution among all neural nets, which generalizes well; ii) we develop a new technique for proving lower bounds for kernel methods, which relies on showing that the kernel cannot focus on informative features. Motivated by our generalization results, we study whether the regularized global optimum is attainable. We prove that for infinite-width two-layer nets, noisy gradient descent optimizes the regularized neural net loss to a global minimum in polynomial iterations.

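Motivation and objectives

Motivate, beyond the NTK analysis, the study of how over-parameterization and explicit regularization affect generalization.
Exhibit a concrete data distribution on which a regularized net succeeds with O(d) samples while the NTK requires Ω(d^2) samples.
Develop theoretical tools that link weak regularization to the max-margin solution, and prove margin-based generalization bounds.
Show that regularized infinite-width networks converge to a global minimum in polynomial time under a perturbed Wasserstein gradient flow.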
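
The link between weak regularization and the max-margin solution can be written out as follows. This is a sketch in notation assumed here rather than quoted from the paper: $f(\Theta; x)$ is the network with parameters $\Theta$, $\ell$ is the logistic (cross-entropy) loss, $\lambda$ is the regularization strength, the network is taken to be positively homogeneous in $\Theta$, and the training data are assumed separable by some net of the architecture.

```latex
\begin{align*}
L_\lambda(\Theta) &= \frac{1}{n}\sum_{i=1}^{n} \ell\bigl(y_i\, f(\Theta; x_i)\bigr) + \lambda \lVert \Theta \rVert_2^2,
  & \Theta_\lambda &\in \arg\min_{\Theta} L_\lambda(\Theta), \\
\gamma_\lambda &= \min_{i}\; y_i\, f\!\left(\frac{\Theta_\lambda}{\lVert \Theta_\lambda \rVert_2};\, x_i\right),
  & \gamma^\star &= \max_{\lVert \Theta \rVert_2 \le 1}\; \min_{i}\; y_i\, f(\Theta; x_i), \\
\gamma_\lambda &\to \gamma^\star \quad \text{as } \lambda \to 0.
\end{align*}
```

Read this way, the weakly regularized optimum inherits whatever generalization bound the maximum normalized margin $\gamma^\star$ provides.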

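Proposed method

Construct a d-dimensional distribution D whose signal is concentrated in the first two coordinates.
Analyze a two-layer ReLU net trained with the L2-regularized logistic loss, alongside the NTK induced by the same architecture.
Prove that, under weak regularization, the regularized net converges to the max-margin solution and generalizes well.
Introduce a perturbed Wasserstein gradient flow and prove polynomial-time convergence to a global minimum for infinite-width networks.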
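
The setup above can be sketched in a few lines of plain Python. The review only says that the signal sits in the first two coordinates, so the concrete choice below (x uniform on {-1,+1}^d with label y = x[0] * x[1]) is an assumption made for illustration. The function names and the hand-built four-unit net are likewise hypothetical helpers, not the authors' code.

```python
import math
import random


def sample_d(n, d, rng):
    """Draw n points from the ASSUMED distribution: x uniform on {-1,+1}^d, y = x[0]*x[1]."""
    xs = [[rng.choice((-1.0, 1.0)) for _ in range(d)] for _ in range(n)]
    ys = [x[0] * x[1] for x in xs]
    return xs, ys


def relu_net(W, a, x):
    """Two-layer ReLU net f(x) = sum_j a[j] * relu(<W[j], x>)."""
    return sum(aj * max(0.0, sum(w * xi for w, xi in zip(wj, x)))
               for wj, aj in zip(W, a))


def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def regularized_logistic_loss(W, a, xs, ys, lam):
    """Mean logistic loss plus lam times the squared L2 norm of all weights."""
    data = sum(softplus(-y * relu_net(W, a, x)) for x, y in zip(xs, ys)) / len(xs)
    sq_norm = sum(w * w for wj in W for w in wj) + sum(aj * aj for aj in a)
    return data + lam * sq_norm


def hand_built_net(d):
    """Four ReLU units that read only x[0] and x[1].

    On sign inputs the net computes |x0 + x1| - |x0 - x1| = 2 * x0 * x1.
    """
    pad = [0.0] * (d - 2)
    W = [[1.0, 1.0] + pad, [-1.0, -1.0] + pad, [1.0, -1.0] + pad, [-1.0, 1.0] + pad]
    a = [1.0, 1.0, -1.0, -1.0]
    return W, a


# Usage: the hand-built net classifies every sample with the same margin.
rng = random.Random(0)
xs, ys = sample_d(20, 10, rng)
W, a = hand_built_net(10)
margins = [y * relu_net(W, a, x) for x, y in zip(xs, ys)]
loss = regularized_logistic_loss(W, a, xs, ys, lam=0.01)
```

The point of the hand-built net is that a small-norm solution exists that ignores the d - 2 uninformative coordinates, which is the kind of solution a norm penalty favors.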

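Experimental results

Research questions

RQ1: Can explicit L2 regularization achieve a better margin and better generalization than the NTK?
RQ2: How large is the sample-complexity gap between the regularized neural net and NTK-based methods on the constructed data distribution?
RQ3: Is the regularized global optimum attainable by efficient optimization in the infinite-width limit?
RQ4: Does weak regularization push the optimum toward the max-margin solution across deep architectures?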
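
The normalized margin that RQ1 and RQ4 refer to can be made concrete with a small sketch. It assumes a two-layer ReLU net, which is 2-homogeneous in its parameters, so evaluating the net at the unit-norm parameters is the same as dividing the raw margin by the squared parameter norm. The helper names are hypothetical.

```python
def relu_net(W, a, x):
    """Two-layer ReLU net f(x) = sum_j a[j] * relu(<W[j], x>)."""
    return sum(aj * max(0.0, sum(w * xi for w, xi in zip(wj, x)))
               for wj, aj in zip(W, a))


def normalized_margin(W, a, xs, ys):
    """min_i y_i * f(theta / ||theta||; x_i) for theta = (W, a).

    Because the net is 2-homogeneous, f(theta / ||theta||; x) equals
    f(theta; x) / ||theta||^2, so the raw margin is divided by ||theta||^2.
    """
    sq_norm = sum(w * w for wj in W for w in wj) + sum(aj * aj for aj in a)
    raw = min(y * relu_net(W, a, x) for x, y in zip(xs, ys))
    return raw / sq_norm


# Usage: rescaling all weights leaves the normalized margin unchanged.
xs = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
ys = [x[0] * x[1] for x in xs]
W = [[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]]
a = [1.0, 1.0, -1.0, -1.0]
gamma = normalized_margin(W, a, xs, ys)
gamma_scaled = normalized_margin([[3.0 * w for w in wj] for wj in W],
                                 [3.0 * aj for aj in a], xs, ys)
```

Scale invariance is what makes this quantity a fair basis for comparing solutions of different norms.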

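Key findings

On the constructed distribution, the regularized neural net generalizes well with O(d) samples, whereas the NTK requires Ω(d^2) samples.
The global minimizer of the weakly regularized logistic loss attains the maximum normalized margin among networks of the same architecture.
Over-parameterizing the width helps: the maximum margin is non-decreasing in the network width, which improves the generalization bound.
For infinite-width two-layer networks, noisy gradient descent optimizes the regularized loss to a global minimum in polynomial time.
Empirical simulations support the claim that explicit regularization improves the margin and test accuracy relative to nets trained without regularization.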
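
To make the noisy gradient descent finding more tangible, here is a minimal finite-width sketch of one update on the L2-regularized logistic loss of a two-layer ReLU net. It is a stand-in under stated assumptions, not the paper's infinite-width procedure: the width is finite, the noise is plain Gaussian with a hypothetical `noise_std` parameter, and no convergence guarantee is claimed for it.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def noisy_gd_step(W, a, xs, ys, lam, lr, noise_std, rng):
    """One gradient step on the regularized logistic loss, plus Gaussian noise."""
    n = len(xs)
    # Gradient of the penalty lam * (||W||^2 + ||a||^2).
    grad_W = [[2.0 * lam * w for w in wj] for wj in W]
    grad_a = [2.0 * lam * aj for aj in a]
    for x, y in zip(xs, ys):
        pre = [sum(w * xi for w, xi in zip(wj, x)) for wj in W]
        f = sum(aj * max(0.0, p) for aj, p in zip(a, pre))
        # d/df log(1 + exp(-y f)) = -y * sigmoid(-y f), averaged over n samples.
        g = -y * sigmoid(-y * f) / n
        for j, p in enumerate(pre):
            if p > 0.0:  # only active ReLU units receive data gradient
                grad_a[j] += g * p
                for i, xi in enumerate(x):
                    grad_W[j][i] += g * a[j] * xi
    new_W = [[w - lr * gw + noise_std * rng.gauss(0.0, 1.0)
              for w, gw in zip(wj, gj)] for wj, gj in zip(W, grad_W)]
    new_a = [aj - lr * ga + noise_std * rng.gauss(0.0, 1.0)
             for aj, ga in zip(a, grad_a)]
    return new_W, new_a


# Usage: a short run on the four sign patterns in two dimensions (y = x0 * x1).
rng = random.Random(0)
xs = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
ys = [x[0] * x[1] for x in xs]
W = [[rng.gauss(0.0, 1.0) for _ in range(2)] for _ in range(8)]
a = [rng.gauss(0.0, 1.0) for _ in range(8)]
for _ in range(200):
    W, a = noisy_gd_step(W, a, xs, ys, lam=1e-3, lr=0.1, noise_std=1e-3, rng=rng)
```

Setting `noise_std` to zero recovers ordinary gradient descent on the same loss, which is a convenient way to check the gradient by hand on a one-unit net.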


This review was generated by AI and checked by a human editor.