QUICK REVIEW

[Paper Review] A Convergence Theory for Deep Learning via Over-Parameterization

Zeyuan Allen-Zhu, Yuanzhi Li|arXiv (Cornell University)|Nov 9, 2018

Reinforcement Learning in Robotics | References: 55 | Citations: 627

One-Line Summary

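This paper proves that over-parameterized deep neural networks trained from random initialization with SGD or gradient descent reach zero training error (or 100% training accuracy) in polynomial time under mild assumptions. The proof shows that, in a large neighborhood of the initialization, the objective is almost convex and the network is equivalent to its NTK.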

ABSTRACT

Deep neural networks (DNNs) have demonstrated dominating performance in many fields; since AlexNet, networks used in practice are going wider and deeper. On the theoretical side, a long line of works has been focusing on training neural networks with one hidden layer. The theory of multi-layer networks remains largely unsettled. In this work, we prove why stochastic gradient descent (SGD) can find $\textit{global minima}$ on the training objective of DNNs in $\textit{polynomial time}$. We only make two assumptions: the inputs are non-degenerate and the network is over-parameterized. The latter means the network width is sufficiently large: $\textit{polynomial}$ in $L$, the number of layers and in $n$, the number of samples. Our key technique is to derive that, in a sufficiently large neighborhood of the random initialization, the optimization landscape is almost-convex and semi-smooth even with ReLU activations. This implies an equivalence between over-parameterized neural networks and neural tangent kernel (NTK) in the finite (and polynomial) width setting. As concrete examples, starting from randomly initialized weights, we prove that SGD can attain 100% training accuracy in classification tasks, or minimize regression loss in linear convergence speed, with running time polynomial in $n,L$. Our theory applies to the widely-used but non-smooth ReLU activation, and to any smooth and possibly non-convex loss functions. In terms of network architectures, our theory at least applies to fully-connected neural networks, convolutional neural networks (CNN), and residual neural networks (ResNet).

Motivation and Objectives

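Motivate a theoretical understanding of why first-order methods succeed in training deep networks despite their non-convex, non-smooth objectives.
Show that over-parameterized deep networks can be trained from random initialization to zero training error in polynomial time.
Extend the theory of over-parameterization from two-layer networks to multi-layer networks, covering ReLU activations and a range of architectures.
Establish the relationship between over-parameterized networks and the NTK at finite, polynomial width.
Provide a framework that applies to fully-connected, CNN, and residual network architectures under mild data assumptions.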

Proposed Method

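Analyzes the training dynamics of an L-layer fully-connected ReLU network under ℓ2 regression (the analysis extends to other loss functions).
Proves that the objective is almost convex and semi-smooth near the initialization, which is what allows SGD/GD to converge in polynomial time.
Shows the equivalence between over-parameterized networks and the NTK at finite width, with m polynomial in n and L, rather than only in the infinite-width limit.
Derives the gradient formula and back-propagation structure using sign matrices D_{i,ℓ} to handle the non-smoothness of ReLU.
Shows that forward and backward propagation stay controlled across all L layers, so gradients neither explode nor vanish exponentially.
Provides a stability analysis under small perturbations and discusses implications for generalization through NTK behavior.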
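
The sign-matrix view of ReLU back-propagation can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the paper's code: the helper names (`init_network`, `forward`, `loss_and_grads`, `numeric_grad_entry`), the Gaussian variances, and the single-sample setup are all choices made here. Each sign matrix is assumed to be diagonal with 0/1 entries and is stored as a vector, so ReLU(g) = D g in the forward pass and the backward pass multiplies by the same D.

```python
import numpy as np

def init_network(d, m, depth, out_dim, rng):
    """Gaussian init; the 2/m and 1/out_dim variances are assumed scalings."""
    dims = [d] + [m] * depth
    weights = [rng.normal(0.0, np.sqrt(2.0 / dims[k + 1]),
                          size=(dims[k + 1], dims[k])) for k in range(depth)]
    B = rng.normal(0.0, np.sqrt(1.0 / out_dim), size=(out_dim, m))
    return weights, B

def forward(weights, x):
    """Return layer outputs hs and the 0/1 diagonals of the sign matrices."""
    h = x
    hs, signs = [h], []
    for W in weights:
        g = W @ h
        d = (g > 0).astype(g.dtype)   # diagonal of D: 1 where the ReLU is active
        h = d * g                     # same value as relu(g), written as D g
        hs.append(h)
        signs.append(d)
    return hs, signs

def loss_and_grads(weights, B, x, y):
    """Squared loss for one sample and its gradient w.r.t. every hidden matrix."""
    hs, signs = forward(weights, x)
    resid = B @ hs[-1] - y
    loss = 0.5 * float(resid @ resid)
    back = B.T @ resid                       # gradient w.r.t. the last hidden output
    grads = [None] * len(weights)
    for l in reversed(range(len(weights))):
        delta = signs[l] * back              # multiply by D for this layer
        grads[l] = np.outer(delta, hs[l])    # gradient w.r.t. weights[l]
        back = weights[l].T @ delta          # propagate to the previous layer
    return loss, grads

def numeric_grad_entry(weights, B, x, y, layer, i, j, eps=1e-6):
    """Central finite difference of the loss w.r.t. a single weight entry."""
    orig = weights[layer][i, j]
    weights[layer][i, j] = orig + eps
    up, _ = loss_and_grads(weights, B, x, y)
    weights[layer][i, j] = orig - eps
    down, _ = loss_and_grads(weights, B, x, y)
    weights[layer][i, j] = orig
    return (up - down) / (2 * eps)
```

Comparing an entry of `loss_and_grads` with `numeric_grad_entry` at a random point is a quick sanity check: away from the ReLU kinks the two should agree, because the sign pattern does not change under a tiny perturbation.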

Figure 1: Landscapes of the CIFAR10 image-classification training objective $F(W)$ near the SGD training trajectory. The blue vertical stick marks the current point $W=W_{t}$ at the current iteration $t$ . The $x$ and $y$ axes represent the gradient direction $\nabla F(W_{t})$ and the most negativel

Experimental Results

Research Questions

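RQ1: Can deep neural networks trained with SGD from random initialization reach zero training error under mild over-parameterization and non-degenerate data?
RQ2: How large must the hidden-layer width be, as a polynomial in n, L, and the data separation δ, to guarantee polynomial-time convergence?
RQ3: Do almost-convexity and semi-smoothness appear in the training landscape of multi-layer networks near the initialization?
RQ4: Does the equivalence between over-parameterized networks and the NTK hold at finite width, as it does in the infinite-width results?
RQ5: Do the results extend to CNNs and ResNets with ReLU activations, and to other loss functions?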
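
The review does not define the data separation δ that appears in RQ2. A minimal sketch, assuming δ is the smallest pairwise Euclidean distance between inputs after each input is scaled to unit norm (one reading of "non-degenerate" inputs; the function name `data_separation` is hypothetical):

```python
import numpy as np

def data_separation(X):
    """Scale rows to unit norm and return the smallest pairwise distance.

    Assumption: this matches the separation parameter delta; a value of 0
    means two inputs coincide after normalization (degenerate data).
    """
    X = np.asarray(X, dtype=float)
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    diffs = Xn[:, None, :] - Xn[None, :, :]      # all pairwise differences
    dists = np.linalg.norm(diffs, axis=-1)       # (n, n) distance matrix
    upper = np.triu_indices(len(Xn), k=1)        # each unordered pair once
    return float(dists[upper].min())
```

Under this reading, a smaller δ means a larger required width, since the width bound is polynomial in δ^{-1}.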

Key Findings

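Gradient descent finds an ε-error global minimum of the regression objective in polynomial time when the width satisfies m ≥ poly(n, L, δ^{-1})·d.
SGD, with a suitable learning rate and mini-batch size, reaches the same training-error target within poly(n, L, δ^{-1})·log^2 m iterations.
Near the initialization the objective is almost convex and semi-smooth, which rules out problematic saddle points and guarantees descent.
Over-parameterized networks and the NTK are equivalent at finite, polynomial width, not only in the infinite-width limit.
The analysis handles the non-smooth ReLU activation and extends to CNNs and ResNets.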
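
The convergence claim can be observed on a toy problem. The following is a rough sketch, not a reproduction of the paper's experiments or constants: the function name `train_wide_relu_net`, the sizes (n = 8, d = 20, m = 256), the step size 0.1/m, the random targets, and the choice to train only the two hidden matrices while keeping the input and output layers fixed are all assumptions made for illustration.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def train_wide_relu_net(n=8, d=20, m=256, steps=500, seed=0):
    """Full-batch gradient descent on a wide ReLU net with squared loss.

    Returns the loss history and the largest relative Frobenius-norm
    movement of a trained matrix away from its initialization.
    """
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)            # unit-norm inputs
    y = rng.normal(size=n)                                    # arbitrary targets

    A = rng.normal(0.0, np.sqrt(2.0 / m), size=(m, d))        # fixed input layer
    W = [rng.normal(0.0, np.sqrt(2.0 / m), size=(m, m)) for _ in range(2)]
    b = rng.normal(0.0, 1.0, size=m)                          # fixed output layer
    W0 = [w.copy() for w in W]
    lr = 0.1 / m                                              # assumed step size

    losses = []
    for step in range(steps + 1):
        H = [relu(X @ A.T)]
        for w in W:
            H.append(relu(H[-1] @ w.T))
        resid = H[-1] @ b - y
        losses.append(0.5 * float(resid @ resid))
        if step == steps:
            break
        back = np.outer(resid, b)               # gradient w.r.t. last hidden output
        grads = []
        for l in (1, 0):
            delta = back * (H[l + 1] > 0)       # apply the 0/1 ReLU pattern
            grads.append(delta.T @ H[l])        # gradient w.r.t. W[l]
            back = delta @ W[l]
        for w, g in zip(W, reversed(grads)):
            w -= lr * g

    rel_move = max(np.linalg.norm(w - w0) / np.linalg.norm(w0)
                   for w, w0 in zip(W, W0))
    return losses, float(rel_move)
```

Calling `losses, rel_move = train_wide_relu_net()` and inspecting `losses[0]`, `losses[-1]`, and `rel_move` should show the training loss falling sharply while the weights stay close to their initial values, which is the regime the almost-convexity and NTK statements describe. Narrower networks or larger step sizes may behave differently.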

Figure 2: ResNet-32 architecture [58] landscape on CIFAR10 vs CIFAR100.


This review was generated by AI and checked by a human editor.