QUICK REVIEW

[Paper Review] Dynamics of Deep Neural Networks and Neural Tangent Hierarchy

Jiaoyang Huang, Horng‐Tzer Yau | arXiv (Cornell University) | Sep 18, 2019

Model Reduction and Neural Networks | References: 34 | Citations: 42

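One-line summary

This paper uses the neural tangent hierarchy (NTH) to describe the finite-width gradient descent dynamics of deep networks, shows that the NTK changes at order 1/m (m being the network width), and proposes a truncation of the hierarchy that approximates the NTK dynamics with tunable accuracy.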

ABSTRACT

The evolution of a deep neural network trained by the gradient descent can be described by its neural tangent kernel (NTK) as introduced in [20], where it was proven that in the infinite width limit the NTK converges to an explicit limiting kernel and it stays constant during training. The NTK was also implicit in some other recent papers [6,13,14]. In the overparametrization regime, a fully-trained deep neural network is indeed equivalent to the kernel regression predictor using the limiting NTK. And the gradient descent achieves zero training loss for a deep overparameterized neural network. However, it was observed in [5] that there is a performance gap between the kernel regression using the limiting NTK and the deep neural networks. This performance gap is likely to originate from the change of the NTK along training due to the finite width effect. The change of the NTK along the training is central to describe the generalization features of deep neural networks. In the current paper, we study the dynamic of the NTK for finite width deep fully-connected neural networks. We derive an infinite hierarchy of ordinary differential equations, the neural tangent hierarchy (NTH) which captures the gradient descent dynamic of the deep neural network. Moreover, under certain conditions on the neural network width and the data set dimension, we prove that the truncated hierarchy of NTH approximates the dynamic of the NTK up to arbitrary precision. This description makes it possible to directly study the change of the NTK for deep neural networks, and sheds light on the observation that deep neural networks outperform kernel regressions using the corresponding limiting NTK.

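Motivation and objectives

Motivate and analyze the training dynamics of deep fully-connected networks under gradient flow.
Derive an infinite hierarchy (the NTH) that captures the data-dependent, width-sensitive dynamics of the NTK.
Provide a priori estimates for the higher-order kernels and show that the NTK varies at order O(1/m).
Propose a truncated NTH that, for sufficiently large width, approximates the NTK dynamics to arbitrary precision.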

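Proposed method

Formulate continuous-time gradient descent (gradient flow) on a deep fully-connected network with H hidden layers.
Define the neural tangent kernel K_t^(2)(·,·) as a sum of layer-wise kernels G_t^(l), which makes its dependence on the data explicit.
Derive the neural tangent hierarchy as an infinite system of ordinary differential equations coupling f(t) with the higher-order kernels K_t^(r), r ≥ 2.
Establish a priori bounds on the higher-order kernels K_t^(r) and prove that K_t^(2) changes at order O(1/m).
Introduce the truncated NTH by setting ∂_t K_t^(p) = 0 at a chosen level p, and analyze its approximation error.
State convergence results, including the conditions under which gradient descent drives the training loss to zero at a linear rate (that is, exponentially fast in time).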
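
For orientation, the hierarchy described in the steps above can be written schematically as follows. The notation here is an assumption made for illustration: training inputs x_α with labels y_α, n training samples, f_α(t) for the network output on x_α, and a squared loss normalized by 1/n. The exact normalization and the kernel definitions should be taken from the paper itself.

```latex
\begin{aligned}
\partial_t \bigl(f_\alpha(t) - y_\alpha\bigr)
  &= -\frac{1}{n}\sum_{\beta=1}^{n} K_t^{(2)}(x_\alpha, x_\beta)\,\bigl(f_\beta(t) - y_\beta\bigr), \\
\partial_t K_t^{(r)}(x_{\alpha_1},\dots,x_{\alpha_r})
  &= -\frac{1}{n}\sum_{\beta=1}^{n} K_t^{(r+1)}(x_{\alpha_1},\dots,x_{\alpha_r},x_\beta)\,\bigl(f_\beta(t) - y_\beta\bigr),
  \qquad r \ge 2 .
\end{aligned}
```

In this form, truncating at level p means keeping the equations for r < p and closing the system with ∂_t K_t^(p) = 0. Taking p = 2 freezes the NTK at its initial value, which corresponds to the constant-kernel regime.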

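Results

Research questions

RQ1: Does the gradient flow dynamics of a finite-width deep network admit an exact infinite hierarchy (the NTH) that describes its evolution?
RQ2: How do the higher-order NTK-like kernels K_t^(r) behave, and can they be bounded a priori?
RQ3: Is the variation of the NTK during training of a finite-width network of order 1/m, and what does this imply for generalization and training dynamics?
RQ4: Can a finite-level truncation of the NTH accurately approximate the NTK dynamics at practical widths, and how does the width affect the approximation error?
RQ5: Under what conditions does gradient descent on wide networks drive the training error to zero, and can existing results be improved?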
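
A question like RQ3 can be probed numerically on a toy problem. The sketch below is not the paper's setup: it assumes a single hidden layer, a tanh activation, standard Gaussian initialization, discrete gradient descent in place of gradient flow, and random synthetic data. All function names are hypothetical. It computes the empirical NTK before and after training and reports the relative change for several widths m.

```python
import numpy as np


def init_params(m, d, rng):
    """Standard Gaussian initialization for a one-hidden-layer network."""
    return rng.standard_normal((m, d)), rng.standard_normal(m)


def empirical_ntk(W, a, X):
    """Gram matrix of parameter gradients of f(x) = a . tanh(W x) / sqrt(m)."""
    m = W.shape[0]
    act = np.tanh(X @ W.T)            # (n, m) hidden activations
    dact = 1.0 - act**2               # tanh'(W x)
    k_out = act @ act.T / m           # gradients w.r.t. the output weights a
    k_hid = ((dact * a**2) @ dact.T) * (X @ X.T) / m  # gradients w.r.t. W
    return k_out + k_hid


def train(W, a, X, y, lr, steps):
    """Full-batch gradient descent on the loss (1/2n) * sum (f - y)^2."""
    n, m = X.shape[0], W.shape[0]
    for _ in range(steps):
        act = np.tanh(X @ W.T)
        dact = 1.0 - act**2
        r = act @ a / np.sqrt(m) - y  # residuals f - y
        grad_a = act.T @ r / (n * np.sqrt(m))
        grad_W = ((dact * r[:, None]).T @ X) * a[:, None] / (n * np.sqrt(m))
        a = a - lr * grad_a
        W = W - lr * grad_W
    return W, a


def relative_ntk_change(m, X, y, lr=0.1, steps=200, seed=0):
    """Relative Frobenius-norm change of the empirical NTK over training."""
    rng = np.random.default_rng(seed)
    W0, a0 = init_params(m, X.shape[1], rng)
    K0 = empirical_ntk(W0, a0, X)
    W1, a1 = train(W0, a0, X, y, lr, steps)
    K1 = empirical_ntk(W1, a1, X)
    return np.linalg.norm(K1 - K0) / np.linalg.norm(K0)


if __name__ == "__main__":
    data_rng = np.random.default_rng(123)
    X = data_rng.standard_normal((8, 4))
    X /= np.linalg.norm(X, axis=1, keepdims=True)  # unit-norm inputs
    y = data_rng.standard_normal(8)
    for m in (64, 256, 1024, 4096):
        print(m, relative_ntk_change(m, X, y))
```

A single seed on a toy problem only hints at the trend. Averaging over seeds and fitting the slope on a log-log plot would be needed before comparing against a 1/m rate.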

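Main findings

The gradient descent dynamics of a deep network can be described by an infinite neural tangent hierarchy (NTH).
The higher-order kernels admit a priori estimates, and under the stated assumptions the NTK varies at order O(1/m).
The truncated NTH approximates the NTK dynamics with controllable error, and the approximation improves as the width grows.
When m ≳ n^3, the truncated hierarchy closely tracks the NTK dynamics over the specified time range, with error terms that shrink as m increases.
Under a positive minimum-eigenvalue condition on K_0^(2), gradient flow on a sufficiently wide network achieves exponential decay of the training error (a linear convergence rate).
Wider networks (larger m) extend the time window over which the approximation is valid and reduce the truncation error, suggesting width and depth advantages for NTK evolution and training performance.
As a corollary, the width condition required for the convergence guarantee improves on related earlier results, from a fourth-power to a third-power requirement.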
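
The exponential-decay finding is easiest to see in the simplest truncation, p = 2, where the kernel is frozen at K_0^(2) and the residual dynamics become linear. The sketch below assumes the 1/n-normalized gradient flow ∂_t r = -(1/n) K_0 r for the residual r = f - y, and it uses a random positive definite matrix as a stand-in for K_0^(2) rather than a kernel computed from a network. The function names are hypothetical.

```python
import numpy as np


def frozen_kernel_residual(K0, r0, t):
    """Solve d/dt r = -(1/n) K0 r exactly through the eigendecomposition of K0."""
    n = K0.shape[0]
    evals, evecs = np.linalg.eigh(K0)  # K0 is assumed symmetric
    return evecs @ (np.exp(-evals * t / n) * (evecs.T @ r0))


def decay_bound(K0, r0, t):
    """Upper bound exp(-lambda_min * t / n) * ||r0|| on the residual norm."""
    n = K0.shape[0]
    lam_min = np.linalg.eigvalsh(K0)[0]  # eigenvalues come back in ascending order
    return np.exp(-lam_min * t / n) * np.linalg.norm(r0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((5, 7))
    K0 = A @ A.T / 7 + 0.1 * np.eye(5)  # stand-in kernel with lambda_min >= 0.1
    r0 = rng.standard_normal(5)         # stand-in initial residual f(0) - y
    for t in (0.0, 1.0, 5.0, 25.0):
        rt = frozen_kernel_residual(K0, r0, t)
        print(t, np.linalg.norm(rt), decay_bound(K0, r0, t))
```

The bound holds because every eigenvalue of K0 is at least the smallest one, so every eigencomponent of the residual decays at least as fast as exp(-λ_min t / n). With a finite-width network the kernel is not frozen, and the findings above concern how far the true dynamics can drift from this picture.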

This review was generated by AI and checked by a human editor.