QUICK REVIEW

[Paper Review] Opening the Black Box of Deep Neural Networks via Information

Ravid Shwartz-Ziv, Naftali Tishby | arXiv (Cornell University) | Mar 2, 2017

Neural Networks and Applications | References: 17 | Citations: 800

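One-Line Summary

This paper visualizes DNNs in the Information Plane to reveal the dynamics of SGD, showing two training phases (ERM and compression), convergence of the layers to the IB bound, and a marked computational benefit from additional hidden layers.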

ABSTRACT

Despite their great success, there is still no comprehensive theoretical understanding of learning with Deep Neural Networks (DNNs) or their inner organization. Previous work proposed to analyze DNNs in the Information Plane; i.e., the plane of the Mutual Information values that each layer preserves on the input and output variables. They suggested that the goal of the network is to optimize the Information Bottleneck (IB) tradeoff between compression and prediction, successively, for each layer. In this work we follow up on this idea and demonstrate the effectiveness of the Information-Plane visualization of DNNs. Our main results are: (i) most of the training epochs in standard DL are spent on compression of the input to efficient representation and not on fitting the training labels. (ii) The representation compression phase begins when the training error becomes small and the Stochastic Gradient Descent (SGD) epochs change from a fast drift to smaller training error into a stochastic relaxation, or random diffusion, constrained by the training error value. (iii) The converged layers lie on or very close to the Information Bottleneck (IB) theoretical bound, and the maps from the input to any hidden layer and from this hidden layer to the output satisfy the IB self-consistent equations. This generalization-through-noise mechanism is unique to Deep Neural Networks and absent in one layer networks. (iv) The training time is dramatically reduced when adding more hidden layers. Thus the main advantage of the hidden layers is computational. This can be explained by the reduced relaxation time, as it scales super-linearly (exponentially for simple diffusion) with the information compression from the previous layer.

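Motivation and Objectives

Advance the understanding of the learning dynamics of deep networks beyond accuracy metrics.
Examine representations through their mutual information with the input and the output, and identify how the layers compress information.
Show that the learned representations converge to the Information Bottleneck (IB) bound across the layers.
Evaluate the computational benefit and role of hidden layers in accelerating training.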

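Proposed Method

Treat each layer as a single random variable T, with P(T|X) as the encoder and P(Y|T) as the decoder.
Plot and analyze the mutual information I(X;T) and I(T;Y) for each layer to form the Information Plane.
Train fully connected networks with stochastic gradient descent (SGD) on a cross-entropy loss and examine the training phases.
Characterize two SGD-driven phases: an empirical error minimization (ERM) phase and a representation compression (diffusion) phase.
Compare the converged layers against the IB self-consistent equations and verify IB optimality through the encoder-decoder relations.
Examine the computational effect of adding hidden layers on convergence speed and diffusion dynamics.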
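
As a rough illustration of how one point in the Information Plane could be computed, the sketch below discretizes a layer's activations and estimates I(X;T) and I(T;Y) from empirical counts. This review does not state which estimator was used, so the equal-width binning, the default of 30 bins over [-1, 1] (a range suited to tanh-like units), and all function names are assumptions made for this example.

```python
import numpy as np

def entropy_bits(ids):
    """Shannon entropy in bits of a 1-D array of discrete symbols."""
    _, counts = np.unique(ids, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def mutual_information(a, b):
    """Plug-in estimate I(A;B) = H(A) + H(B) - H(A,B) for paired symbols."""
    # Re-index both variables as 0..K-1 so a joint symbol can be one integer.
    a = np.unique(a, return_inverse=True)[1].ravel()
    b = np.unique(b, return_inverse=True)[1].ravel()
    joint = a * (b.max() + 1) + b
    return entropy_bits(a) + entropy_bits(b) - entropy_bits(joint)

def discretize_layer(activations, n_bins=30, low=-1.0, high=1.0):
    """Map each activation vector (one row per sample) to one integer id.

    Assumption: equal-width bins over [low, high]; values outside the
    range fall into the first or last bin.
    """
    edges = np.linspace(low, high, n_bins + 1)
    binned = np.digitize(activations, edges[1:-1])  # per-unit bin index
    _, ids = np.unique(binned, axis=0, return_inverse=True)
    return ids.ravel()

def information_plane_point(x_ids, y, activations, n_bins=30):
    """Return the pair (I(X;T), I(T;Y)) in bits for one layer."""
    t_ids = discretize_layer(activations, n_bins)
    return mutual_information(x_ids, t_ids), mutual_information(t_ids, y)

# Toy data: 8 equally likely inputs, label = parity of the input index.
x_ids = np.arange(8)
y = x_ids % 2
# A layer that keeps every input distinct vs. one that keeps only the label.
full_layer = np.linspace(-0.9, 0.9, 8).reshape(-1, 1)
compressed_layer = np.where(y == 1, 0.5, -0.5).reshape(-1, 1)

print(information_plane_point(x_ids, y, full_layer))
print(information_plane_point(x_ids, y, compressed_layer))
```

On this toy data the first layer retains all 3 bits about the input while the second retains only 1 bit, yet both carry the full 1 bit about the label; this is the kind of movement toward lower I(X;T) at fixed I(T;Y) that the compression phase describes. A plug-in estimate of this kind is sensitive to the bin count and sample size, so it should be read as a sketch rather than a faithful reproduction of the paper's procedure.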

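Experimental Results

Research Questions

RQ1: Do the layers of a DNN follow predictable trajectories in the Information Plane during training?
RQ2: How do the SGD dynamics separate into ERM and compression phases, and what drives them?
RQ3: Do the converged layers satisfy the Information Bottleneck self-consistent equations?
RQ4: What computational benefit do additional hidden layers provide in terms of training speed and representation compression?
RQ5: How close do the layers get to the IB-optimal representation for different training-set sizes?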
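
For reference, the IB trade-off and the self-consistent equations that RQ3 refers to are usually written as below. This is the standard form from the Information Bottleneck literature; the symbols β (the trade-off parameter) and Z(x; β) (the normalization) are not defined in this review and are introduced here as notation.

```latex
% IB objective: compress X into T while keeping information about Y
\min_{p(t \mid x)} \; I(X;T) - \beta \, I(T;Y)

% Self-consistent equations satisfied by the optimal encoder and decoder
p(t \mid x) = \frac{p(t)}{Z(x;\beta)}
              \exp\!\Big( -\beta \, D_{\mathrm{KL}}\big[\, p(y \mid x) \,\big\|\, p(y \mid t) \,\big] \Big)

p(t) = \sum_{x} p(t \mid x) \, p(x)

p(y \mid t) = \sum_{x} p(y \mid x) \, p(x \mid t)
```

Checking a converged layer against these equations amounts to asking whether its empirical encoder P(T|X) and decoder P(Y|T) are related through the KL-divergence term for some value of β.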

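Key Findings

Training proceeds in two phases: an initial ERM phase that increases the information about the labels, followed by a long compression phase that decreases the information about the input.
The converged layers lie on or close to the Information Bottleneck bound and satisfy its self-consistent equations.
Hidden layers enable faster compression and dramatically reduce the number of training epochs needed to reach good generalization, which amounts to a computational benefit.
Compression during SGD behaves like diffusion: the weight updates approach a Wiener process constrained by the training error, which leads to entropy maximization.
The final representations are highly stochastic and vary across networks; many different networks achieve near-optimal performance.
The layers tend to converge to points near the critical region of the IB curve, consistent with critical slowing down near a phase transition.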
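
One way to make the drift-versus-diffusion distinction concrete is to compare the mean of a layer's mini-batch gradients with their fluctuation over an epoch. The sketch below is an assumed diagnostic written for this review, not a procedure the review itself describes; the function names, the threshold of 1.0, and the synthetic gradients are all hypothetical.

```python
import numpy as np

def gradient_snr(batch_grads):
    """Ratio ||mean gradient|| / ||std of gradient|| for one layer.

    batch_grads has shape (n_batches, n_weights): one flattened gradient
    per mini-batch, collected within a single epoch.
    """
    g = np.asarray(batch_grads, dtype=float)
    mean_norm = np.linalg.norm(g.mean(axis=0))
    std_norm = np.linalg.norm(g.std(axis=0))
    if std_norm == 0.0:
        return float("inf")  # no fluctuation at all: pure drift
    return float(mean_norm / std_norm)

def phase_label(snr, threshold=1.0):
    """Label an epoch as 'drift' (mean dominates) or 'diffusion' (noise dominates).

    The threshold is an assumed cut-off chosen for illustration only.
    """
    return "drift" if snr > threshold else "diffusion"

# Synthetic gradients standing in for two stages of training.
rng = np.random.default_rng(0)
early = 1.0 + 0.1 * rng.standard_normal((200, 50))   # large mean, small noise
late = 0.01 + 1.0 * rng.standard_normal((200, 50))   # tiny mean, large noise

for name, grads in [("early", early), ("late", late)]:
    snr = gradient_snr(grads)
    print(name, round(snr, 3), phase_label(snr))
```

With real training runs, the per-batch gradients would come from the training loop instead of a random generator, and the epoch at which the ratio falls below the threshold would mark a candidate onset of the compression phase.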

This review was generated by AI and checked by a human editor.