QUICK REVIEW

[Paper Review] Layer Normalization

Jimmy Ba, Jamie Kiros|arXiv (Cornell University)|Jul 21, 2016

Neural Networks and Applications | References: 22 | Citations: 498

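One-line summary

Introduces layer normalization, which stabilizes training across a range of neural networks, including RNNs, and speeds up training by normalizing the summed inputs within each layer rather than across the mini-batch.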

ABSTRACT

Training state-of-the-art, deep neural networks is computationally expensive. One way to reduce the training time is to normalize the activities of the neurons. A recently introduced technique called batch normalization uses the distribution of the summed input to a neuron over a mini-batch of training cases to compute a mean and variance which are then used to normalize the summed input to that neuron on each training case. This significantly reduces the training time in feed-forward neural networks. However, the effect of batch normalization is dependent on the mini-batch size and it is not obvious how to apply it to recurrent neural networks. In this paper, we transpose batch normalization into layer normalization by computing the mean and variance used for normalization from all of the summed inputs to the neurons in a layer on a single training case. Like batch normalization, we also give each neuron its own adaptive bias and gain which are applied after the normalization but before the non-linearity. Unlike batch normalization, layer normalization performs exactly the same computation at training and test times. It is also straightforward to apply to recurrent neural networks by computing the normalization statistics separately at each time step. Layer normalization is very effective at stabilizing the hidden state dynamics in recurrent networks. Empirically, we show that layer normalization can substantially reduce the training time compared with previously published techniques.
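
The "transpose" described in the abstract can be illustrated with a small sketch. It is written in plain Python for readability and is not the authors' code; the function names and the toy input matrix are assumptions made for this example. Rows are training cases and columns are neurons in one layer.

```python
def mean_var(values):
    """Return the mean and (biased) variance of a list of floats."""
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / len(values)
    return mu, var

def batch_norm_stats(a):
    """One (mean, var) pair per neuron, computed over the mini-batch (columns)."""
    return [mean_var([row[j] for row in a]) for j in range(len(a[0]))]

def layer_norm_stats(a):
    """One (mean, var) pair per training case, computed over the layer (rows)."""
    return [mean_var(row) for row in a]

# Hypothetical summed inputs: 2 training cases (rows) x 3 neurons (columns).
a = [[1.0, 2.0, 3.0],
     [4.0, 6.0, 8.0]]

bn = batch_norm_stats(a)  # 3 pairs; each depends on every case in the batch
ln = layer_norm_stats(a)  # 2 pairs; each depends on a single case only

# Layer statistics for a case do not change when the rest of the batch is removed.
ln_alone = layer_norm_stats([a[0]])
```

Because the layer statistics depend only on the case itself, the same computation can be used at training and test time, which is the property the abstract highlights.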

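Motivation and objectives

Motivate reducing the training time of deep networks through normalization.
Propose layer normalization as an alternative to batch normalization that works online and with RNNs.
Analyze invariance properties and learning dynamics under normalization.
Empirically validate layer normalization across multiple tasks and architectures.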

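Proposed method

Compute a per-layer mean and variance across all hidden units in the layer for normalization.
Apply an adaptive gain and bias after normalization but before the non-linearity.
For RNNs, normalize at each time step using the statistics of the current layer (Eq. 4).
Compare invariance properties with batch normalization and weight normalization (Section 5).
Provide a theoretical analysis using Fisher information and discuss the implicit learning-rate effect.
Evaluate empirically on image-sentence ranking, QA, language modeling, skip-thoughts, handwriting, MNIST, CNNs, and others.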
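
The first three steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the small `eps` term for numerical stability, the choice of `tanh` as the non-linearity, and all function and variable names (`layer_norm`, `ln_rnn_step`, `w_hh`, `w_xh`) are introduced here for the example.

```python
import math

def layer_norm(a, gain, bias, eps=1e-5):
    """Normalize one case's summed inputs `a` across the hidden units,
    then apply a per-unit gain and bias (eps is an assumed stabilizer)."""
    h = len(a)
    mu = sum(a) / h
    sigma = math.sqrt(sum((x - mu) ** 2 for x in a) / h + eps)
    return [g * (x - mu) / sigma + b for x, g, b in zip(a, gain, bias)]

def matvec(w, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in w]

def ln_rnn_step(h_prev, x, w_hh, w_xh, gain, bias):
    """One recurrent step: summed inputs a_t = W_hh h_{t-1} + W_xh x_t,
    layer-normalized with statistics of this step only, then tanh (assumed)."""
    a = [p + q for p, q in zip(matvec(w_hh, h_prev), matvec(w_xh, x))]
    return [math.tanh(v) for v in layer_norm(a, gain, bias)]

# Hypothetical toy setup: 3 hidden units, 2 input features.
gain = [1.0, 1.0, 1.0]
bias = [0.0, 0.0, 0.0]
w_hh = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5], [-0.4, 0.1, 0.9]]
w_xh = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]

h = [0.0, 0.0, 0.0]
for x in [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]:
    h = ln_rnn_step(h, x, w_hh, w_xh, gain, bias)
```

The same gain and bias are reused at every time step, while the mean and standard deviation are recomputed from the current step's summed inputs, so no per-time-step statistics need to be stored.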

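Experimental results

Research questions

RQ1: Does layer normalization improve training speed and generalization across diverse architectures (RNNs, CNNs, DRAW) and tasks?
RQ2: How do the invariance properties and learning dynamics under layer normalization compare with those of batch normalization and weight normalization?
RQ3: Does layer normalization enable online learning and training on long sequences for RNNs without using time-step-specific statistics?
RQ4: In practice, what is the empirical impact of layer normalization on long sequences and small mini-batches?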

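Key findings

Layer normalization speeds up training and improves generalization, particularly for recurrent networks and long sequences.
Layer normalization is invariant to per-training-case feature shifting and scaling, and does not depend on the mini-batch size.
It stabilizes hidden-state dynamics by re-centering and re-scaling the summed inputs within a layer (Eqs. 3 and 4).
It shows faster convergence and better validation performance on tasks including image-sentence ranking, QA, skip-thoughts, DRAW, handwriting, and permutation-invariant MNIST.
In recurrent models, layer normalization is less sensitive to the initial gain scale than recurrent batch normalization.
In CNNs, layer normalization speeds up training relative to the baseline, but batch normalization may still outperform it in some settings.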
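
One way to read the invariance finding is that shifting and positively scaling all summed inputs of a single training case leaves the normalized values unchanged. The sketch below checks this numerically; the input values, the scale 5.0, the shift 3.0, and the function name `normalize` are assumptions chosen for illustration, and the gain and bias are omitted (unit gain, zero bias).

```python
import math

def normalize(a):
    """Layer-normalize one case's summed inputs (unit gain, zero bias)."""
    mu = sum(a) / len(a)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in a) / len(a))
    return [(x - mu) / sigma for x in a]

a = [0.2, -1.3, 0.7, 2.4]                     # hypothetical summed inputs
shifted_scaled = [5.0 * x + 3.0 for x in a]   # same case, scaled and shifted

base = normalize(a)
other = normalize(shifted_scaled)

# Largest element-wise difference; expected to be at floating-point noise level.
max_diff = max(abs(p - q) for p, q in zip(base, other))
```

The function takes a single training case, so its output cannot depend on the mini-batch size. A negative scale factor would flip the sign of the normalized values, so the check applies to positive scaling only.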

This review was generated by AI and checked by a human editor.