QUICK REVIEW

[Paper Review] Understanding and Improving Layer Normalization

Jingjing Xu, Xu Sun|arXiv (Cornell University)|Nov 16, 2019

Natural Language Processing Techniques|Cited by 175

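One-sentence summary

The paper shows that LayerNorm's effectiveness is driven more by the derivatives of the mean and variance than by forward normalization. It introduces AdaNorm, which replaces the bias and gain with an adaptive transformation and improves performance on most tasks.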

ABSTRACT

Layer normalization (LayerNorm) is a technique to normalize the distributions of intermediate layers. It enables smoother gradients, faster training, and better generalization accuracy. However, it is still unclear where the effectiveness stems from. In this paper, our main contribution is to take a step further in understanding LayerNorm. Many of previous studies believe that the success of LayerNorm comes from forward normalization. Unlike them, we find that the derivatives of the mean and variance are more important than forward normalization by re-centering and re-scaling backward gradients. Furthermore, we find that the parameters of LayerNorm, including the bias and gain, increase the risk of over-fitting and do not work in most cases. Experiments show that a simple version of LayerNorm (LayerNorm-simple) without the bias and gain outperforms LayerNorm on four datasets. It obtains the state-of-the-art performance on En-Vi machine translation. To address the over-fitting problem, we propose a new normalization method, Adaptive Normalization (AdaNorm), by replacing the bias and gain with a new transformation function. Experiments show that AdaNorm demonstrates better results than LayerNorm on seven out of eight datasets.

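Motivation and objectives

Investigate what makes LayerNorm effective beyond forward input normalization.
Evaluate the role of the bias and gain in LayerNorm and their effect on over-fitting.
Analyze how the derivatives of the mean and variance affect the gradients.
Propose AdaNorm, which replaces the bias and gain with an adaptive transformation, and evaluate its performance.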

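Proposed method

Reproduce LayerNorm variants across multiple tasks and compare them with a baseline without normalization.
Introduce DetachNorm, which detaches the derivatives of the mean and variance, to separate the forward effect of normalization from its backward effect.
Theoretically analyze how the derivatives of the mean and variance re-center and re-scale the gradients in LayerNorm (Theorem 1).
Propose AdaNorm, which applies a transformation phi(y) to the normalized features and guarantees differentiability and a bounded average output (Theorem 2).
Empirically compare LayerNorm, LayerNorm-simple, DetachNorm, and AdaNorm on eight datasets covering machine translation, language modeling, classification, parsing, and OCR.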
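
The difference between letting gradients flow through the mean and variance and detaching them can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`layernorm_simple`, `backward_full`, `backward_detached`), the `eps` default, and the sample vectors are all made up here, and the backward formulas are the standard derivatives of y = (x - mean) / sigma for a single vector.

```python
import math

def layernorm_simple(x, eps=1e-5):
    """Forward pass without bias or gain: zero mean, unit variance."""
    h = len(x)
    mu = sum(x) / h
    var = sum((v - mu) ** 2 for v in x) / h
    sigma = math.sqrt(var + eps)
    y = [(v - mu) / sigma for v in x]
    return y, sigma

def backward_full(g, y, sigma):
    """dL/dx when gradients flow through the mean and the variance.

    g is the upstream gradient dL/dy. Subtracting mean(g) re-centers the
    gradient; subtracting y * mean(g * y) removes its component along y.
    """
    h = len(g)
    g_mean = sum(g) / h
    gy_mean = sum(gi * yi for gi, yi in zip(g, y)) / h
    return [(gi - g_mean - yi * gy_mean) / sigma for gi, yi in zip(g, y)]

def backward_detached(g, sigma):
    """dL/dx when the mean and variance are treated as constants
    (the DetachNorm idea): only the 1/sigma scaling remains."""
    return [gi / sigma for gi in g]

x = [1.0, 2.0, 4.0, 7.0]   # illustrative input
g = [0.5, -1.0, 2.0, 0.3]  # illustrative upstream gradient
y, sigma = layernorm_simple(x)
full = backward_full(g, y, sigma)
detached = backward_detached(g, sigma)
print(sum(full))      # close to 0: the gradient is re-centered
print(sum(detached))  # generally nonzero: no re-centering
```

Both variants share the same forward output `y`, so any difference in training behavior between them comes from the backward pass alone, which is the comparison the method list describes.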

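Experimental results

Research questions

RQ1: Is LayerNorm's effectiveness driven by forward normalization or by the gradient effects of the mean and variance?
RQ2: Do the bias and gain contribute positively, or do they cause over-fitting on many tasks?
RQ3: Can gradient normalization (through the derivatives of the mean and variance) explain LayerNorm's training behavior and performance?
RQ4: Does adaptive normalization (AdaNorm) outperform LayerNorm by replacing the fixed parameters with input-dependent scaling?

Key findings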

Model	En-De (BLEU)	De-En (BLEU)	En-Vi (BLEU)	Enwiki8 (bits/char)	RT (ACC)	SST5 (ACC)	MNIST (ACC)	PTB (UAS)
w/o Norm	Diverge	34.0	28.4	1.04	76.85	38.55	99.14	88.31
LayerNorm	28.3	35.5	31.2	1.07	77.21	39.23	99.13	89.12
LayerNorm-simple	28.4	35.5	31.6	1.07	76.66	40.54	99.09	89.19
AdaNorm	28.5	35.6	31.4	1.07	77.50	40.54	99.35	89.23
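
The "four datasets" and "seven out of eight datasets" tallies from the abstract can be checked against the table with a short script. This is a small sketch: the names `DATASETS`, `SCORES`, and `count_wins` are made up for illustration, ties are counted as not winning (an assumption), and Enwiki8 is treated as lower-is-better because it is reported in bits per character.

```python
# Scores copied from the table above; column order follows the header row.
DATASETS = ["En-De", "De-En", "En-Vi", "Enwiki8", "RT", "SST5", "MNIST", "PTB"]
LOWER_IS_BETTER = {"Enwiki8"}  # bits per character

SCORES = {
    "LayerNorm":        [28.3, 35.5, 31.2, 1.07, 77.21, 39.23, 99.13, 89.12],
    "LayerNorm-simple": [28.4, 35.5, 31.6, 1.07, 76.66, 40.54, 99.09, 89.19],
    "AdaNorm":          [28.5, 35.6, 31.4, 1.07, 77.50, 40.54, 99.35, 89.23],
}

def count_wins(model, baseline="LayerNorm"):
    """Count datasets where `model` is strictly better than `baseline`."""
    wins = 0
    for name, a, b in zip(DATASETS, SCORES[model], SCORES[baseline]):
        better = a < b if name in LOWER_IS_BETTER else a > b
        wins += int(better)
    return wins

print(count_wins("AdaNorm"))           # 7
print(count_wins("LayerNorm-simple"))  # 4
```

The one dataset where AdaNorm does not win is Enwiki8, where all three normalized models tie at 1.07.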

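Forward normalization explains only a limited part of LayerNorm's success; the derivatives of the mean and variance matter more.
DetachNorm, which detaches the derivatives of the mean and variance, degrades performance, indicating that gradient normalization aids optimization, especially in deeper models.
LayerNorm-simple (no bias or gain) matches or outperforms LayerNorm on several datasets and achieves state-of-the-art results on En-Vi machine translation.
The bias and gain tend to encourage over-fitting and are often ineffective across tasks.
AdaNorm replaces the bias and gain with an adaptive phi(y), outperforms LayerNorm on seven of eight datasets, and generalizes better.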
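
To make "replacing the bias and gain with phi(y)" concrete, here is a rough sketch of an AdaNorm-style forward pass. The review only says that phi(y) is applied to the normalized features; the linear form `c * (1 - k * y)`, the values of `c` and `k`, and every function name below are assumptions chosen for illustration, not a statement of the paper's exact definition.

```python
import math

def normalize(x, eps=1e-6):
    """Zero-mean, unit-variance normalization of one vector."""
    h = len(x)
    mu = sum(x) / h
    var = sum((v - mu) ** 2 for v in x) / h
    sigma = math.sqrt(var + eps)
    return [(v - mu) / sigma for v in x]

def make_linear_phi(c=1.0, k=0.1):
    """Hypothetical phi: an input-dependent scale that shrinks as y grows.
    The linear form and the defaults for c and k are assumptions."""
    def phi(y):
        return c * (1.0 - k * y)
    return phi

def adanorm_like(x, phi):
    """Scale each normalized feature by phi of itself, with no learned
    bias or gain involved."""
    y = normalize(x)
    return [phi(yi) * yi for yi in y]

x = [1.0, 2.0, 4.0, 7.0]  # illustrative input
z = adanorm_like(x, make_linear_phi(c=1.0, k=0.1))
print(z)
print(sum(z) / len(z))  # mean output; about -c * k for this assumed phi
```

With this assumed phi the scale depends on the input itself rather than on trained parameters, which is the contrast with LayerNorm that the findings draw. Passing `lambda y: 1.0` as `phi` reduces the sketch to plain normalization without bias or gain.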

This review was generated by AI and checked by a human editor.