QUICK REVIEW

[Paper Review] On Layer Normalization in the Transformer Architecture

Ruibin Xiong, Yunchang Yang | arXiv (Cornell University) | Feb 12, 2020

Power Transformer Diagnostics and Insulation | Cited by 109

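One-Line Summary

This paper analyzes how the placement of layer normalization affects Transformer optimization. It shows that Pre-LN allows training without a warm-up stage and converges faster, whereas Post-LN relies on warm-up for stability.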

ABSTRACT

The Transformer is widely used in natural language processing tasks. To train a Transformer however, one usually needs a carefully designed learning rate warm-up stage, which is shown to be crucial to the final performance but will slow down the optimization and bring more hyper-parameter tunings. In this paper, we first study theoretically why the learning rate warm-up stage is essential and show that the location of layer normalization matters. Specifically, we prove with mean field theory that at initialization, for the original-designed Post-LN Transformer, which places the layer normalization between the residual blocks, the expected gradients of the parameters near the output layer are large. Therefore, using a large learning rate on those gradients makes the training unstable. The warm-up stage is practically helpful for avoiding this problem. On the other hand, our theory also shows that if the layer normalization is put inside the residual blocks (recently proposed as Pre-LN Transformer), the gradients are well-behaved at initialization. This motivates us to remove the warm-up stage for the training of Pre-LN Transformers. We show in our experiments that Pre-LN Transformers without the warm-up stage can reach comparable results with baselines while requiring significantly less training time and hyper-parameter tuning on a wide range of applications.
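
To make the two placements concrete, here is a minimal sketch in plain Python. It assumes a layer normalization without learned gain or bias, and `toy_sublayer` is a hypothetical stand-in for the attention or feed-forward sublayer; neither is taken from the paper's code. The paper's Pre-LN variant also applies a final layer normalization before the prediction, which this sketch leaves out.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def post_ln_block(x, sublayer):
    """Post-LN: add the residual first, then normalize the sum."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])


def pre_ln_block(x, sublayer):
    """Pre-LN: normalize only the sublayer input; the residual path is untouched."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]


def toy_sublayer(x):
    """Hypothetical stand-in for attention or FFN: scales each entry by 0.5."""
    return [0.5 * v for v in x]


x = [1.0, 2.0, 3.0, 4.0]
post_out = post_ln_block(x, toy_sublayer)  # renormalized output
pre_out = pre_ln_block(x, toy_sublayer)    # x plus a normalized update
```

In the Post-LN block the output is renormalized at every layer, whereas in the Pre-LN block the input flows through the residual path unchanged and only the sublayer sees normalized values.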

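Motivation and Objectives

Explain why learning-rate warm-up matters for Post-LN Transformers and how the placement of layer normalization affects gradient behavior.
Use mean field theory to analyze the gradient scale at initialization for the Post-LN and Pre-LN variants.
Verify empirically whether warm-up can be removed with Pre-LN, measuring training speed and performance across NLP tasks.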

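Proposed Method

A mean field theory analysis of the gradient scale at initialization for Post-LN and Pre-LN Transformers.
A theoretical analysis of the gradient norm of the last FFN layer and how it depends on the depth L.
Empirical experiments on IWSLT14 De-En, WMT14 En-De, and BERT pre-training that compare settings with and without warm-up.
Controlled initialization for the analysis: single-head attention, Xavier initialization, Q/K set to zero in the attention, and Gaussian inputs.
A comparison of Post-LN and Pre-LN architectures, with and without warm-up, using Adam and SGD/RAdam variants.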
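
The paper's analysis concerns gradients. As a simpler forward-pass proxy, the sketch below tracks the hidden-state norm at initialization under a similar controlled setup (Gaussian input, Xavier-scaled weights). It is an illustration under stated assumptions, not the paper's derivation: a single random linear map replaces both attention and FFN, and `hidden_norms`, the width 64, and the depth 16 are hypothetical choices.

```python
import math
import random


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def xavier_matrix(d, rng):
    """d x d matrix with N(0, 1/d) entries (Xavier scale for a square matrix)."""
    std = math.sqrt(1.0 / d)
    return [[rng.gauss(0.0, std) for _ in range(d)] for _ in range(d)]


def matvec(w, x):
    """Plain matrix-vector product."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def hidden_norms(d, depth, pre_ln, seed=0):
    """Return the L2 norm of the hidden state after each toy residual layer."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(d)]  # Gaussian input
    norms = []
    for _ in range(depth):
        w = xavier_matrix(d, rng)  # toy sublayer: one random linear map
        if pre_ln:
            x = [a + b for a, b in zip(x, matvec(w, layer_norm(x)))]
        else:
            x = layer_norm([a + b for a, b in zip(x, matvec(w, x))])
        norms.append(math.sqrt(sum(v * v for v in x)))
    return norms


post = hidden_norms(64, 16, pre_ln=False)
pre = hidden_norms(64, 16, pre_ln=True)
print("Post-LN final norm:", round(post[-1], 2))
print("Pre-LN final norm: ", round(pre[-1], 2))
```

In this toy setting the Post-LN hidden state is renormalized to roughly the same norm at every layer, while the Pre-LN hidden state accumulates residual updates and grows with depth. This only illustrates that the two placements scale differently at initialization; it does not reproduce the paper's gradient bounds.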

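Experimental Results

Research Questions

RQ1: Given how gradients behave at initialization, is the learning-rate warm-up stage unnecessary for Pre-LN Transformers?
RQ2: How does the placement of layer normalization affect the gradient scale and training stability of Transformers?
RQ3: Can a Pre-LN Transformer without warm-up match the Post-LN baseline, or converge faster, in final performance on translation and pre-training tasks?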

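Key Findings

Post-LN Transformers show large gradients near the output layer at initialization, so a large learning rate without warm-up makes training unstable.
Pre-LN Transformers have well-behaved gradients at initialization, which makes it possible to remove the warm-up stage.
Across IWSLT14 De-En, WMT14 En-De, and BERT pre-training, Pre-LN without warm-up matches or exceeds Post-LN with warm-up in speed and final performance.
Under the same lr_max setting, Pre-LN training converges faster than Post-LN, reducing hyper-parameter sensitivity and training time.
Removing warm-up brings notable speed gains, namely faster convergence and fewer rounds of hyper-parameter tuning, while keeping results competitive.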
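
The difference between the two training setups can be sketched as a learning-rate schedule. This minimal example assumes a linear ramp from zero to lr_max over the warm-up steps and, as a simplification, holds the rate constant afterwards; `lr_linear_warmup`, `LR_MAX`, and `WARMUP_STEPS` are hypothetical names and values, not settings reported in the paper.

```python
def lr_linear_warmup(step, lr_max, warmup_steps):
    """Learning rate at `step`: linear ramp to lr_max, then held constant.

    Holding the rate constant after warm-up is a simplifying assumption;
    practical schedules usually decay it afterwards.
    """
    if warmup_steps <= 0:
        return lr_max  # no warm-up: start directly at lr_max
    return lr_max * min(1.0, step / warmup_steps)


LR_MAX = 1e-3        # hypothetical value, for illustration only
WARMUP_STEPS = 4000  # hypothetical value, for illustration only

steps = (0, 1000, 4000, 8000)
with_warmup = [lr_linear_warmup(t, LR_MAX, WARMUP_STEPS) for t in steps]
no_warmup = [lr_linear_warmup(t, LR_MAX, 0) for t in steps]
```

With warm-up, the early steps use only a fraction of lr_max and the warm-up length becomes one more hyper-parameter to tune. Without it, training starts at lr_max from the first step, which is the setting the findings above report as workable for Pre-LN.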


This review was generated by AI and checked by a human editor.