QUICK REVIEW

[Paper Review] NormFormer: Improved Transformer Pretraining with Extra Normalization

Sam Shleifer, Jason Weston|arXiv (Cornell University)|Oct 18, 2021
Topic Modeling | References: 34 | Citations: 28
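ONE-SENTENCE SUMMARY

NormFormer adds three lightweight normalization-based operations to each transformer layer of a Pre-LN model, reducing the gradient magnitude mismatch and speeding up pretraining while improving perplexity and downstream task performance for both causal and masked language models.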

ABSTRACT

During pretraining, the Pre-LayerNorm transformer suffers from a gradient magnitude mismatch: gradients at early layers are much larger than at later layers. These issues can be alleviated by our proposed NormFormer architecture, which adds three normalization operations to each layer: a Layer Norm after self attention, head-wise scaling of self-attention outputs, and a Layer Norm after the first fully connected layer. The extra operations incur negligible compute cost (+0.4% parameter increase), but improve pretraining perplexity and downstream task performance for both causal and masked language models ranging from 125 Million to 2.7 Billion parameters. For example, adding NormFormer on top of our strongest 1.3B parameter baseline can reach equal perplexity 24% faster, or converge 0.27 perplexity better in the same compute budget. This model reaches GPT3-Large (1.3B) zero shot performance 60% faster. For masked language modeling, NormFormer improves fine-tuned GLUE performance by 1.9% on average. Code to train NormFormer models is available in fairseq https://github.com/pytorch/fairseq/tree/main/examples/normformer .

MOTIVATION AND OBJECTIVES

  • Identify gradient magnitude mismatches in Pre-LN transformers during pretraining.
  • Propose lightweight normalization-based additions to stabilize and accelerate training.
  • Evaluate NormFormer on causal and masked language models across multiple scales.
  • Demonstrate improvements in pretraining perplexity and downstream task performance.
  • Provide ablations and analyses to understand the contribution of each addition.

PROPOSED METHOD

  • Introduce three additions per layer: head-wise scaling of MHA outputs (HeadScale), a LayerNorm after the attention module, and a LayerNorm after the first FFN layer.
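  • HeadScale multiplies each attention head's output by its own small learnable scalar γ; the two extra LayerNorms sit after the attention module and after the first FFN layer (not after the whole FFN block), so the additions introduce very few parameters.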
  • Optionally include residual scaling (ResScale) on the FFN path, analyzed for its impact at different scales.
  • Train causal and masked language models at 125M, 355M, 1.3B, and 2.7B parameters, comparing NormFormer against baselines under equal compute budgets.
  • Experiment with zero-shot evaluations on GPT-3-like tasks and GLUE benchmarks to assess generalization.
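
The per-layer additions listed above can be sketched in plain Python for a single token vector. This is a minimal illustration under stated assumptions, not the fairseq implementation: the helper names (`layer_norm`, `head_scale`, `attention_sublayer`, `ffn_sublayer`) are hypothetical, biases and the LayerNorm gain/bias are omitted, ReLU stands in for the activation, the extra FFN LayerNorm is assumed to come after the activation, and `attend` is a stand-in for multi-head attention that returns one output vector per head.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (gain and bias omitted)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def head_scale(head_outputs, gammas):
    """HeadScale: multiply each head's output by its own scalar, then concatenate."""
    scaled = []
    for head, gamma in zip(head_outputs, gammas):
        scaled.extend(gamma * v for v in head)
    return scaled

def attention_sublayer(x, attend, gammas, w_out):
    """Compute x + LN(W_out @ HeadScale(attend(LN(x))))."""
    heads = attend(layer_norm(x))                  # Pre-LN input normalization
    mixed = matvec(w_out, head_scale(heads, gammas))
    post = layer_norm(mixed)                       # extra LN after attention
    return [xi + pi for xi, pi in zip(x, post)]

def ffn_sublayer(x, w1, w2):
    """Compute x + W2 @ LN(relu(W1 @ LN(x)))."""
    hidden = [max(0.0, v) for v in matvec(w1, layer_norm(x))]
    hidden = layer_norm(hidden)                    # extra LN after the first FC layer
    out = matvec(w2, hidden)
    return [xi + oi for xi, oi in zip(x, out)]

# Toy usage: a 4-dimensional token and a hypothetical 2-head "attention"
# that simply splits the normalized input in half.
x = [0.5, -1.0, 2.0, 0.0]
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
split_heads = lambda h: [h[:2], h[2:]]
y = attention_sublayer(x, split_heads, gammas=[1.0, 1.0], w_out=identity)
```

Setting every γ to zero silences the attention branch entirely, so the sub-layer reduces to the residual path. That is one way to see how learned head scales can raise or lower the contribution of individual heads.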

EXPERIMENTAL RESULTS

Research Questions

  • RQ1: Does adding NormFormer’s extra normalization operations stabilize Pre-LN transformers and close gradient gaps across layers?
  • RQ2: Do NormFormer gains persist across model scales from 125M to 2.7B parameters?
  • RQ3: How do the added operations impact pretraining perplexity and downstream task performance (GLUE) compared to tuned Pre-LN baselines?
  • RQ4: What is the effect of residual scaling in NormFormer across different model scales?
  • RQ5: Are the gains robust to ablations removing any of the added components?

Key Findings

Masked LM, 125M: pretraining perplexity (PPL) and fine-tuned GLUE scores

Model       |θ| (M)  λ_resid  PPL   CoLA  MNLI  MRPC  QNLI  QQP   RTE   SST-2  Avg
Baseline    125.42   -        3.42  74.3  85.9  84.6  91.6  90.7  66.4  92.9   83.77
NormFormer  125.50   -        3.31  82.6  86.3  86.0  91.9  91.3  67.9  93.8   85.69
NormFormer  125.51   -        3.29  80.9  86.2  85.3  91.5  91.2  62.8  94.2   84.59

Causal LM: learning rate (LR), perplexity (PPL), and zero-shot accuracy on five tasks (the task names are not preserved in the source)

Model                   |θ| (M)  λ_resid  LR      PPL    Task 1  Task 2  Task 3  Task 4  Task 5  Avg
GPT3-355M (paper)       355.0    -        3e-4    -      -       -       -       -       -       -
GPT3-355M (replicated)  355.0    -        -       15.41  46.1    70.8    54.6    71.1    41.2    56.8
NormFormer-355M         355.0    -        -       14.54  49.7    71.8    56.0    73.8    43.6    59.0
NormFormer-355M         355.0    -        -       14.52  49.7    72.0    56.7    73.2    43.8    59.1
GPT3-1.3B (paper)       1313.5   -        2e-4    -      -       -       -       -       -       -
GPT3-1.3B (replicated)  1313.5   -        -       12.56  58.5    74.6    58.1    76.8    49.4    63.5
GPT3-1.3B (High LR)     1313.5   -        6e-4    -      57.5    74.3    59.3    76.3    50.8    63.6
NormFormer-1.3B         1314.0   -        6e-4    -      60.5    74.5    60.1    77.5    50.8    64.7
GPT3-2.7B (paper)       2648.7   -        1.6e-4  -      -       -       -       -       -       -
GPT3-2.7B (replicated)  2648.7   -        -       10.92  65.9    76.6    61.4    78.2    49.6    66.3
NormFormer-2.7B         2649.5   -        6e-4    -      68.1    78.1    64.4    79.4    53.4    68.7

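
The small differences in the |θ| column can be sanity-checked by counting the parameters the three operations would add. The sketch below assumes each extra LayerNorm carries a gain and a bias per dimension and that HeadScale adds one scalar per head. The layer counts, widths, and head counts are hypothetical GPT-3-style configurations, not values stated in this review, and `extra_params_per_layer` is an illustrative helper.

```python
def extra_params_per_layer(d_model, d_ffn, n_heads, affine_bias=True):
    """Parameters added to one layer by the three NormFormer operations."""
    per_dim = 2 if affine_bias else 1       # LayerNorm gain (+ bias) per dimension
    post_attn_ln = per_dim * d_model        # LN after self-attention
    ffn_ln = per_dim * d_ffn                # LN after the first FC layer
    head_gammas = n_heads                   # one HeadScale scalar per head
    return post_attn_ln + ffn_ln + head_gammas

# Hypothetical GPT-3-style configurations (assumed, not from the review).
configs = {
    "125M": dict(n_layers=12, d_model=768, d_ffn=3072, n_heads=12),
    "1.3B": dict(n_layers=24, d_model=2048, d_ffn=8192, n_heads=32),
}

extra_m = {}
for name, cfg in configs.items():
    per_layer = extra_params_per_layer(cfg["d_model"], cfg["d_ffn"], cfg["n_heads"])
    extra_m[name] = cfg["n_layers"] * per_layer / 1e6   # in millions of parameters

for name, value in extra_m.items():
    print(f"{name}: +{value:.2f}M parameters")
```

Under these assumptions the estimates come to about 0.09M and 0.49M extra parameters, close to the |θ| differences reported above (125.42 to 125.50 and 1313.5 to 1314.0).
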
  • NormFormer improves pretraining perplexity and downstream task performance for both causal and masked language models across sizes 125M–2.7B.
  • For 1.3B models, NormFormer reaches the baseline's perplexity 24% faster, or converges to 0.27 lower perplexity in the same compute budget.
  • In zero-shot evaluation, NormFormer achieves higher average accuracy than the replicated GPT-3 baselines at every size tested (355M, 1.3B, 2.7B).
  • After GLUE fine-tuning, the strongest 125M NormFormer MLM beats the Pre-LN baseline on every task, raising the average from 83.77 to 85.69 (+1.9).
  • Ablation studies show that removing any added operation degrades performance; HeadScale and the post-attention LN are particularly impactful.
  • Learned scaling parameters (γ) reduce early-layer fully connected (FC) gradients and downscale early FFN inputs, while HeadScale can emphasize certain heads, aiding stability and performance.
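
The headline gains can be recomputed from the numbers in the results table. The snippet below is a small arithmetic check, assuming the unnamed first 125M row is the Pre-LN baseline and that the Avg column is an unweighted mean of the task scores. The dictionary keys and variable names are illustrative.

```python
# GLUE task scores copied from the 125M rows of the results table
# (CoLA, MNLI, MRPC, QNLI, QQP, RTE, SST-2).
glue = {
    "baseline":   [74.3, 85.9, 84.6, 91.6, 90.7, 66.4, 92.9],
    "normformer": [82.6, 86.3, 86.0, 91.9, 91.3, 67.9, 93.8],
}

def average(scores):
    """Unweighted mean of a list of scores."""
    return sum(scores) / len(scores)

glue_avg = {name: round(average(scores), 2) for name, scores in glue.items()}
glue_gain = round(glue_avg["normformer"] - glue_avg["baseline"], 2)

# Zero-shot Avg column: (replicated GPT-3, NormFormer) per model size.
zero_shot_avg = {
    "355M": (56.8, 59.0),
    "1.3B": (63.5, 64.7),
    "2.7B": (66.3, 68.7),
}
zero_shot_gain = {
    size: round(normformer - baseline, 1)
    for size, (baseline, normformer) in zero_shot_avg.items()
}

print(glue_avg, glue_gain)
print(zero_shot_gain)
```

The recomputed GLUE averages match the table (83.77 and 85.69, a gain of 1.92), and the zero-shot average improves by 2.2, 1.2, and 2.4 points at 355M, 1.3B, and 2.7B respectively.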

This review was generated by AI and checked by a human editor.