QUICK REVIEW

[Paper Review] Understanding the Difficulty of Training Transformers

Liyuan Liu, Xiaodong Liu|arXiv (Cornell University)|Apr 17, 2020

Topic Modeling | References: 38 | Cited by: 28

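One-Sentence Summary

This paper argues that the root cause of instability in Transformer training is heavy dependency on the residual branch, which amplifies parameter perturbations. To address this, the authors propose Admin (Adaptive model initialization), which reduces residual dependency early in training to improve stability and releases the model's capacity later. With it they report state-of-the-art machine translation results, including 43.80 BLEU on WMT’14 En-Fr with a 72-layer Transformer (60 encoder layers and 12 decoder layers).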

ABSTRACT

Transformers have proved effective in many NLP tasks. However, their training requires non-trivial efforts regarding designing cutting-edge optimizers and learning rate schedulers carefully (e.g., conventional SGD fails to train Transformers effectively). Our objective here is to understand $\textit{what complicates Transformer training}$ from both empirical and theoretical perspectives. Our analysis reveals that unbalanced gradients are not the root cause of the instability of training. Instead, we identify an amplification effect that influences training substantially -- for each layer in a multi-layer Transformer model, heavy dependency on its residual branch makes training unstable, since it amplifies small parameter perturbations (e.g., parameter updates) and results in significant disturbances in the model output. Yet we observe that a light dependency limits the model potential and leads to inferior trained models. Inspired by our analysis, we propose Admin ($\textbf{Ad}$aptive $\textbf{m}$odel $\textbf{in}$itialization) to stabilize the early stage's training and unleash its full potential in the late stage. Extensive experiments show that Admin is more stable, converges faster, and leads to better performance. Implementations are released at: https://github.com/LiyuanLucasLiu/Transforemr-Clinic.

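Research Motivation and Objectives

Understand why training Transformers remains challenging despite their success on NLP tasks.
Investigate whether unbalanced gradients or other factors are the main cause of training instability.
Identify the structural design choices that affect training stability and model capacity, in particular dependency on the residual branch.
Develop a method that stabilizes early training without sacrificing the model's potential in later stages.
Achieve state-of-the-art performance with deep Transformer architectures, especially on machine translation tasks.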

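Proposed Method

The authors analyze how strongly each Transformer layer depends on its residual branch using a variance ratio, defining the dependency as Var[f(x)] / Var[x + f(x)], where f(x) is the output of the residual branch.
They compare the Post-LN and Pre-LN Transformer architectures and show that Post-LN layers depend more heavily on their residual branches, which makes them more prone to instability under parameter updates.
They propose Admin (Adaptive model initialization), which adjusts the scaling coefficients of the residual connections at initialization to reduce the dependency during early training.
Admin uses learnable scaling factors that initially damp residual updates and then grow as training progresses, releasing model capacity.
The method is applied only at model initialization and requires no additional hyperparameters or architectural changes.
Experiments cover IWSLT’14 De-En, WMT’14 En-De, and WMT’14 En-Fr across several depth configurations, including a 72-layer model.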
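
The variance-ratio measure and the effect of scaling the skip path can be illustrated with a small NumPy sketch. It is a toy under stated assumptions, not the authors' implementation: each residual branch is a random linear map rather than attention or a feed-forward block, the skip weight `omega` is a scalar per layer, and `admin_omegas` follows the paper's description of profiling branch variances at initialization and setting each weight to the square root of their running sum. Seeding that running sum with the input variance is a choice made for this sketch only. All function names are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean and unit variance (no learned gain or bias)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def post_ln_forward(x, weights, omegas):
    """Toy Post-LN stack: x_i = LN(omega_i * x_{i-1} + f_i(x_{i-1})).

    Returns the final output and, per layer, Var[f(x)] / Var[omega * x + f(x)].
    """
    dependencies = []
    for w, omega in zip(weights, omegas):
        branch = x @ w            # toy residual branch f_i (assumption: linear map)
        summed = omega * x + branch
        dependencies.append(branch.var() / summed.var())
        x = layer_norm(summed)
    return x, dependencies

def admin_omegas(x, weights):
    """Profile branch variances with omega = 1, then derive the skip weights.

    Assumption of this sketch: the running sum starts at Var[x_0] so that the
    first layer's skip path is not zeroed out.
    """
    running = x.var()
    omegas = []
    for w in weights:
        omegas.append(np.sqrt(running))
        branch = x @ w
        running += branch.var()
        x = layer_norm(x + branch)
    return np.array(omegas)

rng = np.random.default_rng(0)
dim, depth = 128, 6
x0 = rng.normal(size=(256, dim))
weights = [rng.normal(0.0, dim ** -0.5, size=(dim, dim)) for _ in range(depth)]

_, plain = post_ln_forward(x0, weights, np.ones(depth))
omegas = admin_omegas(x0, weights)
_, scaled = post_ln_forward(x0, weights, omegas)

print("dependency, omega = 1     :", np.round(plain, 3))
print("dependency, profiled omega:", np.round(scaled, 3))
```

In this toy setting every branch has roughly unit variance, so the unscaled stack keeps a dependency near 0.5 at every layer, while the profiled weights grow with depth and push the ratio down in later layers. Because the weights are ordinary parameters, training can change them later, which matches the summary's description of damping early and releasing capacity afterwards.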

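Experimental Results

Research Questions

RQ1: Apart from unbalanced gradients, what structural factors in Transformers cause training instability?
RQ2: Why does Post-LN training diverge more easily than Pre-LN training, even though the two show similar gradient behavior?
RQ3: How does dependency on the residual branch affect the propagation of parameter perturbations during training?
RQ4: Can controlling residual dependency at initialization stabilize the training of deep Transformers without sacrificing model capacity?
RQ5: Can an adaptive initialization method outperform the Post-LN and Pre-LN baselines on deep architectures?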

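Key Findings

Post-LN Transformers depend on their residual branches significantly more than Pre-LN variants, and this dependency amplifies small parameter perturbations, making training unstable.
Pre-LN models are more stable, but their weak residual dependency limits model capacity and leads to worse performance.
Admin trains stably on all evaluated datasets and architectures, including a 72-layer Transformer on WMT’14 En-Fr that could not previously be trained with standard methods.
With a 60-layer encoder and a 12-layer decoder on WMT’14 En-Fr, Admin achieves a new state-of-the-art BLEU score of 43.80.
Admin outperforms the standard Post-LN and Pre-LN baselines as well as a fine-tuned T5 model, indicating that it can release the model's full potential.
The method converges faster and trains more stably without introducing additional hyperparameters or architectural changes.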
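
The amplification effect behind the first finding can be probed with a small perturbation experiment. The sketch below is illustrative only and makes several assumptions: residual branches are random linear maps, the "parameter update" is a fixed small random perturbation of every branch, and the damped stack uses skip weights of sqrt(i) for layer i, which approximates what variance profiling would yield when each branch has unit output variance. Function names are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def forward(x, weights, omegas):
    """Toy Post-LN stack: x_i = LN(omega_i * x_{i-1} + x_{i-1} @ W_i)."""
    for w, omega in zip(weights, omegas):
        x = layer_norm(omega * x + x @ w)
    return x

def output_shift(x, weights, omegas, noise, scale=1e-2):
    """Mean squared change in the output after perturbing every branch weight."""
    clean = forward(x, weights, omegas)
    perturbed_weights = [w + scale * n for w, n in zip(weights, noise)]
    perturbed = forward(x, perturbed_weights, omegas)
    return float(np.mean((clean - perturbed) ** 2))

rng = np.random.default_rng(0)
dim, depth = 128, 12
x0 = rng.normal(size=(256, dim))
weights = [rng.normal(0.0, dim ** -0.5, size=(dim, dim)) for _ in range(depth)]
noise = [rng.normal(0.0, dim ** -0.5, size=(dim, dim)) for _ in range(depth)]

# Heavy dependency: the skip path and the branch are weighted equally.
heavy = output_shift(x0, weights, np.ones(depth), noise)
# Lighter dependency: skip weights grow with depth (assumed sqrt(i) schedule).
light = output_shift(x0, weights, np.sqrt(np.arange(1, depth + 1)), noise)

print(f"output shift with omega = 1      : {heavy:.2e}")
print(f"output shift with omega = sqrt(i): {light:.2e}")
```

The same perturbation moves the output less when each layer leans less on its residual branch, which is the mechanism the summary credits for Admin's early-stage stability. The toy says nothing about final model quality, which the paper evaluates empirically.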

This review was generated by AI and reviewed by human editors.