QUICK REVIEW

[Paper Review] Rethinking Batch Normalization in Transformers

Sheng Shen, Zhewei Yao|arXiv (Cornell University)|Mar 17, 2020

Advanced Neural Network Applications|Cited by 5

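One-Sentence Summary

This paper proposes Power Normalization (PN), a normalization technique for Transformer models in natural language processing. It targets the instability of batch normalization (BN) that arises because the statistics of NLP data fluctuate strongly across batches. By relaxing the zero-mean constraint, using a running quadratic mean, and applying an approximate backpropagation, PN outperforms both BN and layer normalization (LN) in training stability and task performance; it improves on LN by 0.6 BLEU on WMT14 and by 3.0 PPL on WikiText-103.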

ABSTRACT

The standard normalization method for neural network (NN) models used in Natural Language Processing (NLP) is layer normalization (LN). This is different than batch normalization (BN), which is widely-adopted in Computer Vision. The preferred use of LN in NLP is principally due to the empirical observation that a (naive/vanilla) use of BN leads to significant performance degradation for NLP tasks; however, a thorough understanding of the underlying reasons for this is not always evident. In this paper, we perform a systematic study of NLP transformer models to understand why BN has a poor performance, as compared to LN. We find that the statistics of NLP data across the batch dimension exhibit large fluctuations throughout training. This results in instability, if BN is naively implemented. To address this, we propose Power Normalization (PN), a novel normalization scheme that resolves this issue by (i) relaxing zero-mean normalization in BN, (ii) incorporating a running quadratic mean instead of per batch statistics to stabilize fluctuations, and (iii) using an approximate backpropagation for incorporating the running statistics in the forward pass. We show theoretically, under mild assumptions, that PN leads to a smaller Lipschitz constant for the loss, compared with BN. Furthermore, we prove that the approximate backpropagation scheme leads to bounded gradients. We extensively test PN for transformers on a range of NLP tasks, and we show that it significantly outperforms both LN and BN. In particular, PN outperforms LN by 0.4/0.6 BLEU on IWSLT14/WMT14 and 5.6/3.0 PPL on PTB/WikiText-103. We make our code publicly available.

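Motivation and Objectives

Investigate why batch normalization (BN) performs worse than layer normalization (LN) in NLP Transformers.
Identify the root cause of BN's poor performance in NLP, in particular the instability caused by large fluctuations of the statistics across the batch dimension.
Design a new normalization scheme that stabilizes NLP training by addressing these fluctuations while preserving training efficiency.
Support the proposed method with a theoretical analysis showing a smaller Lipschitz constant for the loss and bounded gradients.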
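The batch-versus-layer distinction behind these objectives can be illustrated with a small sketch. It is not taken from the paper: the toy activations and the helper names `batch_stats` and `layer_stats` are assumptions made for illustration, and plain Python lists are used to keep the example dependency-free. BN computes one mean and variance per feature across the batch, so its statistics move whenever the batch composition changes. LN computes one mean and variance per sample across that sample's features, so a sample's statistics do not depend on the rest of the batch.

```python
from statistics import fmean, pvariance


def batch_stats(batch):
    """BN-style statistics: one (mean, variance) pair per feature, across the batch."""
    columns = list(zip(*batch))  # transpose: one tuple per feature
    return [(fmean(col), pvariance(col)) for col in columns]


def layer_stats(batch):
    """LN-style statistics: one (mean, variance) pair per sample, across its features."""
    return [(fmean(row), pvariance(row)) for row in batch]


# Two toy mini-batches (rows = samples, columns = features); the values are made up.
batch_a = [[1.0, 2.0], [3.0, 4.0]]
batch_b = [[1.0, 2.0], [9.0, 10.0]]  # same first sample, different second sample

bn_a, bn_b = batch_stats(batch_a), batch_stats(batch_b)
ln_a, ln_b = layer_stats(batch_a), layer_stats(batch_b)

# Statistics for feature 0 change with the batch; statistics for sample 0 do not.
print("BN, feature 0:", bn_a[0], "vs", bn_b[0])
print("LN, sample 0: ", ln_a[0], "vs", ln_b[0])
```

In this toy case the first sample is identical in both batches, yet the BN statistics for feature 0 differ between them while the LN statistics for that sample stay the same.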

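Proposed Method

Relax the zero-mean constraint of batch normalization to reduce sensitivity to batch statistics.
Replace per-batch statistics with a running quadratic mean to stabilize the statistics during training.
Introduce an approximate backpropagation scheme so that the running statistics can be used in the forward pass.
Show theoretically that, under mild assumptions, PN yields a smaller Lipschitz constant for the loss than BN.
Prove that the approximate backpropagation scheme yields bounded gradients, which supports training stability.
Integrate the method into the Transformer architecture and evaluate it on several NLP benchmarks.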
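The first two ingredients listed above can be sketched for a single feature as follows. This is a minimal, forward-only illustration under stated assumptions, not the authors' implementation: the class name `PowerNormSketch`, the exponential-moving-average update with momentum `alpha`, the initial value of the running quadratic mean, and the choice to normalize with the value from before the current update are all assumptions. The approximate backward pass described in the paper is not implemented here.

```python
import math


class PowerNormSketch:
    """Forward-only sketch of Power Normalization for one feature (hypothetical helper)."""

    def __init__(self, alpha=0.9, eps=1e-5, gamma=1.0, beta=0.0):
        self.alpha = alpha            # momentum of the running average (assumed form)
        self.eps = eps                # small constant for numerical safety
        self.gamma = gamma            # scale parameter, as in BN/LN
        self.beta = beta              # shift parameter, as in BN/LN
        self.running_sq_mean = 1.0    # running estimate of E[x^2] (assumed initial value)

    def forward(self, batch, training=True):
        # Divide by the running quadratic mean; no batch mean is subtracted.
        scale = math.sqrt(self.running_sq_mean + self.eps)
        out = [self.gamma * x / scale + self.beta for x in batch]
        if training:
            # Update the running quadratic mean from this batch's mean of squares.
            batch_sq_mean = sum(x * x for x in batch) / len(batch)
            self.running_sq_mean = (
                self.alpha * self.running_sq_mean
                + (1.0 - self.alpha) * batch_sq_mean
            )
        return out


pn = PowerNormSketch(alpha=0.5, eps=0.0)
first = pn.forward([2.0, -2.0])              # normalized with the initial running value
second = pn.forward([1.0], training=False)   # normalized with the updated running value
print(first, pn.running_sq_mean, second)
```

Because the output depends on a statistic accumulated over earlier batches rather than on the current batch alone, a single unusual batch changes the scale only gradually. That dependence on running statistics is also why the paper pairs the forward pass with an approximate backpropagation.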

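Experimental Results

Research Questions

RQ1: Why does standard batch normalization degrade performance relative to layer normalization in NLP Transformers?
RQ2: Which specific properties of NLP data cause instability when batch normalization is applied naively?
RQ3: Can a modified normalization scheme based on running statistics improve training stability and performance in NLP?
RQ4: Does the proposed normalization method achieve better generalization and faster convergence than existing methods?
RQ5: Can theoretical guarantees, such as bounded gradients and a smaller Lipschitz constant, be established for the proposed method?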

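Key Findings

On IWSLT14, Power Normalization (PN) outperforms layer normalization (LN) by 0.4 BLEU.
On WMT14, PN outperforms LN by 0.6 BLEU, indicating a consistent gain on translation tasks.
On the PTB language modeling benchmark, PN lowers perplexity by 5.6 points relative to LN.
On WikiText-103, PN improves perplexity by 3.0 points over LN, indicating stronger language modeling performance.
The theoretical analysis shows that, under mild assumptions, PN yields a smaller Lipschitz constant for the loss than BN.
The approximate backpropagation scheme in PN guarantees bounded gradients, which supports training stability.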

This review was generated by AI and reviewed by a human editor.