QUICK REVIEW

[Paper Review] Root Mean Square Layer Normalization

Biao Zhang, Rico Sennrich | arXiv (Cornell University) | Oct 16, 2019

Neural Networks and Applications | References: 30 | Cited by: 101

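One-Sentence Summary

RMSNorm replaces LayerNorm by normalizing the summed inputs with their root mean square (RMS) and dropping mean centering, which gives faster training with comparable performance across tasks.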

ABSTRACT

Layer normalization (LayerNorm) has been successfully applied to various deep neural networks to help stabilize training and boost model convergence because of its capability in handling re-centering and re-scaling of both inputs and weight matrix. However, the computational overhead introduced by LayerNorm makes these improvements expensive and significantly slows the underlying network, e.g. RNN in particular. In this paper, we hypothesize that re-centering invariance in LayerNorm is dispensable and propose root mean square layer normalization, or RMSNorm. RMSNorm regularizes the summed inputs to a neuron in one layer according to root mean square (RMS), giving the model re-scaling invariance property and implicit learning rate adaptation ability. RMSNorm is computationally simpler and thus more efficient than LayerNorm. We also present partial RMSNorm, or pRMSNorm where the RMS is estimated from p% of the summed inputs without breaking the above properties. Extensive experiments on several tasks using diverse network architectures show that RMSNorm achieves comparable performance against LayerNorm but reduces the running time by 7%~64% on different models. Source code is available at https://github.com/bzhangGo/rmsnorm.

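Motivation and Objectives

Argue for dropping mean centering (re-centering) from LayerNorm and test whether RMS-based re-scaling alone is enough for stable training.
Propose RMSNorm and partial RMSNorm (pRMSNorm) as drop-in replacements for LayerNorm.
Evaluate RMSNorm on NLP, vision, and cross-modal tasks to assess accuracy and speedup.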

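Proposed Method

Normalize the summed inputs to a neuron by dividing them by RMS(a) and scaling by a gain g (Eq. 4).
Position RMSNorm as a drop-in replacement for LayerNorm and analyze its invariance properties (Table 1).
Derive the gradients under RMSNorm to show stability and implicit learning-rate adaptation (Eq. 8, Eq. 9).
Introduce pRMSNorm, which estimates the RMS from the first p% of the summed inputs (k = ceil(n*p), with p taken as a fraction).
Compare against LayerNorm, BatchNorm, and other baselines across several architectures and frameworks.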
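
The normalization steps listed above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' released code: the function names, the `eps` stabilizer, the list-based vectors, and the omission of the bias term are all assumptions made here. It shows RMSNorm, pRMSNorm with k = ceil(n*p), and a bias-free LayerNorm for contrast.

```python
import math
import random


def rms(a):
    """Root mean square of a vector: sqrt(mean(a_i^2))."""
    return math.sqrt(sum(x * x for x in a) / len(a))


def rms_norm(a, g, eps=1e-8):
    """RMSNorm: divide each summed input by RMS(a), then scale by the gain g."""
    denom = rms(a) + eps  # eps is an assumed numerical safeguard
    return [gi * x / denom for x, gi in zip(a, g)]


def p_rms_norm(a, g, p=0.0625, eps=1e-8):
    """pRMSNorm: estimate the RMS from the first k = ceil(n * p) inputs only."""
    k = math.ceil(len(a) * p)
    denom = rms(a[:k]) + eps
    return [gi * x / denom for x, gi in zip(a, g)]


def layer_norm(a, g, eps=1e-8):
    """Bias-free LayerNorm for comparison: re-center by the mean, then re-scale."""
    mu = sum(a) / len(a)
    centered = [x - mu for x in a]
    denom = rms(centered) + eps  # RMS of centered values is the standard deviation
    return [gi * x / denom for x, gi in zip(centered, g)]


def max_gap(u, v):
    """Largest element-wise absolute difference between two vectors."""
    return max(abs(x - y) for x, y in zip(u, v))


random.seed(0)
n = 64                                          # hypothetical layer width
a = [random.gauss(0.5, 2.0) for _ in range(n)]  # stand-in summed inputs
g = [1.0] * n                                   # gain initialized to 1

out = rms_norm(a, g)
scaled = rms_norm([3.0 * x for x in a], g)
shifted = rms_norm([x + 1.0 for x in a], g)
ln_shifted = layer_norm([x + 1.0 for x in a], g)

print(rms(out))                             # close to 1 when g = 1
print(max_gap(out, scaled))                 # near zero: re-scaling invariance
print(max_gap(out, shifted))                # clearly non-zero: no re-centering invariance
print(max_gap(layer_norm(a, g), ln_shifted))  # near zero: LayerNorm re-centers
```

With n = 64 and p = 0.0625, pRMSNorm estimates the RMS from only the first 4 inputs, which is where its lower cost comes from.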

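Experimental Results

Research Questions

RQ1: Does RMSNorm match LayerNorm's task performance across models and datasets?
RQ2: Does RMSNorm train faster than LayerNorm while preserving accuracy?
RQ3: What are RMSNorm's invariance and gradient properties under input and weight re-scaling?
RQ4: How does partial RMSNorm (pRMSNorm) trade off accuracy against efficiency?
RQ5: Is RMSNorm robust to different initializations and architectures (RNN, CNN, Transformer)?

Key Findings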

Model	Test14	Test17	Time (reduction vs. LayerNorm)
Baseline	21.7	23.4	399 ± 3.40s
LayerNorm	22.6	23.6	665 ± 32.5s
L2-Norm	20.7	22.0	482 ± 19.7s
RMSNorm	22.4	23.7	501 ± 11.8s (24.7%)
pRMSNorm	22.6	23.1	493 ± 10.7s (25.9%)
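
The percentages in the Time column are consistent with a reduction in mean running time measured against LayerNorm. The short check below recomputes them from the table's own numbers. The dictionary and the helper name `reduction_vs` are introduced here purely for illustration.

```python
# Mean running times in seconds, copied from the table above.
times = {
    "Baseline": 399.0,
    "LayerNorm": 665.0,
    "L2-Norm": 482.0,
    "RMSNorm": 501.0,
    "pRMSNorm": 493.0,
}


def reduction_vs(reference, model):
    """Percentage of the reference running time saved by the given model."""
    return 100.0 * (times[reference] - times[model]) / times[reference]


for name in ("RMSNorm", "pRMSNorm"):
    print(f"{name}: {reduction_vs('LayerNorm', name):.1f}%")
# RMSNorm: 24.7%
# pRMSNorm: 25.9%
```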

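On RNNSearch, RMSNorm reaches BLEU scores comparable to LayerNorm while cutting running time by 24.7%; pRMSNorm cuts it by 25.9% (Table 2).
Across models, architectures, and frameworks, RMSNorm reduces running time relative to LayerNorm by 7%–64% (abstract and experiments).
In the Transformer experiments, RMSNorm matches LayerNorm's BLEU scores with a 7%–9% speedup (Table 5).
Partial RMSNorm (6.25%) generally performs close to RMSNorm and yields a notable but framework-dependent speedup (Table 2, Table 3).
RMSNorm stabilizes activations and gradients without explicit mean normalization of the inputs and can serve as a robust drop-in replacement for LayerNorm (discussion of invariance and robustness).
Across tasks (machine translation, image-caption retrieval, CNN on CIFAR-10), RMSNorm consistently converges faster than the Baseline and often matches or exceeds LayerNorm in efficiency (Tables 2–10).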
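
The implicit learning-rate adaptation mentioned in the method summary follows from re-scaling invariance: if scaling the weight matrix W by a factor delta leaves RMSNorm(Wx) unchanged, the gradient with respect to W must shrink by 1/delta. The sketch below checks this numerically with finite differences. The toy loss, the fixed gain of 1, the tiny dimensions, and all names are assumptions chosen for illustration; this is not the paper's experimental setup.

```python
import math
import random


def rms_norm_linear(W, x):
    """Compute RMSNorm(Wx) with the gain fixed to 1 (a simplifying assumption)."""
    a = [sum(w * xj for w, xj in zip(row, x)) for row in W]
    denom = math.sqrt(sum(v * v for v in a) / len(a))
    return [v / denom for v in a]


def loss(W, x, c):
    """Toy scalar loss: a fixed linear readout c of the normalized activations."""
    return sum(ci * yi for ci, yi in zip(c, rms_norm_linear(W, x)))


def numeric_grad(W, x, c, h=1e-6):
    """Central finite-difference gradient of the toy loss with respect to W."""
    grad = []
    for i, row in enumerate(W):
        grad_row = []
        for j in range(len(row)):
            orig = W[i][j]
            W[i][j] = orig + h
            up = loss(W, x, c)
            W[i][j] = orig - h
            down = loss(W, x, c)
            W[i][j] = orig  # restore the entry before moving on
            grad_row.append((up - down) / (2 * h))
        grad.append(grad_row)
    return grad


random.seed(0)
m, n = 4, 3  # hypothetical output and input sizes
W = [[random.gauss(0, 1) for _ in range(n)] for _ in range(m)]
x = [random.gauss(0, 1) for _ in range(n)]
c = [random.gauss(0, 1) for _ in range(m)]

delta = 10.0
W_scaled = [[delta * w for w in row] for row in W]

g1 = numeric_grad(W, x, c)
g2 = numeric_grad(W_scaled, x, c)

# The forward pass is unchanged by the weight re-scaling ...
loss_gap = abs(loss(W, x, c) - loss(W_scaled, x, c))
# ... while delta times the scaled-weight gradient recovers the original gradient.
grad_gap = max(abs(delta * b - a) for ra, rb in zip(g1, g2) for a, b in zip(ra, rb))
print(loss_gap, grad_gap)  # both should be close to zero
```

In other words, larger weights receive proportionally smaller gradient updates, which acts like an automatic reduction of the effective learning rate.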

This review was generated by AI and reviewed by human editors.