QUICK REVIEW

[Paper Review] Layer Normalization

Jimmy Ba, Jamie Kiros|arXiv (Cornell University)|Jul 21, 2016

Neural Networks and Applications|References: 22|Cited by: 498

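One-Sentence Summary

Introduces layer normalization, which stabilizes and accelerates the training of a range of neural networks, including RNNs, by normalizing the summed inputs within each layer rather than across a mini-batch.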

ABSTRACT

Training state-of-the-art, deep neural networks is computationally expensive. One way to reduce the training time is to normalize the activities of the neurons. A recently introduced technique called batch normalization uses the distribution of the summed input to a neuron over a mini-batch of training cases to compute a mean and variance which are then used to normalize the summed input to that neuron on each training case. This significantly reduces the training time in feed-forward neural networks. However, the effect of batch normalization is dependent on the mini-batch size and it is not obvious how to apply it to recurrent neural networks. In this paper, we transpose batch normalization into layer normalization by computing the mean and variance used for normalization from all of the summed inputs to the neurons in a layer on a single training case. Like batch normalization, we also give each neuron its own adaptive bias and gain which are applied after the normalization but before the non-linearity. Unlike batch normalization, layer normalization performs exactly the same computation at training and test times. It is also straightforward to apply to recurrent neural networks by computing the normalization statistics separately at each time step. Layer normalization is very effective at stabilizing the hidden state dynamics in recurrent networks. Empirically, we show that layer normalization can substantially reduce the training time compared with previously published techniques.
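
The statistics described in the abstract can be written out as a short worked formula. The notation here is an assumption chosen to follow the paper's Eq. 3: a_i is the summed input to hidden unit i of a layer for one training case, H is the number of hidden units in that layer, g_i and b_i are the per-neuron gain and bias, and f is the non-linearity.

```latex
\mu = \frac{1}{H} \sum_{i=1}^{H} a_i,
\qquad
\sigma = \sqrt{\frac{1}{H} \sum_{i=1}^{H} \left(a_i - \mu\right)^{2}},
\qquad
h_i = f\!\left(\frac{g_i}{\sigma}\,\left(a_i - \mu\right) + b_i\right)
```

Both sums run over the hidden units of a single training case, so no quantity depends on the mini-batch.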

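Motivation and Objectives

Speed up the training of deep networks through normalization.
Propose layer normalization as an alternative to batch normalization that works in the online setting and applies to RNNs.
Analyze the invariance properties and learning dynamics under normalization.
Validate layer normalization empirically across a variety of tasks and architectures.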

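Proposed Method

Compute the mean and variance used for normalization over the hidden units of each layer, separately for each training case.
Apply an adaptive gain and bias after normalization and before the non-linearity.
For RNNs, normalize at every time step using the statistics of the current layer at that step (Eq. 4).
Compare the invariance properties with those of batch normalization and weight normalization (Section 5).
Provide a theoretical analysis based on Fisher information to discuss the implicit learning-rate effect.
Evaluate empirically on image-sentence ranking, question answering, language modeling, skip-thoughts, handwriting sequence generation, MNIST, and CNNs.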
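
The normalization and recurrent steps listed above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' code: the names `layer_norm` and `ln_rnn_step`, the `eps` stabilizer, the tanh non-linearity, and the toy shapes are all chosen for this example.

```python
import numpy as np

def layer_norm(a, gain, bias, eps=1e-5):
    """Normalize summed inputs `a` over the hidden-unit axis (last axis).

    Statistics are computed per training case, so each row is normalized
    independently of the other rows in the batch. `eps` is an assumed
    small constant added for numerical stability.
    """
    mu = a.mean(axis=-1, keepdims=True)
    sigma = np.sqrt(a.var(axis=-1, keepdims=True) + eps)
    return gain * (a - mu) / sigma + bias

def ln_rnn_step(x_t, h_prev, W_xh, W_hh, gain, bias):
    """One recurrent step: normalize the summed inputs, then apply tanh."""
    a_t = x_t @ W_xh + h_prev @ W_hh  # summed inputs at this time step
    return np.tanh(layer_norm(a_t, gain, bias))

# Toy setup (hypothetical sizes): gain starts at 1 and bias at 0.
rng = np.random.default_rng(0)
batch, n_in, n_hid, steps = 4, 3, 8, 5
W_xh = rng.normal(size=(n_in, n_hid))
W_hh = rng.normal(size=(n_hid, n_hid))
gain, bias = np.ones(n_hid), np.zeros(n_hid)

h = np.zeros((batch, n_hid))
for t in range(steps):
    x_t = rng.normal(size=(batch, n_in))
    # The mean and variance are recomputed from the current summed inputs
    # at every step; nothing is stored per time step.
    h = ln_rnn_step(x_t, h, W_xh, W_hh, gain, bias)
```

Because the same gain and bias are reused at every step and no running averages are kept, the function behaves identically at training and test time.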

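Experimental Results

Research Questions

RQ1: Does layer normalization improve training speed and generalization across architectures (RNNs, CNNs, DRAW) and tasks?
RQ2: How do the invariance properties and learning dynamics under layer normalization compare with those of batch normalization and weight normalization?
RQ3: Can layer normalization support online learning and long-sequence training in RNNs without storing separate statistics for each time step?
RQ4: What is the practical effect of layer normalization on long sequences and small mini-batches?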

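Key Findings

Layer normalization speeds up training and improves generalization across several tasks, especially for recurrent networks and long sequences.
Layer normalization is invariant to per-training-case feature shifting and scaling, and it does not depend on the mini-batch size.
By re-centering and re-scaling the summed inputs within a layer (Eq. 3 and 4), it yields stable hidden-state dynamics.
Experiments show faster convergence and better validation performance on image-sentence ranking, QA, skip-thoughts, DRAW, handwriting sequence generation, and permutation-invariant MNIST.
Layer normalization is less sensitive than recurrent batch normalization to the initial scale of the gain in recurrent models.
In CNNs, layer normalization trains faster than the unnormalized baseline, but batch normalization can still outperform it in some settings.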
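
The shift-and-scale invariance and the independence from mini-batch size can be checked numerically. The sketch below is illustrative only: it applies the shift and scale directly to the summed inputs of each training case (a simplifying assumption), omits the gain and bias, and uses a hypothetical `layer_norm` helper with an assumed `eps`.

```python
import numpy as np

def layer_norm(a, eps=1e-8):
    """Per-case normalization over the last axis (gain and bias omitted)."""
    mu = a.mean(axis=-1, keepdims=True)
    sigma = np.sqrt(a.var(axis=-1, keepdims=True) + eps)
    return (a - mu) / sigma

rng = np.random.default_rng(0)
a = rng.normal(size=(6, 16))  # 6 training cases, 16 hidden units

# Shift and scale every training case by its own constants.
scale = rng.uniform(0.5, 5.0, size=(6, 1))
shift = rng.normal(size=(6, 1))
same_after_rescale = np.allclose(
    layer_norm(a), layer_norm(scale * a + shift), atol=1e-5
)

# Normalizing one case alone matches normalizing it inside the batch.
same_without_batch = np.allclose(layer_norm(a)[:1], layer_norm(a[:1]))

print(same_after_rescale, same_without_batch)
```

Both checks hold because the mean and variance are taken over the hidden units of one case at a time, so a per-case shift cancels in the subtraction and a positive per-case scale cancels in the division.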

This review was generated by AI and reviewed by human editors.