QUICK REVIEW

[Paper Review] Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Sergey Ioffe, Christian Szegedy|arXiv (Cornell University)|Feb 11, 2015

Neural Networks and Applications | References: 23 | Cited by: 24,246

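One-Sentence Summary

This paper proposes Batch Normalization, a method that normalizes layer inputs over each mini-batch to reduce internal covariate shift. This permits higher learning rates, acts as a regularizer, speeds up training, and yields state-of-the-art results on ImageNet.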

ABSTRACT

Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization. It also acts as a regularizer, in some cases eliminating the need for Dropout. Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.9% top-5 validation error (and 4.8% test error), exceeding the accuracy of human raters.

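Motivation and Objectives

Motivate the problem of internal covariate shift that arises in deep networks during training.
Propose a normalization technique that is built into the network architecture and operates on mini-batches.
Show that BN permits higher learning rates and acts as a regularizer, reducing or eliminating the need for Dropout.
Show that BN accelerates training and achieves higher accuracy on a large-scale vision task (ImageNet).
Provide practical guidance for training batch-normalized networks and running inference with them.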

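Proposed Method

Insert the Batch Normalization transform before the nonlinearity, using mini-batch statistics to normalize each activation dimension to zero mean and unit variance.
Learn a per-dimension scale (gamma) and shift (beta) so that the network keeps its representational power.
Backpropagate through the BN transform to update gamma, beta, and the parameters of the preceding layers.
At inference, use population statistics (or their moving averages) instead of mini-batch statistics so that the output is deterministic.
Apply BN to convolutional networks by normalizing each feature map jointly over the batch and all spatial locations.
Demonstrate training at higher learning rates, reduced sensitivity to initialization, and a reduced need for Dropout.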
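
The training-time and inference-time behavior described above can be sketched in a few lines of plain Python. This is a minimal illustration for fully connected activations, not the authors' implementation: the class name `BatchNorm1D`, the constants `EPS` and `MOMENTUM`, and the use of an exponential moving average to track population statistics are assumptions made for the example. Backpropagation is omitted. For convolutional layers the same statistics would instead be pooled over the batch and all spatial positions of each feature map.

```python
import math

EPS = 1e-5       # numerical-stability constant (assumed value)
MOMENTUM = 0.9   # moving-average decay (assumed value, not from the paper)


class BatchNorm1D:
    """Illustrative batch normalization over a list of feature vectors."""

    def __init__(self, num_features):
        self.gamma = [1.0] * num_features         # learned scale
        self.beta = [0.0] * num_features          # learned shift
        self.running_mean = [0.0] * num_features  # population mean estimate
        self.running_var = [1.0] * num_features   # population variance estimate

    def forward_train(self, batch):
        """Normalize each feature with the statistics of this mini-batch."""
        m = len(batch)
        out = [[0.0] * len(self.gamma) for _ in range(m)]
        for k in range(len(self.gamma)):
            column = [row[k] for row in batch]
            mean = sum(column) / m
            var = sum((x - mean) ** 2 for x in column) / m  # biased variance
            inv_std = 1.0 / math.sqrt(var + EPS)
            for i, x in enumerate(column):
                out[i][k] = self.gamma[k] * (x - mean) * inv_std + self.beta[k]
            # Track population statistics for inference; the variance uses
            # the unbiased estimate m / (m - 1) * var.
            unbiased = var * m / (m - 1) if m > 1 else var
            self.running_mean[k] = (MOMENTUM * self.running_mean[k]
                                    + (1 - MOMENTUM) * mean)
            self.running_var[k] = (MOMENTUM * self.running_var[k]
                                   + (1 - MOMENTUM) * unbiased)
        return out

    def forward_inference(self, batch):
        """Normalize with the tracked statistics, independent of the batch."""
        out = []
        for row in batch:
            normalized_row = []
            for k, x in enumerate(row):
                inv_std = 1.0 / math.sqrt(self.running_var[k] + EPS)
                x_hat = (x - self.running_mean[k]) * inv_std
                normalized_row.append(self.gamma[k] * x_hat + self.beta[k])
            out.append(normalized_row)
        return out


bn = BatchNorm1D(num_features=2)
batch = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
normalized = bn.forward_train(batch)

# At inference, an example's output does not depend on its batch companions.
alone = bn.forward_inference([[1.0, 10.0]])[0]
with_others = bn.forward_inference([[1.0, 10.0], [100.0, -5.0]])[0]
```

After `forward_train`, each column of `normalized` has approximately zero mean and unit variance, while `alone` and `with_others` are identical because inference relies only on the tracked statistics.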

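Experimental Results

Research Questions

RQ1: Does building batch-level normalization into the network reduce internal covariate shift and accelerate the training of deep networks?
RQ2: Does BN permit higher learning rates without divergence and improve gradient flow across layers?
RQ3: How does BN affect regularization and generalization, compared with or combined with Dropout?
RQ4: How does BN affect performance on a large-scale vision task such as ImageNet, for both single networks and ensembles?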

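Key Findings

Batch Normalization permits substantially higher learning rates and reduces sensitivity to parameter initialization.
Networks with BN converge faster and reach the same or better accuracy in far fewer training steps (for example, a batch-normalized variant of the ImageNet classification model matches the original model's accuracy in about 14 times fewer steps).
BN achieved state-of-the-art results on ImageNet at the time of publication: an ensemble of batch-normalized networks reaches 4.9% top-5 validation error (4.8% test error).
BN-Baseline matches Inception's accuracy in fewer than half the training steps, and further BN variants reach higher final accuracy (for example, BN-x30 reaches 74.8% top-1 validation accuracy).
Batch Normalization reduces or eliminates the need for Dropout in some settings, and it makes training with saturating nonlinearities such as the sigmoid feasible.
BN improves gradient propagation by making the layer Jacobian insensitive to the scale of the parameters, and it may also regularize the model.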
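
The scale-insensitivity claim in the last finding rests on the identity BN(Wu) = BN((aW)u) for any positive scalar a. The following sketch checks that identity numerically in plain Python. The helper names `batch_normalize` and `linear`, the toy inputs, and the choice of `eps=0.0` (so that the identity holds up to floating-point error) are assumptions made for this illustration. Learned scale and shift parameters are left out because they do not affect the argument.

```python
import math


def batch_normalize(batch, eps=0.0):
    """Normalize each column of `batch` to zero mean and unit variance."""
    m = len(batch)
    n = len(batch[0])
    out = [[0.0] * n for _ in range(m)]
    for k in range(n):
        column = [row[k] for row in batch]
        mean = sum(column) / m
        var = sum((x - mean) ** 2 for x in column) / m
        for i, x in enumerate(column):
            out[i][k] = (x - mean) / math.sqrt(var + eps)
    return out


def linear(batch, weights):
    """Compute W u for every input vector u (each weight row is one unit)."""
    return [[sum(w * x for w, x in zip(w_row, u)) for w_row in weights]
            for u in batch]


inputs = [[0.5, -1.0], [1.5, 2.0], [-0.5, 0.0], [2.0, 1.0]]  # toy mini-batch
W = [[0.3, -0.7], [1.1, 0.4]]                                # toy weights
a = 10.0                                                     # weight scale
scaled_W = [[a * w for w in row] for row in W]

bn_original = batch_normalize(linear(inputs, W))
bn_scaled = batch_normalize(linear(inputs, scaled_W))

# Largest elementwise difference between the two normalized outputs.
max_diff = max(abs(p - q)
               for row_p, row_q in zip(bn_original, bn_scaled)
               for p, q in zip(row_p, row_q))
```

Here `max_diff` is zero up to rounding: scaling the weights by `a` scales both the centered activations and their standard deviation by `a`, so the factor cancels. Because the normalized output is unchanged, the gradient flowing back to the layer input is unchanged as well.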

This review was generated by AI and reviewed by a human editor.