QUICK REVIEW

[Paper Review] Restructuring Batch Normalization to Accelerate CNN Training

Wonkyung Jung, Dae-Jin Jung|arXiv (Cornell University)|Jul 4, 2018

Advanced Neural Network Applications | References: 49 | Cited by: 52

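One-Sentence Summary

The paper proposes BN Fission-n-Fusion (BNFF), which restructures Batch Normalization layers to reduce memory accesses and speed up the training of modern CNNs such as DenseNet-121 and ResNet-50. On a Skylake CPU, training is up to 25.7% faster for DenseNet-121 and 16.1% faster for ResNet-50.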

ABSTRACT

Batch Normalization (BN) has become a core design block of modern Convolutional Neural Networks (CNNs). A typical modern CNN has a large number of BN layers in its lean and deep architecture. BN requires mean and variance calculations over each mini-batch during training. Therefore, the existing memory access reduction techniques, such as fusing multiple CONV layers, are not effective for accelerating BN due to their inability to optimize mini-batch related calculations during training. To address this increasingly important problem, we propose to restructure BN layers by first splitting a BN layer into two sub-layers (fission) and then combining the first sub-layer with its preceding CONV layer and the second sub-layer with the following activation and CONV layers (fusion). The proposed solution can significantly reduce main-memory accesses while training the latest CNN models, and the experiments on a chip multiprocessor show that the proposed BN restructuring can improve the performance of DenseNet-121 by 25.7%.

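Motivation and Objectives

Explain why non-convolutional layers, Batch Normalization in particular, matter more and more when training modern deep CNNs.
Analyze the memory-bandwidth bottleneck that BN layers create when training deep models such as DenseNet-121.
Develop a BN restructuring (fission and fusion) that minimizes main-memory accesses.
Demonstrate performance gains for DenseNet-121 and ResNet-50 on CPU (Skylake) and GPU platforms.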

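Proposed Method

Split each BN layer into two sub-layers (fission).
Fuse the first sub-layer with the preceding CONV layer (CONV1-(sub-BN1)).
Fuse the second sub-layer with the following ReLU and CONV layers (sub-BN2-ReLU-CONV2).
Use Mean/Variance Fusion (MVF) to compute BN's mean and variance in a single sweep over memory.
Optionally extend BNFF with Inter-Composite-Layer Fusion (ICF), which fuses BN across composite layer (CPL) boundaries in DenseNet.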
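
The fission and fusion steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the "convolutions" are hypothetical single-channel scalings, `EPS` and all function names are invented for this example, and the statistics are accumulated as a running sum and sum of squares, which is one way to obtain both in the producer's sweep. The point is only that the restructured pipeline yields the same values as the unfused CONV, BN, ReLU, CONV sequence while touching the intermediate feature map fewer times.

```python
import math

EPS = 1e-5  # assumed BN epsilon; this review does not specify a value


def conv_with_stats(inputs, weight):
    """Stand-in for CONV1 fused with sub-BN1 (the statistics sub-layer).

    The "convolution" is a hypothetical single-channel scaling. The sum and
    sum of squares are accumulated while each output is produced, so the
    statistics need no extra sweep over the output feature map.
    """
    outputs, total, total_sq = [], 0.0, 0.0
    for x in inputs:
        y = weight * x
        outputs.append(y)
        total += y
        total_sq += y * y
    n = len(outputs)
    mean = total / n
    var = total_sq / n - mean * mean  # biased variance, as BN uses to normalize a mini-batch
    return outputs, mean, var


def norm_relu_conv(features, mean, var, gamma, beta, weight):
    """Stand-in for sub-BN2 + ReLU + CONV2 applied in a single sweep."""
    inv_std = 1.0 / math.sqrt(var + EPS)
    result = []
    for y in features:
        z = gamma * (y - mean) * inv_std + beta  # sub-BN2: normalize, scale, shift
        z = max(z, 0.0)                          # ReLU
        result.append(weight * z)                # CONV2 stand-in
    return result


def reference_pipeline(inputs, w1, gamma, beta, w2):
    """Unfused baseline: CONV1, BN, ReLU and CONV2 as separate sweeps."""
    conv1 = [w1 * x for x in inputs]
    mean = sum(conv1) / len(conv1)
    var = sum((y - mean) ** 2 for y in conv1) / len(conv1)
    bn = [gamma * (y - mean) / math.sqrt(var + EPS) + beta for y in conv1]
    relu = [max(z, 0.0) for z in bn]
    return [w2 * z for z in relu]


batch = [0.5, -1.0, 2.0, 3.5, -0.25, 1.75]
features, mean, var = conv_with_stats(batch, weight=2.0)
fused = norm_relu_conv(features, mean, var, gamma=1.5, beta=0.1, weight=-0.5)
baseline = reference_pipeline(batch, 2.0, 1.5, 0.1, -0.5)
max_diff = max(abs(a - b) for a, b in zip(fused, baseline))
```

In this toy version the baseline makes five separate passes over list-sized data, while the restructured version makes two. Real layers operate on four-dimensional tensors with per-channel statistics, so the sketch shows the data flow only and says nothing about actual speedups.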

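Experimental Results

Research Questions

RQ1: How much memory traffic and bandwidth pressure does BN introduce when training deep CNNs?
RQ2: Can BN layers be restructured through fission and fusion to reduce off-chip memory accesses without affecting accuracy?
RQ3: What performance gains does BNFF deliver on CPU and GPU platforms when applied to DenseNet-121 and ResNet-50?
RQ4: Does mean/variance fusion preserve enough numerical precision to be worth using in BN restructuring?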
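
The precision concern behind RQ4 can be illustrated with a small sketch. It assumes that single-sweep mean/variance computation uses the identity Var[x] = E[x^2] - (E[x])^2, which is the usual way to get both statistics in one pass; this review does not spell out the paper's exact formulation, and the function names and sample values here are hypothetical. The one-pass form matches the two-pass form on well-scaled data but can lose precision when the mean is very large relative to the spread.

```python
def two_pass_variance(values):
    """Baseline: one sweep for the mean, a second sweep for the variance."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n


def one_pass_variance(values):
    """Single sweep: accumulate sum and sum of squares together."""
    total, total_sq = 0.0, 0.0
    for v in values:
        total += v
        total_sq += v * v
    n = len(values)
    mean = total / n
    return total_sq / n - mean * mean


# Well-scaled values: both forms agree to rounding error.
well_scaled = [0.5, -1.0, 2.0, 3.5]
agreement_gap = abs(one_pass_variance(well_scaled) - two_pass_variance(well_scaled))

# Same spread shifted by a huge offset: the squares exceed what a float64
# can represent exactly, so the subtraction cancels away the true variance.
shifted = [1e8, 1e8 + 1.0, 1e8 + 2.0, 1e8 + 3.0]
exact = two_pass_variance(shifted)        # true variance is 1.25
cancelled = one_pass_variance(shifted)    # noticeably off from 1.25
cancellation_error = abs(cancelled - exact)
```

Training frameworks typically use 32-bit floats, where cancellation sets in at much smaller offsets than in this 64-bit example, so whether the one-pass form is safe depends on how the activations are distributed.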

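Key Findings

BNFF significantly reduces memory accesses and speeds up the training of DenseNet-121 and ResNet-50.
On an Intel Skylake CPU, BNFF improves overall training speed by 25.7% for DenseNet-121 and by 16.1% for ResNet-50.
For DenseNet-121, BNFF improves the forward pass by 47.9% and the backward pass by 15.4%.
Mean/Variance Fusion (MVF) and ReLU-CONV Fusion (RCF) provide additional gains on top of BNFF (for example, MVF adds a further 1.7% overall on Skylake).
Inter-Composite-Layer Fusion (ICF) may add about 18% over BNFF on DenseNet by eliminating further BN-related memory accesses at CPL boundaries.
BNFF reduces memory accesses by up to about 19%, while also improving cache behavior and lowering subroutine-call overhead.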


This review was generated by AI and checked by a human editor.