QUICK REVIEW

[Paper Review] Restructuring Batch Normalization to Accelerate CNN Training

Wonkyung Jung, Dae-Jin Jung|arXiv (Cornell University)|Jul 4, 2018

Advanced Neural Network Applications|References: 49|Citations: 52

Quick Summary

This paper proposes BN Fission-n-Fusion (BNFF), which restructures Batch Normalization layers to reduce memory accesses and speed up the training of modern CNNs such as DenseNet-121 and ResNet-50. On a Skylake CPU, it improves training speed by up to 25.7% for DenseNet-121 and by 16.1% for ResNet-50.

ABSTRACT

Batch Normalization (BN) has become a core design block of modern Convolutional Neural Networks (CNNs). A typical modern CNN has a large number of BN layers in its lean and deep architecture. BN requires mean and variance calculations over each mini-batch during training. Therefore, the existing memory access reduction techniques, such as fusing multiple CONV layers, are not effective for accelerating BN due to their inability to optimize mini-batch related calculations during training. To address this increasingly important problem, we propose to restructure BN layers by first splitting a BN layer into two sub-layers (fission) and then combining the first sub-layer with its preceding CONV layer and the second sub-layer with the following activation and CONV layers (fusion). The proposed solution can significantly reduce main-memory accesses while training the latest CNN models, and the experiments on a chip multiprocessor show that the proposed BN restructuring can improve the performance of DenseNet-121 by 25.7%.

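Motivation and Objectives

Highlight the growing importance of non-CONV layers, Batch Normalization in particular, in training modern deep CNNs.
Analyze the memory-bandwidth bottleneck that BN layers create while training deep models such as DenseNet-121.
Develop a restructuring of BN (fission and fusion) that minimizes main-memory accesses.
Demonstrate performance gains for DenseNet-121 and ResNet-50 on both CPU (Skylake) and GPU platforms.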

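Proposed Method

Split each BN layer into two sub-layers (fission).
Fuse the first sub-layer with the preceding CONV layer (CONV1-(sub-BN1)).
Fuse the second sub-layer with the following ReLU and CONV layers (sub-BN2-ReLU-CONV2).
Apply mean/variance fusion (MVF), which merges BN's mean and variance calculations into a single memory sweep.
Optionally extend BNFF with Inter-Composite-Layer Fusion (ICF), which fuses BN layers across composite-layer (CPL) boundaries in DenseNet.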
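
The fission step and MVF can be illustrated with a minimal NumPy sketch. The function names (`sub_bn1_stats`, `sub_bn2_relu`, `reference_bn_relu`), the NCHW layout, and `eps=1e-5` are assumptions made for illustration and are not taken from the paper. NumPy does not fuse memory sweeps, so the sketch only checks that the split computation matches a conventional BN followed by ReLU; the fused CONV kernels themselves are not reproduced here.

```python
import numpy as np

def sub_bn1_stats(x):
    """Hypothetical first sub-layer: per-channel statistics of x (NCHW).

    Uses the MVF arithmetic: the sum and the sum of squares are the only
    quantities needed, so a fused kernel could accumulate both while the
    preceding CONV writes its output. NumPy still sweeps x twice here.
    """
    n = x.shape[0] * x.shape[2] * x.shape[3]
    s = x.sum(axis=(0, 2, 3))
    sq = (x * x).sum(axis=(0, 2, 3))
    mean = s / n
    var = sq / n - mean * mean  # biased variance: E[x^2] - E[x]^2
    return mean, var

def sub_bn2_relu(x, mean, var, gamma, beta, eps=1e-5):
    """Hypothetical second sub-layer: normalize, scale, shift, then ReLU."""
    scale = gamma / np.sqrt(var + eps)
    shift = beta - mean * scale
    y = x * scale[None, :, None, None] + shift[None, :, None, None]
    return np.maximum(y, 0.0)

def reference_bn_relu(x, gamma, beta, eps=1e-5):
    """Conventional training-mode BN followed by ReLU, for comparison."""
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)  # biased (ddof=0)
    g = gamma[None, :, None, None]
    b = beta[None, :, None, None]
    return np.maximum((x - mean) / np.sqrt(var + eps) * g + b, 0.0)

# Synthetic data: batch of 8, 4 channels, 5x5 feature maps.
rng = np.random.default_rng(0)
x = rng.normal(0.5, 1.0, size=(8, 4, 5, 5))
gamma = rng.normal(1.0, 0.1, size=4)
beta = rng.normal(0.0, 0.1, size=4)

mean, var = sub_bn1_stats(x)
split_out = sub_bn2_relu(x, mean, var, gamma, beta)
print(np.allclose(split_out, reference_bn_relu(x, gamma, beta)))
```

Splitting the layer this way separates the part that needs the whole mini-batch (the statistics) from the part that is purely element-wise (normalization and ReLU), which is what makes each half a candidate for fusion with a neighboring layer.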

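Experimental Results

Research Questions

RQ1: How severe are the memory traffic and bandwidth bottleneck that BN creates during deep CNN training?
RQ2: Can BN layers be restructured with fission and fusion to reduce off-chip memory accesses without sacrificing accuracy?
RQ3: How much performance improvement does BNFF deliver on CPU and GPU platforms when applied to DenseNet-121 and ResNet-50?
RQ4: Is the effect of mean/variance fusion on numerical precision small enough to justify the BN restructuring?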
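
RQ4 matters because the single-sweep formula Var[x] = E[x^2] - E[x]^2 subtracts two nearly equal numbers when the mean is large relative to the spread. The sketch below illustrates this in float32 on synthetic data; it is not an experiment from the paper, and the function names and offsets are assumptions chosen for the illustration.

```python
import numpy as np

def var_single_pass(x):
    """Variance via E[x^2] - E[x]^2, the formula that permits one sweep."""
    m = x.mean()
    return (x * x).mean() - m * m

def var_two_pass(x):
    """Variance via the mean first, then the mean squared deviation."""
    m = x.mean()
    d = x - m
    return (d * d).mean()

def abs_errors(offset, n=100_000, seed=0):
    """Absolute error of both float32 formulas against a float64 reference."""
    rng = np.random.default_rng(seed)
    x32 = (offset + rng.standard_normal(n)).astype(np.float32)
    ref = x32.astype(np.float64).var()
    return (abs(float(var_single_pass(x32)) - ref),
            abs(float(var_two_pass(x32)) - ref))

# Compare roughly zero-mean data with data shifted far from zero.
for offset in (0.0, 1e4):
    e_single, e_two = abs_errors(offset)
    print(f"offset={offset:g}  single-pass error={e_single:.3g}  "
          f"two-pass error={e_two:.3g}")
```

When the data are roughly zero-mean, both formulas agree closely with the reference; with a large offset, the single-sweep result degrades while the two-sweep result does not. Whether this is a concern in practice depends on the statistics of the activations being normalized.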

Key Findings

BNFF achieves substantial memory-access reductions and speedups when training DenseNet-121 and ResNet-50.
On an Intel Skylake CPU, BNFF improves overall training speed by 25.7% for DenseNet-121 and by 16.1% for ResNet-50.
For DenseNet-121 with BNFF, the gain reaches 47.9% in the forward pass and 15.4% in the backward pass.
Mean/Variance Fusion (MVF) and ReLU-Convolution Fusion (RCF) provide additional gains on top of BNFF (for example, MVF adds 1.7% overall on Skylake).
Inter-Composite-Layer Fusion (ICF) could provide roughly 18% additional improvement over BNFF by further reducing BN-related memory accesses at CPL boundaries in DenseNet.
BNFF reduces memory accesses by up to about 19%, improves cache behavior, and lowers the overhead of subroutine calls.


This review was generated by AI and checked by a human editor.