QUICK REVIEW

[Paper Review] Micro-Batch Training with Batch-Channel Normalization and Weight Standardization

Siyuan Qiao, Huiyu Wang|arXiv (Cornell University)|Mar 25, 2019

References: 60 | Cited by: 123

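One-Sentence Summary

This paper proposes Weight Standardization (WS) and Batch-Channel Normalization (BCN) to make micro-batch training effective, showing a theoretically smoother loss landscape and empirical gains on vision tasks. WS and BCN aim to reproduce the benefits of BN without requiring large batch sizes.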

ABSTRACT

Batch Normalization (BN) has become an out-of-box technique to improve deep network training. However, its effectiveness is limited for micro-batch training, i.e., each GPU typically has only 1-2 images for training, which is inevitable for many computer vision tasks, e.g., object detection and semantic segmentation, constrained by memory consumption. To address this issue, we propose Weight Standardization (WS) and Batch-Channel Normalization (BCN) to bring two success factors of BN into micro-batch training: 1) the smoothing effects on the loss landscape and 2) the ability to avoid harmful elimination singularities along the training trajectory. WS standardizes the weights in convolutional layers to smooth the loss landscape by reducing the Lipschitz constants of the loss and the gradients; BCN combines batch and channel normalizations and leverages estimated statistics of the activations in convolutional layers to keep networks away from elimination singularities. We validate WS and BCN on comprehensive computer vision tasks, including image classification, object detection, instance segmentation, video recognition and semantic segmentation. All experimental results consistently show that WS and BCN improve micro-batch training significantly. Moreover, using WS and BCN with micro-batch training is even able to match or outperform the performances of BN with large-batch training.

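Motivation and Goals

Motivate the need for normalization techniques that also work well under micro-batch training (1-2 images per GPU).
Extend BN-like benefits (loss-landscape smoothing and avoidance of elimination singularities) to the micro-batch regime.
Propose WS, which standardizes convolutional weights, and BCN, which combines batch statistics with channel statistics, to improve training stability and performance.
Evaluate WS and BCN on a range of computer vision tasks to verify the practical gains.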

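Proposed Method

Propose Weight Standardization (WS): reparameterize the convolutional weights as WS(W), where W is standardized to zero mean and unit variance within each output channel.
Introduce Batch-Channel Normalization (BCN): combine batch statistics with channel statistics to estimate the mean and variance of the activations.
Provide a theoretical analysis showing that WS reduces the Lipschitz constants of the loss and of the gradients, thereby smoothing the loss landscape.
Analyze elimination singularities, showing that BN is distinctive in keeping activations away from such singularities, and argue that WS/BCN extend a similar property to the micro-batch setting.
Compare WS with Weight Normalization (WN) and Centered Weight Normalization (CWN).
Demonstrate across tasks that WS+BCN with micro-batches matches large-batch BN or outperforms GN trained with micro-batches.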
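
The WS reparameterization described above can be sketched in a few lines. This is a minimal NumPy illustration, not the authors' implementation: the function name `weight_standardize`, the `(out_channels, in_channels, kh, kw)` weight layout, and the stabilizing constant `eps` are assumptions made for the example.

```python
import numpy as np

def weight_standardize(weight, eps=1e-5):
    """Standardize conv weights to zero mean and unit variance per output channel.

    weight: array of shape (out_channels, in_channels, kh, kw) (assumed layout).
    eps: small constant added to the standard deviation (an assumption here).
    """
    out_channels = weight.shape[0]
    # Flatten each output channel's filter so statistics cover (in_channels, kh, kw).
    flat = weight.reshape(out_channels, -1)
    mean = flat.mean(axis=1, keepdims=True)
    std = flat.std(axis=1, keepdims=True)
    standardized = (flat - mean) / (std + eps)
    return standardized.reshape(weight.shape)

# Usage: standardize a random 3x3 conv kernel with 8 output and 4 input channels.
rng = np.random.default_rng(0)
w = rng.normal(loc=0.3, scale=2.0, size=(8, 4, 3, 3))
w_hat = weight_standardize(w)
```

In a training loop the convolution would use `w_hat` in the forward pass while the optimizer keeps updating the raw `w`, so the standardization acts as a reparameterization rather than a one-off preprocessing step.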

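Experimental Results

Research Questions

RQ1: Can WS and BCN reproduce the benefits of BN (loss-landscape smoothing and avoidance of elimination singularities) in micro-batch training?
RQ2: At small batch sizes, do WS and BCN improve training speed and final accuracy?
RQ3: How do WS and BCN compare with existing normalization methods (GN/LN) and with large-batch BN?
RQ4: What is the theoretical effect of WS on Lipschitz constants and elimination singularities?
RQ5: Are WS and BCN effective in common CNN architectures when used alongside standard normalization layers?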

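Key Findings

WS reduces the Lipschitz constants of the loss and of the gradients, smoothing the optimization landscape.
WS and BCN help push the network away from elimination singularities, improving training stability.
GN+WS with micro-batch training can match or exceed large-batch BN on selected tasks.
BCN provides additional performance gains over GN or BN in both large-batch and micro-batch settings.
The empirical evaluation covers image classification, object detection, instance segmentation, video recognition, and semantic segmentation, with consistent gains when WS and BCN are used.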
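
To make the BCN idea behind these findings concrete, here is a rough NumPy sketch of a layer that normalizes with estimated (running) batch statistics and then applies a group-style channel normalization. It is a simplified illustration under stated assumptions, not the paper's exact algorithm: the class name `BatchChannelNormSketch`, the moving-average update with `rate`, the `num_groups` grouping, and `eps` are all hypothetical choices, and the learnable scale and shift parameters are omitted.

```python
import numpy as np

def channel_norm(x, num_groups, eps=1e-5):
    """Normalize each sample over groups of channels (group-norm style)."""
    n, c, h, w = x.shape
    g = x.reshape(n, num_groups, -1)
    mean = g.mean(axis=2, keepdims=True)
    var = g.var(axis=2, keepdims=True)
    return ((g - mean) / np.sqrt(var + eps)).reshape(n, c, h, w)

class BatchChannelNormSketch:
    """Hypothetical sketch: estimated batch statistics followed by channel norm."""

    def __init__(self, num_channels, num_groups, rate=0.1, eps=1e-5):
        # Running per-channel estimates, used because a micro-batch of 1-2
        # images gives noisy batch statistics on its own.
        self.mu = np.zeros((1, num_channels, 1, 1))
        self.var = np.ones((1, num_channels, 1, 1))
        self.num_groups = num_groups
        self.rate = rate  # moving-average update rate (assumed value)
        self.eps = eps

    def __call__(self, x):
        # Per-channel statistics of the current micro-batch.
        batch_mu = x.mean(axis=(0, 2, 3), keepdims=True)
        batch_var = x.var(axis=(0, 2, 3), keepdims=True)
        # Move the running estimates toward the current statistics.
        self.mu += self.rate * (batch_mu - self.mu)
        self.var += self.rate * (batch_var - self.var)
        # Batch part: normalize with the estimates, not the noisy batch values.
        x_b = (x - self.mu) / np.sqrt(self.var + self.eps)
        # Channel part: per-sample normalization over channel groups.
        return channel_norm(x_b, self.num_groups, self.eps)

# Usage: feed micro-batches of 2 images with 8 channels.
rng = np.random.default_rng(0)
bcn = BatchChannelNormSketch(num_channels=8, num_groups=4)
for _ in range(50):
    x = rng.normal(loc=3.0, scale=2.0, size=(2, 8, 5, 5))
    y = bcn(x)
```

The point of the sketch is the ordering: the batch-side statistics are accumulated across many micro-batches, so they stay usable even with 1-2 images per step, while the channel-side normalization needs only the current sample.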

This review was generated by AI and checked by human editors.