QUICK REVIEW

[Paper Review] Training BatchNorm and Only BatchNorm: On the Expressive Power of Random Features in CNNs

Jonathan Frankle, David J. Schwab|arXiv (Cornell University)|Feb 29, 2020

Domain Adaptation and Few-Shot Learning | References: 36 | Cited by: 79

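ONE-SENTENCE SUMMARY

The paper shows that training only the BatchNorm affine parameters (gamma and beta), with all other weights kept frozen, still reaches surprisingly high accuracy, which demonstrates the strong expressive power of per-feature affine transformations applied to the random features of deep CNNs.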

ABSTRACT

A wide variety of deep learning techniques from style transfer to multitask learning rely on training affine transformations of features. Most prominent among these is the popular feature normalization technique BatchNorm, which normalizes activations and then subsequently applies a learned affine transform. In this paper, we aim to understand the role and expressive power of affine parameters used to transform features in this way. To isolate the contribution of these parameters from that of the learned features they transform, we investigate the performance achieved when training only these parameters in BatchNorm and freezing all weights at their random initializations. Doing so leads to surprisingly high performance considering the significant limitations that this style of training imposes. For example, sufficiently deep ResNets reach 82% (CIFAR-10) and 32% (ImageNet, top-5) accuracy in this configuration, far higher than when training an equivalent number of randomly chosen parameters elsewhere in the network. BatchNorm achieves this performance in part by naturally learning to disable around a third of the random features. Not only do these results highlight the expressive power of affine parameters in deep learning, but - in a broader sense - they characterize the expressive power of neural networks constructed simply by shifting and rescaling random features.
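
For reference, the "normalize, then apply a learned affine transform" step described in the abstract is conventionally written as below. This is the standard BatchNorm formulation rather than notation taken from this review: the symbols are assumptions, with \mu_c and \sigma_c^2 the mini-batch mean and variance of feature (channel) c, and \epsilon a small constant for numerical stability.

```latex
\hat{x}_c = \frac{x_c - \mu_c}{\sqrt{\sigma_c^2 + \epsilon}},
\qquad
y_c = \gamma_c \, \hat{x}_c + \beta_c
```

In the setting studied here, only \gamma_c and \beta_c are updated during training. Setting \gamma_c close to zero makes y_c approximately the constant \beta_c, whatever the input.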

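MOTIVATION AND OBJECTIVES

Assess the expressive power of the BatchNorm affine parameters (gamma and beta) when all other network weights are frozen at initialization
Quantify how BatchNorm-only training performs on CIFAR-10 and ImageNet compared with fully trained networks
Examine how depth and width affect performance when only BatchNorm is trainable
Analyze how the gamma and beta values evolve and how they contribute to feature pruning and sparsity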

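PROPOSED METHOD

Freeze all network weights at their random initialization except the BatchNorm affine parameters (gamma and beta), and train only those affine parameters
Evaluate ResNets of varying depth and width on CIFAR-10 and ImageNet
Compare performance against fully trained networks and against training an equal number of randomly chosen parameters
Analyze the distribution of the learned gamma/beta values and their effect on feature sparsity and activation sparsity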
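
The freezing step above can be sketched with plain bookkeeping. This is a minimal illustration, not the authors' code: the `Param` record, the `.bn.gamma` / `.bn.beta` naming scheme, and the toy four-block layout are all hypothetical, chosen only to show how few parameters remain trainable.

```python
from dataclasses import dataclass


@dataclass
class Param:
    """Hypothetical record for one parameter tensor (name and element count)."""
    name: str
    size: int
    trainable: bool = True


def conv_bn_block(prefix, c_in, c_out, k=3):
    """Parameters of one conv + BatchNorm block (assumed layout, no conv bias)."""
    return [
        Param(f"{prefix}.conv.weight", c_in * c_out * k * k),
        Param(f"{prefix}.bn.gamma", c_out),  # per-feature scale
        Param(f"{prefix}.bn.beta", c_out),   # per-feature shift
    ]


def freeze_all_but_batchnorm(params):
    """Mark only BatchNorm gamma/beta as trainable; everything else is frozen."""
    for p in params:
        p.trainable = p.name.endswith((".bn.gamma", ".bn.beta"))
    return params


# Toy network: four 16-channel blocks plus a 10-way output layer.
params = []
for i in range(4):
    params += conv_bn_block(f"block{i}", 16, 16)
params += [Param("fc.weight", 16 * 10), Param("fc.bias", 10)]

freeze_all_but_batchnorm(params)
trainable = sum(p.size for p in params if p.trainable)
total = sum(p.size for p in params)
print(f"trainable: {trainable} of {total} parameters")
```

In a real framework the same selection would be applied to the optimizer's parameter list, so that the frozen weights keep their random initial values throughout training.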

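RESULTS

Research Questions

RQ1: How much expressive power do the per-feature BatchNorm parameters have when they alone are trained on top of random features?
RQ2: What accuracy can deep CNNs reach on CIFAR-10 and ImageNet when only gamma and beta are trained?
RQ3: How do network depth and width affect performance under this restricted training scheme?
RQ4: Do gamma and beta learn to disable a subset of features, and how does that affect the activations?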

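Key Findings

Training only gamma and beta achieves high accuracy relative to training a random subset of parameters (for example, up to 82% on CIFAR-10 with deep networks and up to 32% top-5 on ImageNet).
An equal number of randomly chosen parameters performs far worse than the BatchNorm affine parameters, highlighting the per-feature expressive power of gamma and beta.
Under BatchNorm-only training, gamma learns to suppress roughly a quarter to a third of channels (values close to zero), indicating per-feature sparsity.
Deeper and wider networks improve BatchNorm-only accuracy, and for a given BatchNorm parameter budget, depth contributes more than width.
Accuracy improves further when the output layer is trained together with BatchNorm, indicating that the affine parameters are key but not sufficient on their own to reach state-of-the-art performance.
Under BatchNorm-only training, activations become sparse: a substantial fraction of features is effectively disabled because gamma drives them close to zero.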
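
The "disabled feature" finding can be made concrete with a short sketch. It assumes a feature counts as disabled when |gamma| falls below a cutoff. The 0.01 threshold, the function names, and the sample gamma values are illustrative assumptions, not figures from the paper.

```python
def batchnorm_feature(values, gamma, beta, eps=1e-5):
    """Apply BatchNorm to one feature across a mini-batch (pure-Python sketch)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return [gamma * (v - mean) / (var + eps) ** 0.5 + beta for v in values]


def disabled_fraction(gammas, threshold=0.01):
    """Fraction of features whose scale is close to zero (threshold is assumed)."""
    return sum(1 for g in gammas if abs(g) < threshold) / len(gammas)


# With gamma = 0 the output no longer depends on the input: it is just beta.
constant_out = batchnorm_feature([1.0, 2.0, 3.0], gamma=0.0, beta=0.25)

# Made-up learned scales for six features; three fall below the cutoff.
learned_gammas = [0.9, 0.003, -0.5, 0.0, 1.2, -0.004]
fraction = disabled_fraction(learned_gammas)
print(constant_out, fraction)
```

A feature whose gamma is near zero contributes the same value for every input, so it carries no information to later layers, which is the sense in which it is "disabled".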


This review was generated by AI and reviewed by a human editor.