QUICK REVIEW

[Paper Digest] How to Initialize your Network? Robust Initialization for WeightNorm & ResNets

Devansh Arpit, Víctor Campos|arXiv (Cornell University)|Jun 5, 2019

Domain Adaptation and Few-Shot Learning|References 39|Cited by 18

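One-Sentence Summary

This paper proposes a theory-driven initialization strategy for weight-normalized ReLU networks, with and without residual connections, that uses a mean field approximation to prevent exploding/vanishing gradients. Experiments show that it enables stable training of deep networks, narrows the generalization gap with batch normalization, and supports larger learning rates by initializing in a low-curvature region of the loss surface.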

ABSTRACT

Residual networks (ResNet) and weight normalization play an important role in various deep learning applications. However, parameter initialization strategies have not been studied previously for weight normalized networks and, in practice, initialization methods designed for un-normalized networks are used as a proxy. Similarly, initialization for ResNets have also been studied for un-normalized networks and often under simplified settings ignoring the shortcut connection. To address these issues, we propose a novel parameter initialization strategy that avoids explosion/vanishment of information across layers for weight normalized networks with and without residual connections. The proposed strategy is based on a theoretical analysis using mean field approximation. We run over 2,500 experiments and evaluate our proposal on image datasets showing that the proposed initialization outperforms existing initialization methods in terms of generalization performance, robustness to hyper-parameter values and variance between seeds, especially when networks get deeper in which case existing methods fail to even start training. Finally, we show that using our initialization in conjunction with learning rate warmup is able to reduce the gap between the performance of weight normalized and batch normalized networks.

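Motivation and Objectives

Address the lack of a principled initialization strategy for deep weight-normalized networks.
Develop a theory-driven initialization that prevents information-flow problems (such as exploding/vanishing gradients) in the forward and backward passes.
Improve the training stability and generalization performance of deep weight-normalized networks.
Narrow the performance gap between weight-normalized and batch-normalized networks.
Validate the method on the CIFAR datasets with more than 2,500 experiments across different depths and hyperparameter settings.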

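Proposed Method

Derive a new initialization strategy for weight-normalized ReLU networks using a mean field approximation.
Reparameterize the weights with a scale factor (g) and a unit-norm direction matrix (Ŵ), decoupling magnitude from direction.
Establish theoretical conditions under which the norm of hidden-layer activations stays stable across layers at initialization.
Propose a depth-dependent initialization scaling that keeps norms consistent in both feedforward and residual architectures.
Use the power method to compute the spectral norm of the Hessian at initialization in order to analyze curvature.
Combine the proposed initialization with learning rate warmup to further improve performance.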
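
To make the reparameterization and the norm-stability condition concrete, the following is a minimal NumPy sketch, not the authors' implementation. It builds each layer as W = g * Ŵ with unit-norm rows and tracks how the activation norm changes through a stack of ReLU layers. The function names, the Gaussian choice for the direction parameters, the layer width and depth, and the gain sqrt(2 * fan_in / fan_out) are all assumptions made for illustration; this digest does not state the exact formula the paper derives.

```python
import numpy as np

def weight_norm_layer(n_out, n_in, gain, rng):
    """Build W = g * W_hat, where every row of W_hat has unit norm."""
    v = rng.standard_normal((n_out, n_in))           # unconstrained direction parameters (assumed Gaussian)
    w_hat = v / np.linalg.norm(v, axis=1, keepdims=True)
    g = np.full((n_out, 1), gain)                    # one scale per output unit
    return g * w_hat

def activation_norm_ratio(depth, width, gain_fn, seed=0):
    """Push one random input through `depth` ReLU layers; return ||h_L|| / ||h_0||."""
    rng = np.random.default_rng(seed)
    h = rng.standard_normal(width)
    start = np.linalg.norm(h)
    for _ in range(depth):
        w = weight_norm_layer(width, width, gain_fn(width, width), rng)
        h = np.maximum(w @ h, 0.0)                   # ReLU
    return np.linalg.norm(h) / start

# Assumed gain for illustration: sqrt(2 * fan_in / fan_out).
scaled = activation_norm_ratio(20, 1024, lambda n_in, n_out: np.sqrt(2.0 * n_in / n_out))
# Naive alternative: leave the scale at g = 1.
unit = activation_norm_ratio(20, 1024, lambda n_in, n_out: 1.0)
print(f"scaled gain: {scaled:.3f}, unit gain: {unit:.2e}")
```

With the assumed gain the ratio should stay of order one, while with g = 1 the activation norm shrinks by roughly a factor of sqrt(2) per layer, which is the kind of vanishing signal a depth-aware initialization is meant to avoid.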

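Experimental Results

Research Questions

RQ1: How can a theoretically sound initialization be designed for weight-normalized ReLU networks to prevent exploding or vanishing gradients?
RQ2: Compared with existing proxy methods, does the proposed initialization improve the training stability and generalization performance of deep networks?
RQ3: Can the proposed initialization narrow the generalization gap between weight-normalized and batch-normalized networks?
RQ4: Why does the proposed initialization allow larger learning rates than standard initialization schemes?
RQ5: Is the proposed initialization robust to changes in network depth, hyperparameter choices, and random seeds?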

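Key Findings

On CIFAR-10, combined with learning rate warmup, the method reduces the test error of ResNet-56 to 7.20% and of ResNet-110 to 6.69%, matching or exceeding batch normalization.
On CIFAR-100, combined with Cutout and warmup, the method reduces the error to 25.31% (ResNet-164), close to the 25.52% error of batch normalization.
The log spectral norm of the Hessian at initialization is 1.31 (CIFAR-10) and 1.56 (CIFAR-100) for this method, significantly lower than for other methods, indicating lower curvature.
The method enables stable training of very deep networks where existing initialization schemes fail to even start training.
Compared with standard baselines, the method shows markedly lower performance variance across random seeds.
The method substantially narrows the generalization gap between weight-normalized and batch-normalized networks, especially when combined with learning rate warmup.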
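
The Hessian spectral norm referred to in this digest is computed with the power method. As a rough illustration of that technique only, the sketch below runs power iteration using Hessian-vector products approximated by central differences of the gradient. The helper names, the finite-difference step, and the toy quadratic loss are assumptions for this example; how the paper forms Hessian-vector products for a real network is not described here.

```python
import numpy as np

def hessian_vector_product(grad_fn, x, v, eps=1e-3):
    """Central-difference approximation of H(x) @ v that needs only gradients."""
    return (grad_fn(x + eps * v) - grad_fn(x - eps * v)) / (2.0 * eps)

def hessian_spectral_norm(grad_fn, x, iters=200, seed=0):
    """Power iteration: estimate the largest-magnitude eigenvalue of the Hessian at x."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(x.shape)
    v /= np.linalg.norm(v)
    sigma = 0.0
    for _ in range(iters):
        hv = hessian_vector_product(grad_fn, x, v)
        sigma = np.linalg.norm(hv)       # current estimate of the spectral norm
        if sigma == 0.0:                 # v lies in the null space; stop early
            break
        v = hv / sigma                   # renormalize for the next step
    return sigma

# Toy loss f(x) = 0.5 * x^T A x, so the gradient is A x and the Hessian is A.
A = np.diag([1.0, -3.0, 2.0])
grad_fn = lambda x: A @ x
sigma = hessian_spectral_norm(grad_fn, np.zeros(3))
print(sigma)
```

For a symmetric Hessian the spectral norm equals the largest absolute eigenvalue, so the toy example should recover 3 even though the corresponding eigenvalue is negative. In a deep learning framework the finite-difference step would normally be replaced by an automatic-differentiation Hessian-vector product.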


This digest was generated by AI and reviewed by a human editor.