[Paper Summary] Mean Field Residual Networks: On the Edge of Chaos
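This paper uses mean field theory to analyze randomly initialized residual networks and shows that skip connections keep them at the edge of chaos, with forward and backward propagation dynamics that are subexponential (in many cases polynomial). Its core contribution is a framework, combining theory and experiments, that predicts network performance from initialization hyperparameters. It also shows that the optimal initialization variances depend on network depth, so standard schemes such as Xavier or He are not optimal for residual networks.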
We study randomly initialized residual networks using mean field theory and the theory of difference equations. Classical feedforward neural networks, such as those with tanh activations, exhibit exponential behavior on the average when propagating inputs forward or gradients backward. The exponential forward dynamics causes rapid collapsing of the input space geometry, while the exponential backward dynamics causes drastic vanishing or exploding gradients. We show, in contrast, that by adding skip connections, the network will, depending on the nonlinearity, adopt subexponential forward and backward dynamics, and in many cases in fact polynomial. The exponents of these polynomials are obtained through analytic methods and proved and verified empirically to be correct. In terms of the "edge of chaos" hypothesis, these subexponential and polynomial laws allow residual networks to "hover over the boundary between stability and chaos," thus preserving the geometry of the input space and the gradient information flow. In our experiments, for each activation function we study here, we initialize residual networks with different hyperparameters and train them on MNIST. Remarkably, our initialization time theory can accurately predict test time performance of these networks, by tracking either the expected amount of gradient explosion or the expected squared distance between the images of two input vectors. Importantly, we show, theoretically as well as empirically, that common initializations such as the Xavier or the He schemes are not optimal for residual networks, because the optimal initialization variances depend on the depth. Finally, we have made mathematical contributions by deriving several new identities for the kernels of powers of ReLU functions by relating them to the zeroth Bessel function of the second kind.
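Motivation and Objectives
- Understand the dynamics of randomly initialized residual networks through mean field theory.
- Characterize how skip connections change forward and backward propagation dynamics relative to plain feedforward networks.
- Identify the optimal initialization variances for residual networks, which depend on depth and on the nonlinearity.
- Establish a predictive link between initialization hyperparameters and test-time performance.
- Derive new mathematical identities involving Bessel functions for ReLU-type nonlinearities.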
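Proposed Method
- Apply mean field theory to analyze how the cosine distance between input vectors evolves from layer to layer.
- Model the dynamics of activations and gradients using difference equations and fixed-point analysis.
- Derive asymptotic expressions for the growth of gradient variance and input distance as functions of network depth and nonlinearity.
- Propose a framework that predicts test-time performance from quantities measured at initialization, such as gradient explosion or input distance.
- Use integral representations and Bessel functions to analyze α-ReLU nonlinearities (powers of ReLU).
- Validate the theoretical predictions with experiments on MNIST across several activation functions and hyperparameter settings.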
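The forward-dynamics analysis described in the list above can be illustrated with a small numerical experiment. The following is a minimal sketch, not the authors' code: it assumes a simplified tanh residual block of the form x_l = x_{l-1} + tanh(W x_{l-1} + b) with Gaussian weights, and the function name `track_cosine`, the width, and the values of `sigma_w` and `sigma_b` are all hypothetical choices made for illustration.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def track_cosine(depth, width, residual, sigma_w=1.0, sigma_b=0.3, seed=0):
    """Push two random inputs through one random tanh network.

    Returns the cosine similarity between the two hidden states after
    each layer (index 0 is the input). All hyperparameter defaults are
    illustrative assumptions, not values taken from the paper.
    """
    rng = np.random.default_rng(seed)
    x1 = rng.standard_normal(width)
    x2 = rng.standard_normal(width)
    history = [cosine(x1, x2)]
    for _ in range(depth):
        # Fresh Gaussian weights and biases for every layer.
        W = rng.standard_normal((width, width)) * sigma_w / np.sqrt(width)
        b = rng.standard_normal(width) * sigma_b
        f1 = np.tanh(W @ x1 + b)
        f2 = np.tanh(W @ x2 + b)
        if residual:
            # Skip connection: add the block output to its input.
            x1, x2 = x1 + f1, x2 + f2
        else:
            # Plain feedforward layer: replace the state.
            x1, x2 = f1, f2
        history.append(cosine(x1, x2))
    return history


if __name__ == "__main__":
    plain = track_cosine(depth=30, width=200, residual=False)
    res = track_cosine(depth=30, width=200, residual=True)
    for layer in (0, 5, 10, 20, 30):
        print(layer, round(plain[layer], 4), round(res[layer], 4))
```

Comparing the two printed columns gives a rough picture of how quickly each architecture collapses the angle between two inputs. It is a finite-width simulation of a single random draw, so it only approximates the mean field quantities the paper analyzes.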
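Experimental Results
Research Questions
- RQ1: How do skip connections in residual networks change forward and backward propagation dynamics compared with plain feedforward networks?
- RQ2: What is the asymptotic convergence rate of the cosine distance between input vectors in residual networks?
- RQ3: Why can residual networks still achieve better generalization under random initialization?
- RQ4: How do the optimal initialization variances of residual networks depend on depth and on the nonlinearity?
- RQ5: Can the performance of a trained network be predicted from properties computed at initialization?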
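Key Findings
- In residual networks, the cosine distance between input vectors converges polynomially rather than exponentially, which places these networks at the edge of chaos.
- For α-ReLU with α < 1, the gradient variance grows only polynomially with depth, avoiding exponential explosion.
- Theoretical predictions of gradient explosion and input distance at initialization accurately predict test-time performance across architectures and hyperparameters.
- The optimal initialization variances of residual networks depend on depth and on the nonlinearity, which is why the Xavier and He schemes are not optimal for them.
- The paper derives new identities that relate the kernels of powers of ReLU to the zeroth Bessel function of the second kind.
- Empirically, trainability (gradient explosion) dominates performance for tanh residual networks, whereas expressivity (input distance) is the dominant factor for (α-)ReLU networks.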
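The contrast between vanishing or exploding gradients in plain networks and slower gradient growth in residual networks can also be probed numerically. The sketch below is an illustration under stated assumptions, not the paper's experimental setup: it uses the same simplified block x_l = x_{l-1} + tanh(W x_{l-1} + b), backpropagates a random output gradient by hand, and reports the squared gradient norm relative to its value at the output. The name `gradient_growth` and all hyperparameter defaults are hypothetical.

```python
import numpy as np


def gradient_growth(depth, width, residual, sigma_w=1.0, sigma_b=0.3, seed=0):
    """Backpropagate a random gradient through one random tanh network.

    Returns a list whose k-th entry is the squared norm of the gradient
    k layers below the output, divided by the squared norm at the output.
    Hyperparameter defaults are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(width)
    layers = []  # per layer: (weight matrix, tanh derivative at the preactivation)
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * sigma_w / np.sqrt(width)
        b = rng.standard_normal(width) * sigma_b
        a = np.tanh(W @ x + b)
        layers.append((W, 1.0 - a**2))
        x = x + a if residual else a

    g = rng.standard_normal(width)  # stand-in for the loss gradient at the output
    g0 = float(g @ g)
    ratios = [1.0]
    for W, d in reversed(layers):
        back = W.T @ (d * g)  # gradient through the tanh branch
        # The skip connection contributes an identity term to the Jacobian.
        g = g + back if residual else back
        ratios.append(float(g @ g) / g0)
    return ratios


if __name__ == "__main__":
    plain = gradient_growth(depth=30, width=200, residual=False)
    res = gradient_growth(depth=30, width=200, residual=True)
    print("plain network, ratio at input:   ", plain[-1])
    print("residual network, ratio at input:", res[-1])
```

With these particular settings the plain network is expected to show a shrinking gradient norm, while the residual network's gradient norm grows with depth. Other choices of `sigma_w` and `sigma_b` can move the plain network into the exploding regime instead.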
This summary was generated by AI and reviewed by a human editor.