QUICK REVIEW

[Paper Review] Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice

Jeffrey Pennington, Samuel S. Schoenholz|arXiv (Cornell University)|Nov 13, 2017
Model Reduction and Neural Networks | References: 10 | Cited by: 69
One-Sentence Summary

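This paper analyzes the full singular value distribution of the input-output Jacobian in deep networks, showing that ReLU networks cannot maintain dynamical isometry whereas orthogonally initialized sigmoidal networks can, which in practice markedly speeds up learning and improves generalization.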

ABSTRACT

It is well known that the initialization of weights in deep neural networks can have a dramatic impact on learning speed. For example, ensuring the mean squared singular value of a network's input-output Jacobian is $O(1)$ is essential for avoiding the exponential vanishing or explosion of gradients. The stronger condition that all singular values of the Jacobian concentrate near $1$ is a property known as dynamical isometry. For deep linear networks, dynamical isometry can be achieved through orthogonal weight initialization and has been shown to dramatically speed up learning; however, it has remained unclear how to extend these results to the nonlinear setting. We address this question by employing powerful tools from free probability theory to compute analytically the entire singular value distribution of a deep network's input-output Jacobian. We explore the dependence of the singular value distribution on the depth of the network, the weight initialization, and the choice of nonlinearity. Intriguingly, we find that ReLU networks are incapable of dynamical isometry. On the other hand, sigmoidal networks can achieve isometry, but only with orthogonal weight initialization. Moreover, we demonstrate empirically that deep nonlinear networks achieving dynamical isometry learn orders of magnitude faster than networks that do not. Indeed, we show that properly-initialized deep sigmoidal networks consistently outperform deep ReLU networks. Overall, our analysis reveals that controlling the entire distribution of Jacobian singular values is an important design consideration in deep learning.

Motivation and Objectives

  • Understand how the entire singular value distribution of the input-output Jacobian depends on depth, weight initialization, and nonlinearity.
  • Identify which combinations of initialization and nonlinearity can achieve dynamical isometry (all singular values near 1).
  • Quantify how dynamical isometry correlates with learning speed and generalization in deep nonlinear networks.
  • Provide practical guidance for network design and initialization to improve training efficiency.
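For reference, the quantities named in these objectives can be written out as below. The notation is an assumption, since this summary does not fix one: a fully connected network of depth $L$ and width $N$ with pre-activations $h^l = W^l x^{l-1} + b^l$ and activations $x^l = \phi(h^l)$.

$$
J = \frac{\partial x^{L}}{\partial x^{0}} = D^{L} W^{L} \, D^{L-1} W^{L-1} \cdots D^{1} W^{1},
\qquad D^{l}_{ij} = \phi'(h^{l}_{i}) \, \delta_{ij}
$$

$$
\text{stable gradients:}\quad \frac{1}{N} \operatorname{tr}\!\left(J J^{T}\right) = O(1),
\qquad
\text{dynamical isometry:}\quad s_i(J) \approx 1 \ \text{ for all } i = 1, \dots, N
$$

Here $s_i(J)$ are the singular values of $J$. The first condition constrains only their mean square, while the second constrains the whole distribution, which is why it is the stronger requirement.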

Proposed Method

  • Compute the full singular value density of the input-output Jacobian J in the large-width limit using free probability and S-transform techniques.
  • Derive expressions for the S-transforms of WW^T and D^2 for various nonlinearity shapes and weight ensembles (Gaussian and orthogonal).
  • Analyze linear, ReLU, and hard-tanh networks to compare dynamical isometry prospects.
  • Relate the spectrum of J J^T to training dynamics via metrics such as the maximum eigenvalue and the variance of the eigenvalue distribution.
  • Validate theoretical predictions with numerical simulations and CIFAR-10 experiments to assess learning speed under different initializations.
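The numerical side of these steps can be illustrated with a small simulation. This is a minimal sketch, not the authors' code: it assumes a bias-free fully connected network of equal width at every layer, a random Gaussian input, and hypothetical helper names (`random_weights`, `jacobian_singular_values`). It builds the Jacobian layer by layer via the chain rule and returns its singular values, rather than using the free-probability calculation described above.

```python
import numpy as np

def random_weights(n, kind, gain, rng):
    """Return an n x n weight matrix (hypothetical helper, biases omitted)."""
    if kind == "orthogonal":
        # QR of a Gaussian matrix; the sign fix makes Q Haar-distributed.
        q, r = np.linalg.qr(rng.standard_normal((n, n)))
        q = q * np.sign(np.diag(r))
        return gain * q
    if kind == "gaussian":
        # i.i.d. entries with variance gain**2 / n.
        return gain * rng.standard_normal((n, n)) / np.sqrt(n)
    raise ValueError(f"unknown weight ensemble: {kind}")

# Each entry is (activation, derivative of the activation).
ACTIVATIONS = {
    "linear": (lambda h: h, lambda h: np.ones_like(h)),
    "tanh": (np.tanh, lambda h: 1.0 - np.tanh(h) ** 2),
    "relu": (lambda h: np.maximum(h, 0.0), lambda h: (h > 0).astype(h.dtype)),
}

def jacobian_singular_values(depth, width, kind, act, gain=1.0, seed=0):
    """Singular values of the input-output Jacobian at a random input."""
    rng = np.random.default_rng(seed)
    phi, dphi = ACTIVATIONS[act]
    x = rng.standard_normal(width)
    jac = np.eye(width)
    for _ in range(depth):
        w = random_weights(width, kind, gain, rng)
        h = w @ x
        x = phi(h)
        # Chain rule: J <- D W J, with D = diag(phi'(h)).
        jac = (dphi(h)[:, None] * w) @ jac
    return np.linalg.svd(jac, compute_uv=False)

if __name__ == "__main__":
    for kind, act in [("orthogonal", "linear"), ("gaussian", "linear"),
                      ("orthogonal", "tanh"), ("orthogonal", "relu")]:
        s = jacobian_singular_values(depth=32, width=128, kind=kind, act=act)
        print(f"{kind:10s} {act:6s} min={s.min():.3e} max={s.max():.3e}")
```

The orthogonal linear case is the one exact sanity check: a product of orthogonal matrices is orthogonal, so every singular value equals 1. For the nonlinear cases the spread depends on `gain`, `depth`, and `width`, and a faithful comparison would also need the gain tuned to criticality, which this sketch does not do.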

Experimental Results

Research Questions

  • RQ1: How does the entire distribution of Jacobian singular values depend on network depth, weight statistics, and nonlinearity?
  • RQ2: What weight initialization and nonlinearity combinations can achieve dynamical isometry (all singular values near 1)?
  • RQ3: Do nonlinear networks that achieve dynamical isometry learn faster than those that do not, and how does this depend on architecture and optimization?

Key Findings

  • ReLU networks cannot achieve dynamical isometry; their Jacobian spectra remain ill-conditioned at depth.
  • Orthogonal sigmoidal networks can achieve dynamical isometry, with the max singular value staying O(1) as depth grows, unlike Gaussian or ReLU cases.
  • Gaussian initializations fail to maintain dynamical isometry even at criticality, as the max eigenvalue and spectral variance grow with depth.
  • For orthogonal hard-tanh networks, dynamical isometry can be approached by lowering q* (the pre-activation variance at the fixed point), which increases the fraction p(q*) of units operating in the linear regime.
  • Empirical results show orthogonal tanh networks train orders of magnitude faster than ReLU networks on CIFAR-10, with learning time scaling sublinearly with depth (approximately O(sqrt(L))).
  • Dynamical isometry at initialization can persist for a substantial portion of training, and some nonzero initial q* may optimize both learning speed and generalization.
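The hard-tanh finding can be made concrete with a short calculation. The sketch below assumes pre-activations are Gaussian with variance q*, so that p(q*) is the probability that |h| < 1, and it assumes the mean-field criticality condition takes the form sigma_w^2 * p(q*) = 1 for hard-tanh. The function names are hypothetical.

```python
import math

def linear_fraction(q_star):
    """p(q*): probability that |h| < 1 when h ~ N(0, q*)."""
    return math.erf(1.0 / math.sqrt(2.0 * q_star))

def critical_weight_variance(q_star):
    """sigma_w^2 solving sigma_w^2 * p(q*) = 1 (assumed criticality condition)."""
    return 1.0 / linear_fraction(q_star)

# Smaller q* keeps more units inside the linear region of hard-tanh.
for q in (4.0, 1.0, 0.25, 0.04):
    p = linear_fraction(q)
    print(f"q*={q:5.2f}  p(q*)={p:.4f}  sigma_w^2={critical_weight_variance(q):.4f}")
```

As q* shrinks, p(q*) approaches 1 and the critical weight variance approaches 1, so the network behaves more and more like an orthogonal linear network at initialization. This is consistent with the remark that some nonzero q* may still be preferable in practice.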

This review was generated by AI and reviewed by human editors.