QUICK REVIEW

[Paper Review] On weight initialization in deep neural networks

Siddharth Krishna Kumar | arXiv (Cornell University) | Apr 28, 2017
Adversarial Robustness in Machine Learning | References: 2 | Citations: 157
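One-line summary

This paper develops a theory of weight initialization for non-linear activations: it derives a general strategy for activations differentiable at 0, provides a proof of the He initialization for RELU, and explains why Xavier initialization fails with RELU.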

ABSTRACT

A proper initialization of the weights in a neural network is critical to its convergence. Current insights into weight initialization come primarily from linear activation functions. In this paper, I develop a theory for weight initializations with non-linear activations. First, I derive a general weight initialization strategy for any neural network using activation functions differentiable at 0. Next, I derive the weight initialization strategy for the Rectified Linear Unit (RELU), and provide theoretical insights into why the Xavier initialization is a poor choice with RELU activations. My analysis provides a clear demonstration of the role of non-linearities in determining the proper weight initializations.

Motivation and objectives

  • Generalize Xavier-style variance propagation to activations differentiable at 0.
  • Derive a weight initialization strategy for differentiable activations.
  • Provide a rigorous proof of He initialization for RELU.
  • Explain why Xavier initialization fails with RELU.
  • Discuss activation distribution effects on forward-pass dynamics.

Proposed method

  • Model the forward pass of a deep network with Gaussian-initialized weights and i.i.d. inputs.
  • Use a Taylor expansion around 0 for activations differentiable at 0 to relate layer variances.
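  • Derive the variance recursion s_{m+1}^2 ≈ (g'(0))^2 N v^2 (s_m^2 + μ_m^2), where s_m^2 and μ_m are the variance and mean of the layer-m activations, N is the number of inputs per layer, and v^2 is the weight variance.
  • Hold the variance at s_m^2 = 1 with μ_m ≈ g(0) to obtain v^2 = 1 / (N (g'(0))^2 (1+g(0)^2)) for activations differentiable at 0.
  • Specialize to tanh and sigmoid: v^2 ≈ 1/N for tanh (the Xavier result) and v^2 ≈ 12.8/N for sigmoid, i.e. v ≈ 3.6/√N.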
  • For non-differentiable activations (RELU), compute μ and s^2 to show v^2 ≈ 2/N (He initialization).
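
The closed-form rule in the list above can be evaluated directly. The snippet below is a minimal sketch rather than code from the paper: the function name `init_variance` and the fan-in N = 256 are illustrative assumptions, and the values of g(0) and g'(0) for tanh and sigmoid are the standard ones.

```python
import math


def init_variance(g0: float, gprime0: float, n: int) -> float:
    """Weight variance v^2 = 1 / (N * g'(0)^2 * (1 + g(0)^2))."""
    return 1.0 / (n * gprime0 ** 2 * (1.0 + g0 ** 2))


N = 256  # hypothetical number of inputs per layer

# tanh: g(0) = 0, g'(0) = 1
v2_tanh = init_variance(0.0, 1.0, N)
# sigmoid: g(0) = 0.5, g'(0) = 0.25
v2_sigmoid = init_variance(0.5, 0.25, N)

print("tanh:    N * v^2 =", v2_tanh * N)
print("sigmoid: N * v^2 =", v2_sigmoid * N)
print("sigmoid: sqrt(N) * v =", math.sqrt(v2_sigmoid * N))
```

For tanh the rule reduces to N v^2 = 1. For sigmoid it gives N v^2 = 12.8, so the standard deviation v is about 3.58/√N.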

Experimental results

Research questions

  • RQ1: How should weights be initialized to keep layer input variance stable across deep networks with non-linear activations?
  • RQ2: What are the appropriate initialization scales for differentiable activations and for RELU?
  • RQ3: Why does Xavier initialization fail for RELU, and how does He initialization remedy this?
  • RQ4: How do non-linearities affect the distribution and variance of layer pre-activations and activations?
  • RQ5: Can a unified framework connect Xavier and He initializations across activation types?

Key findings

  • A general initialization formula v^2 = 1 / (N (g'(0))^2 (1+g(0)^2)) for activations differentiable at 0.
  • For tanh (g(0)=0, g'(0)=1), v^2 ≈ 1/N, recovering Xavier initialization.
  • For sigmoid (g(0)=0.5, g'(0)=1/4), v^2 ≈ 12.8/N, i.e. v ≈ 3.6/√N.
  • For RELU (non-differentiable at 0), keeping the variance constant across layers gives v^2 ≈ 2/N, which is He initialization.
  • Xavier initialization leads to vanishing variance in deeper layers for RELU, explaining convergence issues in very deep nets.
  • The 30-layer network example supports He initialization over Xavier for RELU.
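
The vanishing-variance finding for RELU can be illustrated with a small forward-pass simulation. This is a hedged sketch, not a reproduction of the paper's 30-layer experiment: the width, batch size, seed, standard-normal inputs, and the helper name `final_layer_second_moment` are all assumptions chosen for illustration.

```python
import numpy as np


def final_layer_second_moment(scale: float, depth: int = 30, width: int = 512,
                              batch: int = 256, seed: int = 0) -> float:
    """Propagate Gaussian inputs through `depth` fully connected ReLU layers
    whose weights are drawn from N(0, scale / width), and return the mean
    squared pre-activation of the last layer."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((batch, width))
    y = x
    for _ in range(depth):
        w = rng.standard_normal((width, width)) * np.sqrt(scale / width)
        y = x @ w                # pre-activations of this layer
        x = np.maximum(y, 0.0)   # ReLU activations fed to the next layer
    return float(np.mean(y ** 2))


xavier = final_layer_second_moment(scale=1.0)  # v^2 = 1/N
he = final_layer_second_moment(scale=2.0)      # v^2 = 2/N

print(f"Xavier (v^2 = 1/N): {xavier:.3e}")
print(f"He     (v^2 = 2/N): {he:.3e}")
print(f"ratio He / Xavier:  {he / xavier:.3e}")
```

Under these assumptions the Xavier-scaled signal shrinks by roughly a factor of two per layer, while the He-scaled signal stays of order one. Because both runs share a seed and ReLU is positively homogeneous, the weights differ only by a factor of √2 per layer, so the printed ratio is 2^30 up to floating-point rounding.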

This review was generated by AI and checked by a human editor.