QUICK REVIEW

[Paper Review] Random Walk Initialization for Training Very Deep Feedforward Networks

David Sussillo, L. F. Abbott|arXiv (Cornell University)|Dec 19, 2014

Stochastic Gradient Optimization Techniques | 7 references | 70 citations

TL;DR

This paper proposes Random Walk Initialization (RW-I), a weight initialization scheme for very deep feedforward networks that stabilizes gradient flow by ensuring the log-norm of back-propagated error gradients performs an unbiased random walk. By deriving the scaling factor $ g $ that removes the drift of this walk, the method limits fluctuations of the gradient log-norm to grow only as the square root of depth, enabling successful training of networks up to 1000 layers on MNIST and TIMIT with near-zero training error.

ABSTRACT

Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
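
A sketch of the argument for the linear Gaussian case may make the abstract's claims concrete. The symbols $ z_d $ (the back-propagated vector at layer $ d $), $ D $ (network depth), $ \chi^2_N $ (a chi-squared variable with $ N $ degrees of freedom), and $ \psi $ (the digamma function) are notation assumed for this illustration. If $ z_{d+1} = g \mathbf{W}_d z_d $, with a fresh matrix $ \mathbf{W}_d $ at each layer whose entries are i.i.d. $ \mathcal{N}(0, 1/N) $, then each layer rescales the squared norm by an independent random factor, so the log of the squared norm is a sum of i.i.d. steps, that is, a random walk:

$$ \frac{|z_{d+1}|^2}{|z_d|^2} = g^2 \frac{\chi^2_N}{N}, \qquad \ln |z_D|^2 - \ln |z_0|^2 = \sum_{d=0}^{D-1} \left[ \ln g^2 + \ln \frac{\chi^2_{N,d}}{N} \right] $$

The walk is unbiased when the mean step is zero:

$$ \ln g^2 = -\left\langle \ln \frac{\chi^2_N}{N} \right\rangle = \ln \frac{N}{2} - \psi\left(\frac{N}{2}\right) \approx \frac{1}{N} \quad \Rightarrow \quad g \approx \exp\left(\frac{1}{2N}\right) $$

The variance of each step is $ \psi'(N/2) \approx 2/N $, so under these assumptions the spread of the log-norm after $ D $ layers is approximately

$$ \mathrm{Var}\left[\ln |z_D|^2\right] \approx \frac{2D}{N}, \qquad \mathrm{std}\left[\ln |z_D|\right] \approx \sqrt{\frac{D}{2N}} $$

which is linear in depth, inversely proportional to width, and gives the square-root-of-depth scaling of the log-norm. The approximations hold for large $ N $; the ReLU case additionally involves the derivative mask of the nonlinearity and is not covered by this sketch.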

Motivation & Objective

To address the vanishing gradient problem in very deep feedforward networks (FFNs), which traditionally hinders training of networks beyond a few dozen layers.
To analyze how successive applications of random weight matrices during back-propagation affect gradient norm evolution in FFNs, contrasting with recurrent networks.
To derive a principled initialization method—Random Walk Initialization—that ensures the log-norm of gradients performs an unbiased random walk, minimizing exponential gradient decay or explosion.
To empirically validate the method on real-world datasets (MNIST, TIMIT) using stochastic gradient descent, demonstrating feasibility of training extremely deep networks.

Proposed method

Proposes a random matrix model where each layer applies an i.i.d. Gaussian weight matrix with variance $ 1/N $, scaled by a factor $ g $, to simulate gradient back-propagation dynamics.
Analyzes the evolution of the log-norm of the error gradient vector as a random walk, deriving the condition for an unbiased walk by balancing growth and decay rates.
Derives expressions for the optimal $ g $ as a function of layer width $ N $: $ g = \exp(1 / (2N)) $ for linear networks and $ g = \sqrt{2} \exp(1.2 / (\max(N, 6) - 2.4)) $ for ReLU networks (the latter a numerical fit), each chosen so that the random walk of the gradient log-norm has zero drift.
Employs stochastic gradient descent with fixed parameter limits across depths to train networks of varying depth (up to 1000 layers), using $ g $ values derived from theory.
Uses a log-linear plot of training error vs. depth to visualize gradient stability and performance across hyper-parameters like $ \lambda_{in} $, $ \lambda_{out} $, and $ g $.
Validates the method on both classification (MNIST) and autoencoder (MNIST, TIMIT) tasks, showing consistent performance across depths when $ g $ is correctly set.
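
The random matrix model in the list above can be checked numerically. The following is a minimal sketch using only the Python standard library, not the authors' code: the function names, the small width and depth, and the use of the linear-case value $ g = \exp(1/(2N)) $ are assumptions chosen for illustration. It multiplies a unit vector by a fresh Gaussian matrix at each layer and reports the average of $ \ln |z_D|^2 $ over many trials.

```python
import math
import random


def final_log_norm_sq(n, depth, g, rng):
    """Return ln|z_D|^2 after `depth` multiplications by g * W (linear model)."""
    # Start from a random unit vector, so ln|z_0|^2 = 0.
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(v * v for v in z))
    z = [v / norm for v in z]
    sigma = 1.0 / math.sqrt(n)  # entries of W have variance 1/N
    for _ in range(depth):
        # Each output component uses a fresh row of an N x N Gaussian matrix.
        z = [g * sum(rng.gauss(0.0, sigma) * v for v in z) for _ in range(n)]
    return math.log(sum(v * v for v in z))


def mean_final_log_norm_sq(n, depth, g, trials, seed=0):
    """Average ln|z_D|^2 over independent trials (seeded for repeatability)."""
    rng = random.Random(seed)
    total = sum(final_log_norm_sq(n, depth, g, rng) for _ in range(trials))
    return total / trials


if __name__ == "__main__":
    n, depth, trials = 16, 30, 100  # illustrative sizes, kept small for speed
    for label, g in [("g = 1", 1.0), ("g = exp(1/(2N))", math.exp(1.0 / (2 * n)))]:
        print(label, mean_final_log_norm_sq(n, depth, g, trials))
```

With $ g = 1 $ the average should drift negative by roughly $ D/N $, while the scaled version should stay near zero up to sampling noise. Because scaling by $ g $ only adds $ 2 D \ln g $ to the log of the squared norm in a linear model, the two runs with the same seed differ by exactly that constant.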

Experimental results

Research questions

RQ1: Does the gradient norm in very deep feedforward networks grow or decay exponentially with depth, as in recurrent networks?
RQ2: Can the back-propagated gradient norm be stabilized in deep feedforward networks by controlling the scaling of random weight matrices?
RQ3: What is the optimal scaling factor $ g $ that results in an unbiased random walk of the log-gradient norm, minimizing variance growth with depth?
RQ4: Can very deep feedforward networks (e.g., 1000 layers) be successfully trained on real-world datasets using this initialization scheme?

Key findings

The log-norm of the back-propagated error gradient in deep feedforward networks performs an unbiased random walk when the weight scaling factor $ g $ is chosen appropriately, with variance growing linearly with depth and inversely with layer width $ N $.
The log-norm of the gradient scales with the square root of network depth, so the gradient norm no longer grows or decays exponentially at a systematic rate, and the remaining fluctuations can be reduced by widening the layers.
For ReLU networks, the optimal $ g $ is approximately $ \sqrt{2} \exp(1.2 / (\max(N, 6) - 2.4)) $, which approaches $ \sqrt{2} $ for wide layers; the factor $ \sqrt{2} $ compensates for roughly half of the ReLU units being inactive at initialization.
Experiments on MNIST with 1000-layer networks achieved a training error of about 50 mistakes using Random Walk Initialization, demonstrating feasibility of training such deep networks.
On the TIMIT dataset, the best performance was achieved at depth 16, with depth 32 nearly tied, indicating no clear benefit from increased depth, but successful training was still possible with proper initialization.
The method remains effective even with first-order optimization (SGD), though learning rate scheduling and curvature issues become critical in extremely deep networks (e.g., 1000 layers), requiring $ g > 1 $ to stabilize training.
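
To make the scaling factors concrete, the sketch below evaluates them for a given layer width and draws a weight matrix with the corresponding variance. It is an illustration under stated assumptions rather than the authors' implementation: the helper names are hypothetical, the ReLU constants (1.2, 6, 2.4) are quoted from the paper's numerical fit and should be checked against the source, and `log_norm_std` uses the large-width approximation $ \sqrt{D / (2N)} $ for the linear Gaussian model.

```python
import math
import random


def g_linear(n):
    """Scaling for linear layers of width n: exp(1 / (2n))."""
    return math.exp(1.0 / (2.0 * n))


def g_relu(n):
    """Scaling for ReLU layers; constants quoted from the paper's fit (assumption)."""
    return math.sqrt(2.0) * math.exp(1.2 / (max(n, 6) - 2.4))


def log_norm_std(n, depth):
    """Approximate std of ln|z_D| in the linear Gaussian model: sqrt(D / (2N))."""
    return math.sqrt(depth / (2.0 * n))


def init_weights(n, g, rng):
    """Hypothetical helper: n x n matrix with i.i.d. N(0, g^2 / n) entries."""
    sigma = g / math.sqrt(n)
    return [[rng.gauss(0.0, sigma) for _ in range(n)] for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(0)
    for width in (10, 100, 1000):
        print(width, g_linear(width), g_relu(width), log_norm_std(width, 1000))
    weights = init_weights(100, g_relu(100), rng)  # one ReLU layer of width 100
```

Both factors approach their wide-layer limits (1 for linear, $ \sqrt{2} $ for ReLU) as the width grows, and the predicted spread of the log-norm shrinks with width at fixed depth, which is the sense in which wider layers mitigate vanishing gradients.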

This review was created by AI and reviewed by human editors.