QUICK REVIEW

[Paper Review] The Shattered Gradients Problem: If resnets are the answer, then what is the question?

David Balduzzi, Marcus Frean|arXiv (Cornell University)|Feb 27, 2017

Advanced Neural Network Applications | 32 references | 169 citations

TL;DR

The paper defines and analyzes the shattered gradients problem in deep rectifier networks. It shows that gradients in feedforward nets increasingly resemble white noise as depth grows, whereas skip connections (ResNets) preserve gradient structure. It also proposes the "looks linear" (LL) initialization for training very deep nets without skip connections.

ABSTRACT

A long-standing obstacle to progress in deep learning is the problem of vanishing and exploding gradients. Although the problem has largely been overcome via carefully constructed initializations and batch normalization, architectures incorporating skip-connections such as highway and resnets perform much better than standard feedforward architectures despite well-chosen initialization and batch normalization. In this paper, we identify the shattered gradients problem. Specifically, we show that the correlation between gradients in standard feedforward networks decays exponentially with depth, resulting in gradients that resemble white noise, whereas the gradients in architectures with skip-connections are far more resistant to shattering, decaying sublinearly. Detailed empirical evidence is presented in support of the analysis, on both fully-connected networks and convnets. Finally, we present a new "looks linear" (LL) initialization that prevents shattering, with preliminary experiments showing that the new initialization allows training very deep networks without the addition of skip-connections.

Motivation & Objective

Motivate the study of gradient structure in very deep rectifier networks beyond vanishing/exploding gradients.
Characterize how gradient correlations degrade with depth in feedforward nets vs. skip-connected architectures.
Demonstrate empirically the gradient structure in fully-connected nets and convnets at initialization.
Propose initialization and architectural strategies (LL-init, batch norm, β-rescaling) to mitigate shattering.
Provide practical guidance for training very deep networks without sacrificing gradient quality.

Proposed method

Construct a minimal scalar-to-scalar network with 200 rectifier neurons per hidden layer to isolate gradient behavior.
Analyze the gradient as a function of the input, evaluated on a 1D grid, and compute gradient covariance and autocorrelation at increasing depths.
Derive theoretical results (theorems) describing how gradient covariance decays with depth in feedforward nets and ResNets.
Empirically validate gradient structure in fully-connected nets and convnets on real data (CIFAR-10), using batch normalization and various depths.
Introduce the "looks linear" initialization (LL-init) and orthogonal convolutional kernels, and test them on very deep networks.
Compare gradient structure with and without skip connections, and with/without batch normalization and β-rescaling.
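
The measurement described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' code: the function names, the He-style Gaussian weights, the small random biases, the linear input embedding, and the lag-1 autocorrelation statistic are all assumptions chosen to keep the example short. Because the input is a scalar, the gradient is obtained exactly by propagating the derivative forward alongside the activations.

```python
import math
import random

def init_net(depth, width, rng):
    """He-style Gaussian weights; the small random biases are an assumption of this sketch."""
    std = math.sqrt(2.0 / width)
    layers = []
    for _ in range(depth):
        W = [[rng.gauss(0.0, std) for _ in range(width)] for _ in range(width)]
        b = [rng.gauss(0.0, 0.1) for _ in range(width)]
        layers.append((W, b))
    v_in = [rng.gauss(0.0, 1.0) for _ in range(width)]
    w_out = [rng.gauss(0.0, 1.0 / math.sqrt(width)) for _ in range(width)]
    return v_in, layers, w_out

def value_and_grad(net, x, residual=False, beta=1.0):
    """Return f(x) and df/dx for scalar x via forward-mode differentiation."""
    v_in, layers, w_out = net
    h = [v * x for v in v_in]   # linear embedding of the scalar input
    dh = list(v_in)             # derivative of h with respect to x
    for W, b in layers:
        pre = [sum(w * hj for w, hj in zip(row, h)) + bi for row, bi in zip(W, b)]
        dpre = [sum(w * dj for w, dj in zip(row, dh)) for row in W]
        act = [p if p > 0.0 else 0.0 for p in pre]                    # ReLU
        dact = [dp if p > 0.0 else 0.0 for p, dp in zip(pre, dpre)]   # ReLU derivative
        if residual:
            # Residual block with beta-rescaling: h <- h + beta * relu(W h + b)
            h = [hj + beta * a for hj, a in zip(h, act)]
            dh = [dj + beta * da for dj, da in zip(dh, dact)]
        else:
            h, dh = act, dact
    f = sum(w * hj for w, hj in zip(w_out, h))
    df = sum(w * dj for w, dj in zip(w_out, dh))
    return f, df

def lag1_autocorr(values):
    """Correlation between the sequence and itself shifted by one grid step."""
    a, b = values[:-1], values[1:]
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0.0 or vb == 0.0:
        return float("nan")     # constant gradient: correlation undefined
    return cov / math.sqrt(va * vb)

def gradient_autocorr(depth, residual, beta=1.0, width=32, points=128, seed=0):
    """Gradient of a randomly initialized net on a 1D grid, summarized by lag-1 autocorrelation."""
    rng = random.Random(seed)
    net = init_net(depth, width, rng)
    grid = [-2.0 + 4.0 * i / (points - 1) for i in range(points)]
    grads = [value_and_grad(net, x, residual, beta)[1] for x in grid]
    return lag1_autocorr(grads)
```

Calling `gradient_autocorr(depth, residual=False)` and `gradient_autocorr(depth, residual=True, beta=0.2)` over a range of depths gives a rough way to compare how quickly neighboring gradients decorrelate in the two architectures. The width and grid size here are far smaller than the 200-neuron setup in the paper, so the numbers should be read as qualitative only.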

Experimental results

Research questions

RQ1: How does the correlation structure of gradients change with depth in standard feedforward rectifier networks compared to residual networks?
RQ2: Do skip-connections (ResNets) preserve gradient structure and prevent shattering at initialization and during early training?
RQ3: What roles do batch normalization and β-rescaling play in the gradient correlation structure of deep nets?
RQ4: Can an initialization strategy that avoids shattering (LL-init) enable training of very deep networks without skip connections?
RQ5: Do the observed gradient phenomena extend from fully connected nets to convolutional nets on real datasets?

Key findings

Gradients in deep feedforward rectifier networks resemble white noise as depth increases, with gradient correlations decaying exponentially in depth.
Skip-connections in ResNets significantly slow gradient whitening, preserving structure and making training of very deep networks feasible.
Batch normalization alters gradient structure: it keeps neurons active and controls spatial activation patterns, affecting gradient correlations.
β-rescaling in ResNets (β in [0.1,0.3]) further reduces gradient whitening, leading to slower decay of gradient correlations with depth.
A "looks linear" initialization (LL-init) can enable training of very deep networks without skip connections, achieving performance comparable to ResNets in preliminary CIFAR-10 experiments.
Empirical results on CIFAR-10 and convnets show gradient whitening is mitigated in ResNets, and LL-init with orthogonal kernels can train deep nets beyond what standard initializations allow.
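
The idea behind LL-init can be illustrated with a small sketch. As we understand the paper's construction, each layer uses a concatenated rectifier, [relu(x), relu(-x)], together with a mirrored weight matrix [W', -W'], so that the layer computes W' relu(x) - W' relu(-x) = W' x and the whole network is linear at initialization. The function names below are hypothetical, and plain Gaussian matrices stand in for the orthogonal matrices mentioned in the findings; that substitution is an assumption made to keep the example stdlib-only.

```python
import random

def crelu(v):
    """Concatenated rectifier: [relu(v), relu(-v)], which doubles the dimension."""
    return [max(x, 0.0) for x in v] + [max(-x, 0.0) for x in v]

def matvec(M, v):
    """Plain matrix-vector product on nested lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def ll_init_layer(n_out, n_in, rng):
    """Return (W_half, W_mirrored), where W_mirrored = [W_half, -W_half].

    Gaussian entries are an assumption of this sketch, used in place of an
    orthogonal matrix.
    """
    std = 1.0 / n_in ** 0.5
    W_half = [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]
    W_mirrored = [row + [-w for w in row] for row in W_half]
    return W_half, W_mirrored

def forward(mirrored_layers, x):
    """Apply each mirrored layer to the concatenated rectifier of its input."""
    h = x
    for W in mirrored_layers:
        h = matvec(W, crelu(h))
    return h

# Usage: a 10-layer rectifier network that equals a product of linear maps.
rng = random.Random(0)
pairs = [ll_init_layer(6, 6, rng) for _ in range(10)]
x = [rng.uniform(-1.0, 1.0) for _ in range(6)]
linear = x
for W_half, _ in pairs:
    linear = matvec(W_half, linear)
nonlinear = forward([W for _, W in pairs], x)
print(max(abs(a - b) for a, b in zip(linear, nonlinear)))  # tiny rounding error only
```

Because the network is linear at initialization, its input-output Jacobian is the same for every input, so gradients at nearby inputs are perfectly correlated rather than shattered. Once training updates the two halves of each weight matrix independently, the mirror symmetry breaks and the rectifiers act as genuine nonlinearities.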

This review was created by AI and reviewed by human editors.