QUICK REVIEW

[Paper Review] The Shattered Gradients Problem: If resnets are the answer, then what is the question?

David Balduzzi, Marcus Frean|arXiv (Cornell University)|Feb 28, 2017

Domain Adaptation and Few-Shot Learning|References: 28|Citations: 107

One-Sentence Summary

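This paper identifies the shattered gradients problem, which worsens with depth in deep feedforward networks, shows that skip-connections (ResNets) preserve gradient structure, and introduces the LL-init, which makes it possible to train very deep networks without skip-connections.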

ABSTRACT

A long-standing obstacle to progress in deep learning is the problem of vanishing and exploding gradients. Although, the problem has largely been overcome via carefully constructed initializations and batch normalization, architectures incorporating skip-connections such as highway and resnets perform much better than standard feedforward architectures despite well-chosen initialization and batch normalization. In this paper, we identify the shattered gradients problem. Specifically, we show that the correlation between gradients in standard feedforward networks decays exponentially with depth resulting in gradients that resemble white noise whereas, in contrast, the gradients in architectures with skip-connections are far more resistant to shattering, decaying sublinearly. Detailed empirical evidence is presented in support of the analysis, on both fully-connected networks and convnets. Finally, we present a new "looks linear" (LL) initialization that prevents shattering, with preliminary experiments showing the new initialization allows to train very deep networks without the addition of skip-connections.

Research Motivation and Objectives

Motivate and formalize the shattered gradients problem: in deep rectifier networks, gradients increasingly resemble white noise as depth grows.
Analyze how skip-connections (ResNets) alter gradient correlations and mitigate shattering.
Propose initialization strategies, including the looks-linear (LL-init), to prevent shattering and enable very deep training.

Proposed Method

Construct a minimal scalar-valued network to study gradient structure independent of data noise.
Derive and analyze gradient covariance and correlation under feedforward, ResNet, and highway architectures with and without batch normalization.
Use autocorrelation functions and a path-weight decomposition to explain how gradient structure changes with depth (Theorems 1–3).
Empirically validate gradient structure and shattering in convnets on CIFAR-10 and MNIST-like setups.
Propose and test the LL-init, including concatenated rectifiers and orthogonal convolutions, to enable training very deep networks without skip connections.
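
The scalar-network analysis in the list above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' code: the width (200), the input grid (256 points on [-2, 2]), the first-layer bias, the seed, and the helper names `input_gradients` and `lag_autocorrelation` are all assumptions made for the example. It builds a random ReLU network with He-initialized weights, computes dy/dx at every grid point by propagating the Jacobian forward, and reports how strongly the gradient at one grid point correlates with the gradient at its neighbour.

```python
import numpy as np


def input_gradients(depth, width=200, n_points=256, seed=0):
    """Return dy/dx on a 1-D grid for a random scalar-in, scalar-out ReLU net."""
    rng = np.random.default_rng(seed)
    x = np.linspace(-2.0, 2.0, n_points)  # 1-D grid of inputs

    # First layer: scalar input -> `width` units. The bias is an assumption of
    # this sketch; without any bias the gradient could only change at x = 0.
    w_in = rng.normal(0.0, 1.0, size=width)
    b_in = rng.normal(0.0, 1.0, size=width)
    pre = np.outer(x, w_in) + b_in          # shape (n_points, width)
    jac = (pre > 0) * w_in                  # d(hidden)/dx at each grid point
    h = np.maximum(pre, 0.0)

    # Remaining hidden layers: He-initialized weights, no bias.
    for _ in range(depth - 1):
        W = rng.normal(0.0, np.sqrt(2.0 / width), size=(width, width))
        pre = h @ W
        mask = pre > 0
        jac = (jac @ W) * mask              # chain rule through the ReLU
        h = pre * mask

    v = rng.normal(0.0, np.sqrt(1.0 / width), size=width)  # linear readout
    return jac @ v                          # shape (n_points,)


def lag_autocorrelation(g, lag=1):
    """Sample autocorrelation of the gradient sequence at the given lag."""
    g = g - g.mean()
    return float(np.dot(g[:-lag], g[lag:]) / np.dot(g, g))


if __name__ == "__main__":
    for depth in (2, 10, 50):
        g = input_gradients(depth)
        print(f"depth={depth:3d}  lag-1 autocorrelation={lag_autocorrelation(g):.3f}")
```

If shattering occurs as the paper describes, the autocorrelation should fall as depth grows; the exact values depend on the seed, width, and bias scale chosen here.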

Experimental Results

Research Questions

RQ1: Do gradients in deep rectifier networks lose structure and resemble white noise as depth increases (shattering)?
RQ2: How do skip-connections and batch normalization affect gradient correlations across depth?
RQ3: Can initialization strategies like LL-init prevent shattering and allow training of very deep networks without skip connections?

Key Findings

Gradients in deep feedforward nets become increasingly uncorrelated with depth, resembling white noise (Theorem 1).
Skip-connections in ResNets significantly preserve gradient structure, with correlations decaying sublinearly with depth (Theorem 3).
Batch normalization alters gradient correlation decay and can slow whitening, but it does not by itself fully eliminate shattering; the theorems contrast ResNets without batch normalization (Theorem 2) with ResNets that use batch normalization and β-rescaling (Theorem 3).
β-rescaling and BN reduce gradient whitening in ResNets, enabling much deeper networks to train than plain feedforward nets (Figure 2 and related discussion).
The LL-init can prevent shattering and, in extremely deep nets, allow training without skip connections (Figure 6).
Experiments on CIFAR-10 show ResNets maintain gradient structure at depth better than feedforward nets, especially with BN and β-rescaling.
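
The idea behind the LL-init finding can be illustrated with a small NumPy sketch. It assumes square fully-connected layers with orthogonal weights and a concatenated rectifier; the layer size, depth, and the names `crelu`, `ll_init`, and `forward` are hypothetical choices for this example rather than the paper's implementation. The sketch relies on the identity W·relu(x) − W·relu(−x) = W·x: with mirrored weights [W, −W] applied to the concatenated rectifier, every layer acts as a linear map at initialization, so the whole network "looks linear" and its input gradient is the same at every input.

```python
import numpy as np


def crelu(x):
    """Concatenated rectifier: relu(x) stacked on top of relu(-x)."""
    return np.concatenate([np.maximum(x, 0.0), np.maximum(-x, 0.0)])


def ll_init(n, rng):
    """Mirrored weights [W, -W] with W orthogonal (square layers assumed)."""
    W, _ = np.linalg.qr(rng.normal(size=(n, n)))  # Q factor is orthogonal
    return np.concatenate([W, -W], axis=1)        # shape (n, 2n)


def forward(x, weights):
    """Apply each mirrored layer to the concatenated rectifier of its input."""
    h = x
    for Wm in weights:
        h = Wm @ crelu(h)  # equals W @ h while the weights stay mirrored
    return h


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    weights = [ll_init(8, rng) for _ in range(100)]  # 100 layers, no skips
    a, b = rng.normal(size=8), rng.normal(size=8)

    # Linearity at initialization: f(a + b) matches f(a) + f(b).
    print(np.allclose(forward(a + b, weights),
                      forward(a, weights) + forward(b, weights)))
    # Orthogonal layers also preserve the norm of the signal.
    print(np.isclose(np.linalg.norm(forward(a, weights)), np.linalg.norm(a)))
```

The linear behaviour holds only at initialization: once training updates the two halves of each weight matrix independently, the mirror symmetry breaks and the network becomes nonlinear.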


This review was generated by AI and checked by a human editor.