QUICK REVIEW

[Paper Review] Identity Matters in Deep Learning

Moritz Hardt, Tengyu Ma|arXiv (Cornell University)|Nov 14, 2016

Adversarial Robustness in Machine Learning | 76 citations

TL;DR

This paper argues that identity parameterization, in which each residual block reduces to the identity function when its weights are zero, helps both optimization and expressivity in deep learning. The authors prove that arbitrarily deep linear residual networks have no spurious local optima, and that ReLU residual networks can represent any function of a finite sample once the model has more parameters than the sample size. Guided by this theory, they build an all-convolutional residual model with no batch normalization, dropout, or max pooling that improves significantly on previous all-convolutional networks on CIFAR10, CIFAR100, and ImageNet.

ABSTRACT

An emerging design principle in deep learning is that each layer of a deep artificial neural network should be able to easily express the identity transformation. This idea not only motivated various normalization techniques, such as batch normalization, but was also key to the immense success of residual networks. In this work, we put the principle of identity parameterization on a more solid theoretical footing alongside further empirical progress. We first give a strikingly simple proof that arbitrarily deep linear residual networks have no spurious local optima. The same result for linear feed-forward networks in their standard parameterization is substantially more delicate. Second, we show that residual networks with ReLu activations have universal finite-sample expressivity in the sense that the network can represent any function of its sample provided that the model has more parameters than the sample size. Directly inspired by our theory, we experiment with a radically simple residual architecture consisting of only residual convolutional layers and ReLu activations, but no batch normalization, dropout, or max pool. Our model improves significantly on previous all-convolutional networks on the CIFAR10, CIFAR100, and ImageNet classification benchmarks.

Motivation & Objective

To theoretically justify the design principle of identity parameterization in deep residual networks.
To show that residual networks with ReLU activations can represent any function on a finite dataset when the number of parameters exceeds the sample size.
To demonstrate that a simple all-convolutional residual network without batch normalization or dropout can improve significantly on previous all-convolutional models.
To bridge theory and practice by deriving architectural principles from optimization and expressivity guarantees.
To simplify deep learning architectures by reducing reliance on regularization tricks like batch normalization and dropout.

Proposed method

Prove that deep linear residual networks have no spurious local optima by showing that, wherever every weight matrix $A_i$ has small spectral norm, the gradient vanishes only at a global optimum.
Use a factored parameterization of the form $(I + A_\ell)\cdots(I + A_1)$ to enable identity representation at zero weights.
Construct a universal finite-sample expressivity proof for ReLU residual networks by showing they can represent any function on $n$ samples with $O(n\log n + r^2)$ parameters.
Design a minimal all-convolutional architecture using only residual convolutions and ReLU activations, with no batch normalization, dropout, or pooling layers.
Train the model using standard optimization (momentum SGD) with data augmentation, relying solely on depth and skip connections for performance.
Evaluate the model on CIFAR-10, CIFAR-100, and ImageNet benchmarks to compare against prior all-convolutional and residual architectures.
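
The identity parameterization in the steps above can be illustrated with a minimal NumPy sketch. It is not the authors' code: the function names, the dimensions, and the exact form of the ReLU block ($h + V\,\mathrm{ReLU}(Uh + b)$) are assumptions chosen for illustration. The sketch applies the product $(I + A_\ell)\cdots(I + A_1)$ one layer at a time and checks that all-zero weights leave the input unchanged.

```python
import numpy as np

def linear_residual_forward(A_list, x):
    """Apply (I + A_l) ... (I + A_1) to x, one residual layer at a time."""
    h = x
    for A in A_list:            # A_list[0] plays the role of A_1 and is applied first
        h = h + A @ h           # h <- (I + A) h
    return h

def relu_residual_block(h, U, V, b):
    """One residual block h + V relu(U h + b); this exact form is an assumption."""
    return h + V @ np.maximum(U @ h + b, 0.0)

d, depth = 4, 10                # hypothetical width and depth
x = np.arange(1.0, d + 1.0)

# With every weight set to zero, each layer is exactly the identity map.
zero_weights = [np.zeros((d, d)) for _ in range(depth)]
out_linear = linear_residual_forward(zero_weights, x)
out_relu = relu_residual_block(x, np.zeros((d, d)), np.zeros((d, d)), np.zeros(d))

# For nonzero weights, the layer-by-layer pass matches the explicit matrix product.
rng = np.random.default_rng(0)
A_rand = [0.1 * rng.standard_normal((d, d)) for _ in range(depth)]
P = np.eye(d)
for A in A_rand:
    P = (np.eye(d) + A) @ P     # build (I + A_l) ... (I + A_1) left to right
out_rand = linear_residual_forward(A_rand, x)
```

In a standard feed-forward parameterization, zero weights would instead map every input to zero, which is the contrast the identity parameterization is meant to avoid.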

Experimental results

Research questions

RQ1: Can identity parameterization in residual networks eliminate spurious local optima in deep linear networks?
RQ2: Can ReLU-based residual networks universally express any function on a finite dataset with sufficient model capacity?
RQ3: Can a minimal all-convolutional architecture without batch normalization or dropout achieve state-of-the-art performance on image classification benchmarks?
RQ4: Does the absence of optimization barriers in identity-parameterized networks translate to better generalization and training stability?
RQ5: Can the theoretical benefits of identity parameterization be realized in practice with simple, clean architectures?

Key findings

Deep linear residual networks have no spurious local optima among small-norm solutions: when every weight matrix $A_i$ has spectral norm $O(1/\ell)$, the gradient vanishes only at a global optimum. This rules out bad critical points in that region; it is not by itself a proof that gradient descent converges.
For any linear transformation $R$ with $\det(R) > 0$, there exists a global optimizer in the residual parameterization with $\|A_i\| = O(1/\ell)$ for every layer $i$, so small-norm global optima exist once the network is deep enough.
ReLU-based residual networks have universal finite-sample expressivity: they can represent any function on $n$ samples with $O(n\log n + r^2)$ parameters, where $r$ is the number of classes.
An all-convolutional residual model without batch normalization or dropout achieved $6.38\%$ top-1 error on CIFAR-10 and $24.64\%$ on CIFAR-100, outperforming prior all-convolutional models.
On ImageNet, the same architecture achieved $35.29\%$ top-1 error, significantly better than prior all-convolutional models and competitive despite underfitting, suggesting potential for further improvement with hyperparameter tuning.
The model generalizes well on CIFAR-10 despite having 13.5 million parameters, far more than the 50,000 training images, which suggests that this architecture can generalize without batch normalization or dropout.
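
The finding about small-norm global optimizers can be checked numerically in one easy special case. The sketch below is not the paper's general construction: it assumes $R$ is symmetric positive definite, so that every layer can use the same matrix $A = R^{1/\ell} - I$. The helper name `spd_root_residual` and the test matrix are hypothetical. Under that assumption $(I + A)^\ell = R$ exactly, and $\|A\|$ shrinks roughly like $\gamma/\ell$, where $\gamma$ is the largest $|\log \sigma|$ over the eigenvalues $\sigma$ of $R$.

```python
import numpy as np

def spd_root_residual(R, depth):
    """Return A = R^(1/depth) - I for a symmetric positive definite R."""
    eigvals, eigvecs = np.linalg.eigh(R)
    # V diag(lambda^(1/depth)) V^T is the symmetric depth-th root of R.
    root = (eigvecs * eigvals ** (1.0 / depth)) @ eigvecs.T
    return root - np.eye(R.shape[0])

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
R = M @ M.T + 0.5 * np.eye(5)           # symmetric positive definite target map

norms = {}
for depth in (4, 16, 64, 256):
    A = spd_root_residual(R, depth)
    product = np.linalg.matrix_power(np.eye(5) + A, depth)
    assert np.allclose(product, R)      # the residual layers multiply back to R
    norms[depth] = np.linalg.norm(A, 2) # spectral norm of each layer's A
    print(depth, norms[depth], depth * norms[depth])
```

The printed product `depth * norm` settles near a constant as depth grows, which is the $O(1/\ell)$ scaling in this special case.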

This review was created by AI and reviewed by human editors.