[Paper Review] Benefits of depth in neural networks
The paper proves a depth separation: there exist deep networks of modest size that shallow networks cannot approximate unless they have exponentially many nodes. The result is stated for ReLU networks and, more generally, for networks built from semi-algebraic gates.
For any positive integer $k$, there exist neural networks with $\Theta(k^3)$ layers, $\Theta(1)$ nodes per layer, and $\Theta(1)$ distinct parameters which cannot be approximated by networks with $\mathcal{O}(k)$ layers unless they are exponentially large: they must possess $\Omega(2^k)$ nodes. This result is proved here for a class of nodes termed "semi-algebraic gates" which includes the common choices of ReLU, maximum, indicator, and piecewise polynomial functions, therefore establishing benefits of depth against not just standard networks with ReLU gates, but also convolutional networks with ReLU and maximization gates, sum-product networks, and boosted decision trees (in this last case with a stronger separation: $\Omega(2^{k^3})$ total tree nodes are required).
Motivation & Objective
- Demonstrate that deep networks can express highly oscillatory functions that shallow networks struggle to approximate.
- Show how oscillation-based counting separates deep from shallow networks using semi-algebraic gates.
- Extend depth hierarchy insights to architectures like convolutional nets, sum-product networks, and boosted decision trees.
Proposed method
- Construct a specific target function requiring many layers to approximate, using ReLU gates.
- Define and analyze semi-algebraic gates to encompass common activations (ReLU, max, piecewise polynomials).
- Use oscillation (crossing) counts to relate depth to function complexity and approximation limits.
- Prove bounds on oscillations under composition vs. addition of layers, leading to a depth separation result.
- Employ a counting/packing argument to show inapproximability of the deep target by shallow nets with limited size.
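The oscillation-counting idea in the list above can be sketched numerically. The example below is a minimal illustration, not the paper's proof: it assumes a tent map on $[0, 1]$ written with ReLU gates as the repeated building block, and the names `tent`, `compose`, `shallow_sum`, and `count_crossings` are hypothetical helpers introduced here. It compares how often the output crosses the level $1/2$ when $k$ tents are composed (depth) versus added side by side (width).

```python
def relu(z):
    """Rectified linear unit on a scalar."""
    return max(z, 0.0)


def tent(x):
    """Tent map from ReLU gates: 2x on [0, 1/2], 2(1 - x) on [1/2, 1], 0 elsewhere."""
    return relu(2.0 * relu(x) - 4.0 * relu(x - 0.5))


def compose(k):
    """Deep construction (assumed for illustration): apply the tent map k times."""
    def f(x):
        for _ in range(k):
            x = tent(x)
        return x
    return f


def shallow_sum(k):
    """Shallow construction: add k side-by-side tents, each of width 1/k."""
    def g(x):
        return sum(tent(k * x - j) for j in range(k))
    return g


def count_crossings(f, level=0.5, n=1 << 14):
    """Count sign changes of f(x) - level on a uniform grid over (0, 1)."""
    above = [f((i + 0.25) / n) > level for i in range(n)]
    return sum(a != b for a, b in zip(above, above[1:]))


if __name__ == "__main__":
    # Columns: k, crossings from adding k tents, crossings from composing k tents.
    for k in (1, 2, 4, 8):
        print(k, count_crossings(shallow_sum(k)), count_crossings(compose(k)))
```

In this sketch, adding $k$ tents yields $2k$ crossings while composing $k$ tents yields $2^k$, which is the linear-versus-exponential gap the counting argument relies on. The grid resolution `n` limits the sketch to moderate $k$ (roughly $k \le 10$ at the default setting).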
Experimental results
Research questions
- RQ1: Can deep neural networks provably represent functions that shallow networks cannot approximate without exponential size?
- RQ2: How do oscillation growth and composition vs. addition of layers contribute to depth separation across architectures?
- RQ3: Do depth-based separations extend to semi-algebraic networks and architectures like CNNs, sum-product networks, and boosted trees?
Key findings
- There exist ReLU networks with $2k^3+8$ layers, $3k^3+12$ total nodes, and $4+d$ distinct parameters (where $d$ is the input dimension) that cannot be approximated to within $1/64$ in $L^1$ error by networks with $\mathcal{O}(k)$ layers unless those networks have exponentially many nodes.
- A deeper network can generate exponentially more oscillations than a shallow one, enabling highly oscillatory target functions to resist shallow approximation.
- The separation also holds against shallow networks built from semi-algebraic gates, which covers ReLU networks, convolutional networks with ReLU and max gates, and sum-product networks. For boosted decision trees the requirement is stronger: $\Omega(2^{k^3})$ total tree nodes.
- Companion results bound the VC dimension of semi-algebraic networks, showing that most random labelings are not well-approximated by deep networks with constrained parameters.
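To give a feel for why a highly oscillatory target resists approximation in $L^1$, the sketch below measures the error of the simplest possible approximator, a constant, against a $k$-fold tent-map composition. This is an illustrative assumption rather than the paper's argument: the paper's $1/64$ bound covers all shallow networks of bounded size, whereas this example only checks one zero-oscillation candidate. The names `iterate_tent` and `l1_error_vs_constant` are hypothetical.

```python
def relu(z):
    """Rectified linear unit on a scalar."""
    return max(z, 0.0)


def tent(x):
    """Tent map from ReLU gates: 2x on [0, 1/2], 2(1 - x) on [1/2, 1]."""
    return relu(2.0 * relu(x) - 4.0 * relu(x - 0.5))


def iterate_tent(x, k):
    """Apply the tent map k times (the assumed deep, oscillatory target)."""
    for _ in range(k):
        x = tent(x)
    return x


def l1_error_vs_constant(k, c=0.5, n=1 << 14):
    """Midpoint-rule estimate of the integral over [0, 1] of |tent^k(x) - c|."""
    total = 0.0
    for i in range(n):
        x = (i + 0.5) / n  # midpoint of the i-th grid cell
        total += abs(iterate_tent(x, k) - c)
    return total / n


if __name__ == "__main__":
    for k in (1, 3, 6, 10):
        print(k, l1_error_vs_constant(k))
```

Under these assumptions the error against the constant $1/2$ stays at $1/4$ for every $k$ the grid can resolve: adding depth to the target never makes it easier for a non-oscillating function to track it.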
This review was created by AI and reviewed by human editors.