QUICK REVIEW

[Paper Review] Learning Functions: When Is Deep Better Than Shallow

H. N. Mhaskar, Qianli Liao | arXiv (Cornell University) | Mar 3, 2016
Domain Adaptation and Few-Shot Learning | 25 references | 105 citations
TL;DR

The paper proves that deep (hierarchical) networks can approximate compositional functions with the same accuracy as shallow networks but with exponentially fewer training parameters and smaller VC-dimension, addressing Bengio’s depth conjecture.

ABSTRACT

While the universal approximation property holds both for hierarchical and shallow networks, we prove that deep (hierarchical) networks can approximate the class of compositional functions with the same accuracy as shallow networks but with exponentially lower number of training parameters as well as VC-dimension. This theorem settles an old conjecture by Bengio on the role of depth in networks. We then define a general class of scalable, shift-invariant algorithms to show a simple and natural set of requirements that justify deep convolutional networks.

Motivation & Objective

  • Motivate the question of when depth provides advantages over shallow networks.
  • Quantitatively compare shallow and deep architectures for approximating compositional functions.
  • Establish approximation bounds showing parameter and VC-dimension savings for deep nets.
  • Link hierarchical compositional structure to practical scalable, shift-invariant deep convolutional networks.
  • Provide a framework connecting universal approximation with depth through binary-tree models and Gaussian networks.

Proposed method

  • Model deep networks as binary-tree hierarchies of ridge-function units.
  • Compare approximation power of shallow nets S_n and deep nets D_n for functions in corresponding smoothness classes.
  • Prove approximation rates: dist(f, S_n) = O(n^{-r/d}) for f in W_{r,d}^{NN} and dist(f, D_n) = O(n^{-r/2}) for f in W_{H,r,d}^{NN}.
  • Extend the analysis to Gaussian networks, defining the function spaces W_{r,d}, the K-functional-based norms K_{r,d}(f,δ), and the corresponding γ-smooth classes.
  • Derive VC-dimension bounds for shallow and binary-tree deep networks and relate to fat-shattering dimensions.
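The binary-tree model in the list above can be illustrated with a small sketch. It is not the paper's construction: it only shows how a function of d inputs can be evaluated as a hierarchy of bivariate nodes. The names `eval_binary_tree`, `num_nodes`, and `smooth_node` are hypothetical, and reusing the same node function at every position is a simplifying assumption.

```python
import math
from typing import Callable, Sequence

# A constituent function of two variables (one node of the tree).
Bivariate = Callable[[float, float], float]


def eval_binary_tree(x: Sequence[float], node: Bivariate) -> float:
    """Evaluate a compositional function on a balanced binary tree.

    Adjacent pairs of values are merged by `node`, level by level,
    until a single value remains. len(x) must be a power of two.
    Assumption: the same `node` is reused everywhere for brevity.
    """
    d = len(x)
    if d < 1 or d & (d - 1):
        raise ValueError("len(x) must be a power of two")
    level = list(x)
    while len(level) > 1:
        level = [node(level[i], level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


def num_nodes(d: int) -> int:
    """Number of bivariate nodes in a full binary tree with d leaves."""
    return d - 1


def smooth_node(a: float, b: float) -> float:
    """Hypothetical smooth constituent function used only for illustration."""
    return math.tanh(a + 2.0 * b)


if __name__ == "__main__":
    x = [0.1 * k for k in range(8)]  # d = 8 inputs
    print("value:", eval_binary_tree(x, smooth_node))
    print("bivariate nodes:", num_nodes(len(x)))
```

Each node depends on only two variables, so approximating it is a two-dimensional problem regardless of d. That is the intuition behind the exponent r/2 replacing r/d in the deep rate.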

Research questions

  • RQ1: When and why does depth give a quantitative advantage in approximating functions, particularly those with compositional structure?
  • RQ2: How do approximation rates and parameter complexity scale for shallow vs. deep networks under smoothness assumptions?
  • RQ3: Can hierarchical structure and shift-invariance (as in convolutional networks) be theoretically justified as natural for scalable algorithms?
  • RQ4: What are the VC-dimension implications of deep hierarchical architectures compared to shallow ones?
  • RQ5: Do Gaussian networks exhibit similar depth-related improvements under analogous assumptions?

Key findings

  • Deep networks can match shallow networks in approximation accuracy for compositional functions but with exponentially fewer parameters.
  • For general smooth functions, shallow networks require O(ε^{-d/r}) parameters to achieve accuracy ε, while deep networks matching compositional structure need only O(ε^{-2/r}) parameters.
  • The theorems show that, at the same smoothness r, approximation error decays faster in n for deep hierarchical networks, O(n^{-r/2}), than for shallow ones, O(n^{-r/d}), whenever d > 2.
  • The VC-dimension bounds for deep binary-tree networks are smaller than those for shallow networks achieving the same accuracy, reflecting reduced complexity.
  • A general framework shows scalable, shift-invariant deep convolutional networks are natural for compositional, multi-scale data like images.
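To make the gap between O(ε^{-d/r}) and O(ε^{-2/r}) concrete, the following sketch evaluates both scalings numerically. It rests on assumptions not stated in the review: all constants hidden in the O-notation are set to 1, and the deep count is multiplied by d − 1 (one bivariate node per internal node of a binary tree with d leaves). The function names are hypothetical, and the numbers illustrate only the order of growth, not actual network sizes.

```python
def shallow_units(eps: float, d: int, r: float) -> float:
    """Order-of-magnitude unit count for a shallow network: eps^(-d/r).

    Assumption: the constant in the O-bound is taken to be 1.
    """
    return eps ** (-d / r)


def deep_units(eps: float, d: int, r: float) -> float:
    """Order-of-magnitude unit count for a binary-tree network.

    Assumption: each of the d - 1 bivariate nodes needs eps^(-2/r)
    units, with the constant in the O-bound taken to be 1.
    """
    return (d - 1) * eps ** (-2 / r)


if __name__ == "__main__":
    eps, r = 0.1, 2.0
    for d in (2, 4, 8, 16):
        s = shallow_units(eps, d, r)
        t = deep_units(eps, d, r)
        print(f"d={d:2d}  shallow~{s:.3g}  deep~{t:.3g}")
```

Under these assumptions the shallow count grows exponentially in d, while the deep count grows only linearly, which is the sense in which the savings are exponential.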

This review was created by AI and reviewed by human editors.