[Paper Review] A Universal Approximation Theorem of Deep Neural Networks for Expressing Distributions.
This paper establishes a universal approximation theorem for deep neural networks in the context of probability distribution generation: under mild conditions, a ReLU network $g$ can be constructed such that the push-forward of a source measure $p_z$ via $\nabla g$ approximates any target distribution $\pi$ arbitrarily closely. The approximation error is bounded in terms of $1$-Wasserstein, MMD, and KSD discrepancies, with network size growing polynomially in dimension $d$ for MMD and KSD, but exponentially for $1$-Wasserstein.
This paper studies the universal approximation property of deep neural networks for representing probability distributions. Given a target distribution $\pi$ and a source distribution $p_z$, both defined on $\mathbb{R}^d$, we prove under some assumptions that there exists a deep neural network $g:\mathbb{R}^d \rightarrow \mathbb{R}$ with ReLU activation such that the push-forward measure $(\nabla g)_\# p_z$ of $p_z$ under the map $\nabla g$ is arbitrarily close to the target measure $\pi$. The closeness is measured by three classes of integral probability metrics between probability distributions: $1$-Wasserstein distance, maximum mean discrepancy (MMD) and kernelized Stein discrepancy (KSD). We prove upper bounds for the size (width and depth) of the deep neural network in terms of the dimension $d$ and the approximation error $\varepsilon$ with respect to the three discrepancies. In particular, the size of the neural network can grow exponentially in $d$ when $1$-Wasserstein distance is used as the discrepancy, whereas for both MMD and KSD the size of the neural network depends on $d$ at most polynomially. Our proof relies on convergence estimates of empirical measures under the aforementioned discrepancies and on semi-discrete optimal transport.
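For readers who want the two central objects of the abstract spelled out, the standard textbook definitions are sketched below. The symbols $\mathcal{F}$ and $d_{\mathcal{F}}$ are notation introduced here for illustration and are not taken from the paper.

$$
(\nabla g)_\# p_z(A) = p_z\big(\{ z \in \mathbb{R}^d : \nabla g(z) \in A \}\big) \quad \text{for every measurable } A \subseteq \mathbb{R}^d,
$$

$$
d_{\mathcal{F}}(p, \pi) = \sup_{f \in \mathcal{F}} \left| \mathbb{E}_{X \sim p}[f(X)] - \mathbb{E}_{Y \sim \pi}[f(Y)] \right|.
$$

Taking $\mathcal{F}$ to be the $1$-Lipschitz functions gives the $1$-Wasserstein distance (by Kantorovich-Rubinstein duality), and taking $\mathcal{F}$ to be the unit ball of a reproducing kernel Hilbert space gives MMD. KSD follows the same template, with a function class built from a Stein operator associated with $\pi$.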
Motivation & Objective
- To establish a universal approximation property for deep neural networks in representing arbitrary probability distributions.
- To analyze the required size (width and depth) of a ReLU network to approximate a target distribution $\pi$ under various integral probability metrics.
- To compare the dependence of network size on dimension $d$ and approximation error $\varepsilon$ across different discrepancy measures.
- To show that for MMD and KSD, the network size grows at most polynomially in $d$, while for $1$-Wasserstein it grows exponentially.
Proposed method
- Construct a deep ReLU neural network $g: \mathbb{R}^d \to \mathbb{R}$ such that the push-forward of $p_z$ under $\nabla g$ approximates the target distribution $\pi$.
- Use convergence estimates of empirical measures under $1$-Wasserstein, MMD, and KSD to bound the approximation error.
- Leverage semi-discrete optimal transport theory to construct the gradient map $\nabla g$ that pushes $p_z$ toward $\pi$.
- Derive upper bounds on the width and depth of $g$ in terms of the dimension $d$ and the desired approximation error $\varepsilon$ for each discrepancy metric.
- Apply theoretical results on empirical measure convergence to control the discrepancy between $(\nabla g)_\# p_z$ and $\pi$.
- Establish that the network size depends polynomially on $d$ for MMD and KSD, but exponentially for $1$-Wasserstein distance.
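The role of ReLU in the construction above can be illustrated with a small sketch. It relies on two standard facts: a maximum of affine functions can be computed with ReLU units through $\max(a, b) = a + \mathrm{ReLU}(b - a)$, and the gradient of such a maximum is, almost everywhere, the slope of the affine piece that attains it. In semi-discrete optimal transport with quadratic cost, a map of this form sends a continuous source onto finitely many atoms. The function names, the atoms and the offsets below are hypothetical choices made for illustration. In particular, the offsets are arbitrary placeholders rather than the values that would give each atom a prescribed mass, and this is not the paper's exact network.

```python
import random


def relu(t):
    return t if t > 0.0 else 0.0


def relu_max(a, b):
    # max(a, b) = a + ReLU(b - a): the identity that lets ReLU units take maxima.
    return a + relu(b - a)


def affine_values(x, atoms, offsets):
    # Values of the affine pieces <x, y_j> + m_j for each atom y_j and offset m_j.
    return [sum(xi * yi for xi, yi in zip(x, y)) + m
            for y, m in zip(atoms, offsets)]


def potential(x, atoms, offsets):
    # g(x) = max_j (<x, y_j> + m_j), folded with pairwise ReLU maxima.
    values = affine_values(x, atoms, offsets)
    best = values[0]
    for v in values[1:]:
        best = relu_max(best, v)
    return best


def argmax_index(x, atoms, offsets):
    # Index of the affine piece that attains the maximum at x.
    values = affine_values(x, atoms, offsets)
    return max(range(len(values)), key=values.__getitem__)


def gradient_map(x, atoms, offsets):
    # Gradient of g at x (away from the kinks): the atom attaining the maximum.
    return atoms[argmax_index(x, atoms, offsets)]


def pushforward_weights(atoms, offsets, n_samples=2000, seed=0):
    # Monte Carlo estimate of the mass that (grad g)_# N(0, I) puts on each atom.
    rng = random.Random(seed)
    dim = len(atoms[0])
    counts = [0] * len(atoms)
    for _ in range(n_samples):
        z = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        counts[argmax_index(z, atoms, offsets)] += 1
    return [c / n_samples for c in counts]


# Hypothetical example: three atoms in the plane, all offsets zero (an assumption).
example_atoms = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0)]
example_offsets = [0.0, 0.0, 0.0]
```

Calling `pushforward_weights(example_atoms, example_offsets)` estimates how a standard Gaussian source is split across the three atoms. Changing the offsets shifts the cell boundaries and therefore the masses, which is the degree of freedom a semi-discrete transport solver would tune.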
Experimental results
Research questions
- RQ1: Can a deep ReLU neural network universally approximate any target probability distribution $\pi$ via push-forward of a source distribution $p_z$ under $\nabla g$?
- RQ2: How does the required size of the network scale with respect to the dimension $d$ and the approximation error $\varepsilon$ when using $1$-Wasserstein distance?
- RQ3: Does the network size grow polynomially or exponentially in $d$ when using MMD or KSD as the discrepancy measure?
- RQ4: What theoretical guarantees can be derived for the approximation error in terms of integral probability metrics?
Key findings
- For the $1$-Wasserstein distance, the required deep neural network size grows exponentially in the dimension $d$ for a given approximation error $\varepsilon$.
- For both MMD and KSD, the network size depends on $d$ at most polynomially, indicating a significantly more favorable scaling than for $1$-Wasserstein.
- The paper proves the existence of a ReLU network $g$ such that $(\nabla g)_\# p_z$ approximates $\pi$ to within $\varepsilon$ in all three discrepancy metrics.
- The construction relies on convergence estimates of empirical measures and semi-discrete optimal transport, which are used to bound the approximation error.
- The theoretical framework provides explicit upper bounds on the width and depth of the network in terms of $d$ and $\varepsilon$ for each discrepancy.
- The results establish a universal approximation property for deep networks in the context of distribution generation, with distinct scaling behaviors depending on the choice of discrepancy metric.
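To make one of the three discrepancies concrete, the sketch below estimates the squared MMD between two finite samples with the standard biased (V-statistic) estimator. The Gaussian kernel, the bandwidth, the sample sizes and all function names are assumptions chosen for illustration. The review does not say which kernels the paper considers, and this sketch does not reproduce any of the paper's bounds.

```python
import math
import random


def gaussian_kernel(x, y, bandwidth=1.0):
    # k(x, y) = exp(-|x - y|^2 / (2 * bandwidth^2)); the kernel choice is an assumption.
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mean_kernel(xs, ys, bandwidth=1.0):
    # Average of k(x, y) over all pairs (x, y) with x in xs and y in ys.
    total = sum(gaussian_kernel(x, y, bandwidth) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(xs, ys, bandwidth=1.0):
    # Biased (V-statistic) estimate: E k(X, X') + E k(Y, Y') - 2 E k(X, Y).
    return (mean_kernel(xs, xs, bandwidth)
            + mean_kernel(ys, ys, bandwidth)
            - 2.0 * mean_kernel(xs, ys, bandwidth))


def gaussian_sample(n, dim, shift, rng):
    # n points in R^dim with independent N(shift, 1) coordinates.
    return [[rng.gauss(shift, 1.0) for _ in range(dim)] for _ in range(n)]
```

A typical check compares a sample against a second sample from the same distribution and against one from a shifted distribution. The first estimate should be close to zero and the second clearly larger.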
This review was created by AI and reviewed by human editors.