QUICK REVIEW

[Paper Review] Deep Network Approximation: Beyond ReLU to Diverse Activation Functions

Zhang Shi-jun, Jianfeng Lu|arXiv (Cornell University)|Jul 13, 2023

Stochastic Gradient Optimization Techniques|Cited by 15

One-Sentence Summary

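This paper proposes a unified approximation framework showing that networks with a broad range of activation functions can match the expressive power of ReLU networks. It gives explicit (width, depth) scaling factors and identifies the cases in which those factors can be reduced further.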

ABSTRACT

This paper explores the expressive power of deep neural networks for a diverse range of activation functions. An activation function set $\mathscr{A}$ is defined to encompass the majority of commonly used activation functions, such as $\mathtt{ReLU}$, $\mathtt{LeakyReLU}$, $\mathtt{ReLU}^2$, $\mathtt{ELU}$, $\mathtt{CELU}$, $\mathtt{SELU}$, $\mathtt{Softplus}$, $\mathtt{GELU}$, $\mathtt{SiLU}$, $\mathtt{Swish}$, $\mathtt{Mish}$, $\mathtt{Sigmoid}$, $\mathtt{Tanh}$, $\mathtt{Arctan}$, $\mathtt{Softsign}$, $\mathtt{dSiLU}$, and $\mathtt{SRS}$. We demonstrate that for any activation function $\varrho\in \mathscr{A}$, a $\mathtt{ReLU}$ network of width $N$ and depth $L$ can be approximated to arbitrary precision by a $\varrho$-activated network of width $3N$ and depth $2L$ on any bounded set. This finding enables the extension of most approximation results achieved with $\mathtt{ReLU}$ networks to a wide variety of other activation functions, albeit with slightly increased constants. Significantly, we establish that the (width,$\,$depth) scaling factors can be further reduced from $(3,2)$ to $(1,1)$ if $\varrho$ falls within a specific subset of $\mathscr{A}$. This subset includes activation functions such as $\mathtt{ELU}$, $\mathtt{CELU}$, $\mathtt{SELU}$, $\mathtt{Softplus}$, $\mathtt{GELU}$, $\mathtt{SiLU}$, $\mathtt{Swish}$, and $\mathtt{Mish}$.

Motivation and Objectives

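Understand the expressive power of deep neural networks with activation functions other than ReLU.
Define a broad activation function set that covers most commonly used activation functions.
Establish approximation results on bounded sets that relate networks with non-ReLU activation functions to ReLU networks.
Show that the width-depth requirements can be reduced for certain families of activation functions.
Discuss the implications of extending ReLU-based approximation results to other activation functions.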

Proposed Approach

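Define the set $\mathscr{A}$, covering ReLU, Leaky ReLU, ReLU^2, ELU, GELU, SiLU, Swish, Mish, Sigmoid, Tanh, Arctan, Softsign, dSiLU, and SRS, together with their translations, scalings, and reflections.
Prove that for any $\varrho\in\mathscr{A}$ and any ReLU network $\phi_{\mathtt{ReLU}}$ of width $N$ and depth $L$, on any bounded domain $[-A,A]^d$ there exists a $\varrho$-activated network $\phi_{\varrho}$ of width $3N$ and depth $2L$ that approximates $\phi_{\mathtt{ReLU}}$ arbitrarily well.
Prove a density-type result: under the supremum norm on bounded sets, every network in $\mathcal{NN}_{\mathtt{ReLU}}\{N,L\}$ is a limit of networks in $\mathcal{NN}_{\varrho}\{3N,2L\}$.
Derive corollaries that extend ReLU approximation results to a general activation function $\varrho\in\mathscr{A}$, for function classes such as $C([0,1]^d)$, $C^s([0,1]^d)$, and piecewise linear functions.
Extend the result to activation functions of higher smoothness (stated in terms of $\varrho^{(k)}$), obtaining $\varrho$-activated networks of width $(k+1)N$ and depth $L$ (Theorem 6).
Provide sharper (width, depth) scaling factors in special cases: $(2,1)$ for $\varrho\in\mathscr{A}_2$ and $(1,1)$ for $\varrho$ in the refined subset $\widetilde{\mathscr{A}}_2$, with a detailed comparison (Theorems 8–9).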
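
As intuition for why a $(1,1)$ factor is plausible for an activation such as Softplus, the elementary calculation below shows that a rescaled Softplus converges uniformly to ReLU. It is a sketch for illustration, not the paper's proof, and the scaling parameter $K>0$ is an assumption introduced here rather than notation from the paper.

```latex
\begin{align*}
\mathtt{Softplus}(t) &= \ln\bigl(1+e^{t}\bigr), \\
\frac{1}{K}\,\mathtt{Softplus}(Kx) - \mathtt{ReLU}(x)
  &= \frac{1}{K}\,\ln\bigl(1+e^{-K|x|}\bigr)
  \in \Bigl(0,\ \frac{\ln 2}{K}\Bigr]
  \qquad \text{for all } x\in\mathbb{R},\ K>0 .
\end{align*}
```

The gap is largest at $x=0$ and vanishes as $K\to\infty$. Because the factors $K$ and $1/K$ can be folded into the affine maps before and after the activation, the replacement changes neither the width nor the depth.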

Results

Research Questions

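RQ1: Is there a broad class of activation functions whose networks can approximate ReLU networks on bounded domains with comparable expressive power?
RQ2: What explicit (width, depth) scaling factors are needed to approximate ReLU networks using activation functions in $\mathscr{A}$?
RQ3: How do smoother or more structured activation functions in $\mathscr{A}$ affect the width-depth requirements for universal approximation?
RQ4: Are there subsets of activation functions that admit tighter (smaller) scaling factors, and how do the resulting networks compare with ReLU-based results?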

Key Findings

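For any $\varrho\in\mathscr{A}$, a ReLU network of width $N$ and depth $L$ can be approximated to arbitrary precision on any bounded domain by a $\varrho$-activated network of width $3N$ and depth $2L$.
Consequence: up to these constant factors in width and depth, $\varrho$-activated networks are at least as expressive as ReLU networks, so ReLU-based approximation results carry over to many other activation functions.
If $\varrho$ belongs to the smoother subset $\mathscr{A}_2$ or its refinement $\widetilde{\mathscr{A}}_2$, the (width, depth) scaling factors drop to $(2,1)$ or $(1,1)$, respectively.
Theorems 6–9 extend the main result to higher smoothness ($\varrho\in C^k$) and to specialized activation families, yielding even smaller resource requirements in some cases.
The corollaries show that standard approximation results for continuous and smooth functions ($C([0,1]^d)$, $C^s([0,1]^d)$) hold for $\varrho$-activated networks, with adjusted constants.
Table 1 summarizes the width-depth trade-offs for the different activation classes, and Table 2 gives a representative comparison of the errors when approximating ReLU with different $\varrho$ in $\widetilde{\mathscr{A}}_2$.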
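
The $(1,1)$ case can be illustrated numerically. The sketch below is not taken from the paper. It assumes a small random ReLU network (the helpers `make_network`, `forward`, `absorb_scale`, and `error_bound` are hypothetical names introduced here) and replaces every ReLU with Softplus after rescaling the weights by a factor `K`. The rescaled network has the same width and depth as the original. Since `softplus(K*z)/K` differs from `ReLU(z)` by at most `ln(2)/K` and is 1-Lipschitz, the gap between the two networks is bounded by a constant over `K`.

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


def softplus(z):
    # log(1 + exp(z)), evaluated stably for large |z|
    return np.logaddexp(0.0, z)


def make_network(rng, d, N, L):
    """Random weights for a width-N network with L hidden layers, R^d -> R."""
    sizes = [d] + [N] * L
    hidden = [(rng.standard_normal((m, n)) / np.sqrt(n), rng.standard_normal(m))
              for n, m in zip(sizes[:-1], sizes[1:])]
    out = (rng.standard_normal((1, N)) / np.sqrt(N), rng.standard_normal(1))
    return hidden, out


def forward(params, x, act):
    """Apply act after every hidden affine map, then a final affine map."""
    hidden, (W_out, b_out) = params
    h = x
    for W, b in hidden:
        h = act(h @ W.T + b)
    return h @ W_out.T + b_out


def absorb_scale(params, K):
    """Rescale weights so plain softplus layers act like softplus(K*z)/K layers."""
    hidden, (W_out, b_out) = params
    scaled = [(K * W if i == 0 else W, K * b) for i, (W, b) in enumerate(hidden)]
    return scaled, (W_out / K, b_out)


def error_bound(params, K):
    """Sup-norm bound: each layer adds ln(2)/K and amplifies earlier error by ||W||_inf."""
    hidden, (W_out, _) = params
    e = 0.0
    for W, _ in hidden:
        e = np.abs(W).sum(axis=1).max() * e + np.log(2.0) / K
    return np.abs(W_out).sum(axis=1).max() * e


rng = np.random.default_rng(0)
params = make_network(rng, d=2, N=8, L=3)
x = rng.uniform(-1.0, 1.0, size=(2000, 2))  # samples from the bounded set [-1, 1]^2
reference = forward(params, x, relu)

errors = {}
for K in (1.0, 10.0, 100.0, 1000.0):
    approx = forward(absorb_scale(params, K), x, softplus)
    errors[K] = float(np.abs(approx - reference).max())
    print(f"K={K:7.1f}  max error={errors[K]:.3e}  bound={error_bound(params, K):.3e}")
```

The printed errors are measured on sampled points only, so they illustrate the bound rather than prove it. One would expect them to shrink roughly in proportion to `1/K`.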

This review was generated by AI and reviewed by a human editor.