QUICK REVIEW

[Paper Review] Polynomial Convergence of Gradient Descent for Training One-Hidden-Layer Neural Networks.

Santosh Vempala, John Wilmes|arXiv (Cornell University)|May 7, 2018

Stochastic Gradient Optimization Techniques | References: 27 | Cited by: 21

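One-Sentence Summary

The paper proves that gradient descent on a one-hidden-layer neural network, with both the network size and the number of iterations bounded by $n^{O(k)}$, converges to the error of the best polynomial approximation of degree at most $k$ to a bounded target function on $n$ inputs. The key existence result shows that any bounded function with a degree-$k$ polynomial approximation of error $\epsilon_0$ can be approximated to within error $\epsilon_0 + \epsilon$ by a linear combination of $n^{O(k)} \cdot \text{poly}(1/\epsilon)$ randomly chosen gates, for gate classes such as unbiased sigmoids and ReLUs.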

ABSTRACT

We analyze Gradient Descent applied to learning a bounded target function on $n$ real-valued inputs by training a neural network with a single hidden layer of nonlinear gates. Our main finding is that GD starting from a randomly initialized network converges in mean squared loss to the minimum error (in 2-norm) of the best approximation of the target function using a polynomial of degree at most $k$. Moreover, the size of the network and number of iterations needed are both bounded by $n^{O(k)}$. The core of our analysis is the following existence theorem, which is of independent interest: for any $\epsilon > 0$, any bounded function that has a degree-$k$ polynomial approximation with error $\epsilon_0$ (in 2-norm), can be approximated to within error $\epsilon_0 + \epsilon$ as a linear combination of $n^{O(k)} \mbox{poly}(1/\epsilon)$ randomly chosen gates from any class of gates whose corresponding activation function has nonzero coefficients in its harmonic expansion for degrees up to $k$. In particular, this applies to training networks of unbiased sigmoids and ReLUs.
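
The existence theorem in the abstract can be written compactly as follows. This is a restatement rather than the paper's exact formulation, and the notation is an assumption made here: $f$ is the bounded target, $p$ a polynomial of degree at most $k$, $\sigma$ the activation function, $u_1, \dots, u_m$ the randomly chosen gate directions, $a_1, \dots, a_m$ the output coefficients, and $\|\cdot\|_2$ the 2-norm with respect to the input distribution.

$$
\|f - p\|_2 \le \epsilon_0
\quad\Longrightarrow\quad
\Bigl\| f - \sum_{i=1}^{m} a_i \, \sigma(\langle u_i, x \rangle) \Bigr\|_2 \le \epsilon_0 + \epsilon
\quad\text{for some } a_1, \dots, a_m,
\qquad m = n^{O(k)} \, \text{poly}(1/\epsilon).
$$

The condition on $\sigma$ is the one stated in the abstract: its harmonic expansion must have nonzero coefficients for degrees up to $k$.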

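Motivation and Objectives

Analyze the convergence of gradient descent when training a one-hidden-layer neural network on a bounded target function.
Establish that GD converges to the error (in 2-norm) of the best approximation of the target by a polynomial of degree at most $k$.
Show that the network size and the number of iterations needed are both bounded by $n^{O(k)}$.
Prove a general existence theorem: functions can be approximated by linear combinations of random nonlinear gates from classes such as ReLUs and unbiased sigmoids.
Show that such gates achieve error within $\epsilon_0 + \epsilon$, where $\epsilon_0$ is the error of the best degree-$k$ polynomial approximation.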

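Proposed Method

The analysis rests on a new existence theorem: any bounded function that has a degree-$k$ polynomial approximation with error $\epsilon_0$ can be approximated to within error $\epsilon_0 + \epsilon$ by a linear combination of $n^{O(k)} \cdot \text{poly}(1/\epsilon)$ randomly chosen gates.
The method uses harmonic analysis to show that any activation function with nonzero coefficients in its harmonic expansion for degrees up to $k$ supports such approximations.
Concentration of measure and random matrix theory are used to bound the number of random gates the approximation requires.
Convergence of gradient descent is established through stability and approximation arguments built on the existence theorem.
The network is randomly initialized, and GD is shown to converge to the minimum error achievable by a degree-$k$ polynomial approximation.
The proof uses the fact that the span of the random gates approximates the space of degree-$k$ polynomials closely enough to allow efficient learning.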
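
A minimal numerical sketch of the setting described above, under simplifying assumptions that are not taken from the paper: inputs are drawn uniformly from the unit sphere, the hidden ReLU gates are fixed at their random initialization, and gradient descent updates only the output weights. The target `y`, the sizes `n`, `width`, `samples`, and the helper names are all hypothetical choices for illustration.

```python
import numpy as np

def sample_sphere(rng, m, n):
    """Draw m points uniformly from the unit sphere in R^n."""
    g = rng.standard_normal((m, n))
    return g / np.linalg.norm(g, axis=1, keepdims=True)

def relu_features(X, U):
    """Hidden-layer outputs relu(<u_j, x_i>) for fixed gate directions U."""
    return np.maximum(X @ U.T, 0.0)

def train_output_layer(Phi, y, steps=2000):
    """Full-batch gradient descent on mean squared loss, output weights only."""
    m = Phi.shape[0]
    # Smoothness constant of the loss: largest eigenvalue of (2/m) Phi^T Phi.
    L = 2.0 * np.linalg.norm(Phi, 2) ** 2 / m
    a = np.zeros(Phi.shape[1])
    losses = []
    for _ in range(steps):
        resid = Phi @ a - y
        losses.append(float(np.mean(resid ** 2)))
        # Gradient of mean((Phi a - y)^2) with respect to a, step size 1/L.
        a -= (1.0 / L) * (2.0 / m) * (Phi.T @ resid)
    return a, losses

rng = np.random.default_rng(0)
n, width, samples = 5, 200, 2000
X = sample_sphere(rng, samples, n)
y = X[:, 0] * X[:, 1]              # hypothetical degree-2 polynomial target
U = sample_sphere(rng, width, n)   # random gates, assumed fixed during training
a, losses = train_output_layer(relu_features(X, U), y)
```

With the hidden layer fixed, the loss is a convex quadratic in the output weights, so a step size of `1/L` keeps the recorded training loss from increasing. How small the loss gets depends on the width and the number of steps, and the sketch makes no claim about matching the paper's rates.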

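Results

Research Questions

RQ1: Can gradient descent on a one-hidden-layer neural network converge to the best degree-$k$ polynomial approximation of a bounded target function?
RQ2: What network size and number of iterations does this convergence require?
RQ3: Can random nonlinear gates from classes such as ReLUs and unbiased sigmoids approximate a bounded function to within $\epsilon_0 + \epsilon$, where $\epsilon_0$ is the error of its best degree-$k$ polynomial approximation?
RQ4: How many random gates are needed to reach this approximation accuracy?
RQ5: Does convergence depend on the harmonic expansion of the activation function?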

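Key Findings

Gradient descent converges in mean squared loss to the minimum 2-norm error achievable by a degree-$k$ polynomial approximation of the target function.
The network size and the number of iterations needed are both bounded by $n^{O(k)}$.
The approximation error achieved is at most $\epsilon_0 + \epsilon$, where $\epsilon_0$ is the 2-norm error of the best degree-$k$ polynomial approximation.
The result applies to any class of gates whose activation function has nonzero harmonic coefficients for degrees up to $k$, including networks of unbiased sigmoids and ReLUs.
Reaching accuracy $\epsilon$ takes $n^{O(k)} \cdot \text{poly}(1/\epsilon)$ random gates, which for fixed $k$ is polynomial in both $n$ and $1/\epsilon$.
The analysis yields a general existence result: the span of random gates from such classes approximates every degree-$k$ polynomial to within error $\epsilon$.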
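
As a sanity check on the $n^{O(k)}$ scaling, note that the space of polynomials of degree at most $k$ in $n$ variables has dimension $\binom{n+k}{k}$, which is at most $(n+1)^k$. The snippet below only illustrates this standard count; it is not the paper's bound, and the function name is a hypothetical choice.

```python
from math import comb

def num_monomials(n, k):
    """Number of monomials of degree at most k in n variables: C(n + k, k)."""
    return comb(n + k, k)

# For fixed k, the count grows polynomially in n.
for n in (10, 100, 1000):
    print(n, [num_monomials(n, k) for k in (1, 2, 3)])
```

For example, with two variables and degree at most 2 the monomials are $1, x, y, x^2, xy, y^2$, so the count is 6.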


This review was generated by AI and reviewed by human editors.