QUICK REVIEW

[Paper Review] Theoretical properties of the global optimizer of two layer neural network

Digvijay Boob, Guanghui Lan|arXiv (Cornell University)|Oct 30, 2017

Neural Networks and Applications | References: 13 | Cited by: 27

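One-Sentence Summary

This paper proves that, for two-layer neural networks with differentiable activation functions that are not piecewise linear, first-order optimality implies global optimality whenever the hidden layer is non-singular. It also shows that the objective function is Lipschitz smooth, which gives gradient-based methods an O(1/k) convergence rate to a first-order optimal solution, and that the stochastic algorithm keeps the hidden layer non-singular for any finite number of iterations.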

ABSTRACT

In this paper, we study the problem of optimizing a two-layer artificial neural network that best fits a training dataset. We look at this problem in the setting where the number of parameters is greater than the number of sampled points. We show that for a wide class of differentiable activation functions (this class involves "almost" all functions which are not piecewise linear), we have that first-order optimal solutions satisfy global optimality provided the hidden layer is non-singular. Our results are easily extended to hidden layers given by a flat matrix from that of a square matrix. Results are applicable even if network has more than one hidden layer provided all hidden layers satisfy non-singularity, all activations are from the given "good" class of differentiable functions and optimization is only with respect to the last hidden layer. We also study the smoothness properties of the objective function and show that it is actually Lipschitz smooth, i.e., its gradients do not change sharply. We use smoothness properties to guarantee asymptotic convergence of O(1/number of iterations) to a first-order optimal solution. We also show that our algorithm will maintain non-singularity of hidden layer for any finite number of iterations.

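Motivation and Objectives

Establish theoretical conditions under which first-order optimality implies global optimality in two-layer neural networks.
Analyze the smoothness properties of the neural network objective function, in particular Lipschitz smoothness.
Show that stochastic optimization methods preserve non-singularity of the hidden layer for any finite number of iterations.
Derive convergence rates for gradient-based algorithms on a nonconvex, smooth objective.
Extend the results to deeper networks, under the constraints that the hidden layers are non-singular and the activation functions are not piecewise linear.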

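Proposed Method

Proves that, for differentiable activation functions that are not piecewise linear, a first-order optimal solution is globally optimal if the hidden layer is non-singular.
Proves that the objective function is Lipschitz smooth, i.e., its gradient changes only gradually under small perturbations of the parameters.
Uses the smoothness property to derive an O(1/k) convergence rate, where k is the number of iterations, for gradient descent to reach an ε-approximate first-order optimal solution.
Uses a stochastic gradient method with bounded variance that preserves non-singularity of the hidden layer for any finite number of iterations.
Extends the results to deeper networks by optimizing only the last hidden layer, provided the hidden layers are non-singular and the activation functions belong to the "good" class.
Applies variational analysis and matrix perturbation theory to analyze the gradient dynamics and convergence behavior.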
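To make the setting above concrete, here is a minimal pure-Python sketch, not the authors' algorithm. It runs plain gradient descent on a tiny two-layer network with a smooth, non-piecewise-linear activation (tanh) and records the loss, the squared gradient norm, and the determinant of the hidden-layer weight matrix at each step. Several things here are illustrative assumptions: the toy data, the initial weights, the step size `lr`, the function names `forward`, `loss_and_grads`, and `run_gd`, and the reading of "non-singular hidden layer" as "the square weight matrix `W` has a nonzero determinant".

```python
import math

def forward(W, theta, u):
    """Return hidden activations h = tanh(W u) and the prediction theta^T h."""
    h = [math.tanh(sum(w * x for w, x in zip(row, u))) for row in W]
    return h, sum(t * a for t, a in zip(theta, h))

def loss_and_grads(W, theta, data):
    """Loss (1/2N) * sum_i r_i^2 and its gradients with respect to W and theta."""
    n, d, N = len(W), len(W[0]), len(data)
    loss = 0.0
    grad_W = [[0.0] * d for _ in range(n)]
    grad_theta = [0.0] * n
    for u, v in data:
        h, pred = forward(W, theta, u)
        r = pred - v                                   # residual on this sample
        loss += 0.5 * r * r / N
        for j in range(n):
            grad_theta[j] += r * h[j] / N
            dz = r * theta[j] * (1.0 - h[j] ** 2) / N  # tanh'(z) = 1 - tanh(z)^2
            for k in range(d):
                grad_W[j][k] += dz * u[k]
    return loss, grad_W, grad_theta

def run_gd(steps=300, lr=0.1):
    """Gradient descent on a toy problem (all values below are made up)."""
    # 2 samples, 2x2 hidden layer + 2 output weights: 6 parameters > 2 samples.
    data = [([1.0, 0.0], 0.5), ([0.0, 1.0], -0.5)]
    W = [[0.5, -0.3], [0.2, 0.4]]
    theta = [0.3, -0.2]
    history = []
    for _ in range(steps):
        loss, gW, gt = loss_and_grads(W, theta, data)
        grad_sq = sum(g * g for row in gW for g in row) + sum(g * g for g in gt)
        det = W[0][0] * W[1][1] - W[0][1] * W[1][0]    # non-singularity proxy
        history.append((loss, grad_sq, det))
        W = [[w - lr * g for w, g in zip(rw, rg)] for rw, rg in zip(W, gW)]
        theta = [t - lr * g for t, g in zip(theta, gt)]
    return history

if __name__ == "__main__":
    hist = run_gd()
    for step in (0, 100, 299):
        loss, grad_sq, det = hist[step]
        print(f"step {step}: loss={loss:.6f} |grad|^2={grad_sq:.3e} det(W)={det:.4f}")
```

Inspecting `history` shows how the three quantities evolve together on this toy instance. It is only an illustration of the claims summarized above, not evidence for them.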

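Results

Research Questions

RQ1: Under what conditions does first-order optimality imply global optimality in two-layer neural networks with nonlinear activation functions?
RQ2: Is the objective function of a two-layer neural network Lipschitz smooth, and what does this imply for optimization?
RQ3: Can stochastic optimization methods preserve non-singularity of the hidden layer for any finite number of iterations?
RQ4: How does the choice of activation function affect the global optimality of first-order solutions?
RQ5: What convergence rate can gradient-based methods guarantee on nonconvex, smooth neural network objectives?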

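Key Findings

For a wide class of differentiable activation functions that are not piecewise linear, a first-order optimal solution is globally optimal if the hidden layer is non-singular.
The objective function of a two-layer neural network is Lipschitz smooth, so its gradient does not change abruptly under parameter updates.
Stochastic gradient descent keeps the hidden layer non-singular at every finite iteration, which supports the global convergence guarantee.
Under Lipschitz smoothness, an ε-approximate first-order optimal solution can be reached at a convergence rate of O(1/k).
The expected gradient norm admits a convergence bound of O(1/N_o) that depends explicitly on the initial objective value, the radius R, and the variance parameter.
The results extend to deeper networks when all hidden layers are non-singular and all activation functions belong to the "good" class.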
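
The O(1/k) rate mentioned above follows the familiar pattern for smooth nonconvex objectives. The derivation below is the standard textbook argument for deterministic gradient descent, given here as a sketch and not as the paper's exact statement or constants. The symbols f (objective), L (Lipschitz constant of the gradient), x_k (iterates), f^* (a lower bound on f), and the step size 1/L are notational assumptions for this illustration.

```latex
% Assumption: \nabla f is L-Lipschitz, and x_{k+1} = x_k - \tfrac{1}{L}\nabla f(x_k).
\begin{align*}
f(x_{k+1})
  &\le f(x_k) + \langle \nabla f(x_k),\, x_{k+1} - x_k \rangle
     + \frac{L}{2}\,\|x_{k+1} - x_k\|^2 \\
  &= f(x_k) - \frac{1}{2L}\,\|\nabla f(x_k)\|^2 .
\end{align*}
% Summing over k = 0, \dots, K-1 and using f(x_K) \ge f^*:
\begin{align*}
\frac{1}{2L} \sum_{k=0}^{K-1} \|\nabla f(x_k)\|^2
  \le f(x_0) - f(x_K) \le f(x_0) - f^*
\quad\Longrightarrow\quad
\min_{0 \le k < K} \|\nabla f(x_k)\|^2 \le \frac{2L\,\bigl(f(x_0) - f^*\bigr)}{K}.
\end{align*}
```

The bound depends on the initial objective gap f(x_0) - f^*, which matches the dependence on the initial objective value noted in the findings. The stochastic version summarized above additionally involves the radius R and the variance parameter, which this deterministic sketch does not model.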

This review was generated by AI and reviewed by a human editor.