[Paper Explained] Towards moderate overparameterization: global convergence guarantees for training shallow neural networks
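The paper proves that, for one-hidden-layer neural networks with smooth or ReLU activations, gradient descent (and SGD) converges at a fast geometric rate to a global optimum that perfectly fits the training data once the number of parameters exceeds the amount of data by a sufficient margin: roughly kd ≳ n^2 in the smooth case, and sqrt(kd) up to order n^2/d in the ReLU case.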
Many modern neural network architectures are trained in an overparameterized regime where the number of model parameters exceeds the size of the training dataset. Sufficiently overparameterized neural network architectures in principle have the capacity to fit any set of labels, including random noise. However, given the highly nonconvex nature of the training landscape, it is not clear what level and kind of overparameterization is required for first-order methods to converge to a global optimum that perfectly interpolates any labels. A number of recent theoretical works have shown that for very wide neural networks, where the number of hidden units is polynomially large in the size of the training data, gradient descent starting from a random initialization does indeed converge to a global optimum. However, in practice much more moderate levels of overparameterization seem to be sufficient, and in many cases overparameterized models seem to perfectly interpolate the training data as soon as the number of parameters exceeds the size of the training data by a constant factor. Thus there is a huge gap between the existing theoretical literature and practical experiments. In this paper we take a step towards closing this gap. Focusing on shallow neural nets and smooth activations, we show that (stochastic) gradient descent, when initialized at random, converges at a geometric rate to a nearby global optimum as soon as the square root of the number of network parameters exceeds the size of the training data. Our results also benefit from a fast convergence rate and continue to hold for non-differentiable activations such as Rectified Linear Units (ReLUs).
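Motivation and goals
- Motivate and quantify the level of overparameterization that first-order methods need in order to converge globally when training shallow networks.
- Prove that, from a random initialization, gradient descent converges geometrically to a global optimum that interpolates all of the training data.
- Extend the results to ReLU activations and to SGD, with convergence guarantees and rates.
- Narrow the gap between theory and practice by showing that moderate overparameterization suffices, not only extremely wide networks.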
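Proposed method
- Analyze the one-hidden-layer network f(x;W)=v^T phi(Wx), where v is fixed and W is trained, under a quadratic loss.
- Derive the update rules for gradient descent and SGD, and establish conditions on kd in terms of n and properties of the data.
- Bound the spectrum of the Jacobian at initialization using spectral properties of Khatri-Rao and Hadamard products together with random matrix theory.
- Prove a geometric convergence rate: with high probability, ||f(W_τ)-y||_2 decays at the rate (1 - c μ^2/B^2 …)^τ.
- Give corollaries for standard data models (e.g., random data on the unit sphere) that exhibit the kd ≳ n^2 scaling.
- Extend the results to ReLU activations, adjusting the overparameterization requirement and giving analogous convergence statements.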
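The setup in the list above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the softplus activation, the ±1/sqrt(k) choice for the fixed output weights `v`, the step size, and the problem sizes `n`, `d`, `k` are all assumptions made here for the sake of a runnable example. The `jacobian` helper materializes the Jacobian of f with respect to vec(W), whose rows have the Khatri-Rao structure (v ⊙ phi'(W x_i)) ⊗ x_i.

```python
import numpy as np

def softplus(z):
    # Smooth activation phi(z) = log(1 + exp(z)), computed stably.
    return np.logaddexp(0.0, z)

def softplus_grad(z):
    # phi'(z) is the logistic sigmoid, written via tanh for stability.
    return 0.5 * (1.0 + np.tanh(0.5 * z))

def predict(W, X, v):
    # f(x; W) = v^T phi(W x) for every row x of X; returns shape (n,).
    return softplus(X @ W.T) @ v

def jacobian(W, X, v):
    # Row i is d f(x_i; W) / d vec(W) = (v * phi'(W x_i)) kron x_i.
    A = softplus_grad(X @ W.T) * v                      # shape (n, k)
    return (A[:, :, None] * X[:, None, :]).reshape(X.shape[0], -1)

def train(X, y, k, steps=2000, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.standard_normal((k, d))                     # random initialization
    v = rng.choice([-1.0, 1.0], size=k) / np.sqrt(k)    # fixed, never updated
    # Assumed step size: half the inverse squared spectral norm of J at init.
    eta = 0.5 / np.linalg.norm(jacobian(W, X, v), 2) ** 2
    history = np.empty(steps)
    for t in range(steps):
        r = predict(W, X, v) - y                        # residual f(W) - y
        history[t] = np.linalg.norm(r)
        A = softplus_grad(X @ W.T) * v
        W = W - eta * (A * r[:, None]).T @ X            # gradient of 0.5*||r||^2
    return W, v, history

# Hypothetical toy problem: random unit-norm inputs, random labels, kd >> n^2.
rng = np.random.default_rng(0)
n, d, k = 10, 20, 200
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = rng.standard_normal(n)
W, v, history = train(X, y, k)
print("initial residual:", history[0], "final residual:", history[-1])
```

With `A = softplus_grad(X @ W.T) * v`, the Gram matrix `J @ J.T` equals the Hadamard product `(A @ A.T) * (X @ X.T)`, which is why spectral bounds on Khatri-Rao and Hadamard products control the Jacobian spectrum.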
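Experimental results
Research questions
- RQ1: What is the minimum level of overparameterization needed to reach zero training error in shallow networks?
- RQ2: When kd exceeds the amount of data by a constant factor, do random initialization and first-order methods converge to a global optimum?
- RQ3: How do smooth and ReLU activations differ in the overparameterization they require and in their convergence rates?
- RQ4: Do SGD updates inherit the global convergence guarantees established for full-batch gradient descent?
- RQ5: What do these results imply about the actual gap between theory and practice in the moderately overparameterized regime?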
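Main findings
- With a smooth activation, gradient descent converges at a geometric rate to a global optimum that perfectly fits the training data, provided sqrt(kd) ≥ c (B^2/μ_φ^2) (1+δ) κ(X) n.
- For ReLU activations, a similar guarantee holds provided sqrt(kd) ≥ C (1+δ) n^2/d κ^3(X) σ_min^2(X*X).
- The corollaries show that, in random-data settings, the scaling kd ≳ n^2 typically suffices; when n ≲ d, the bound simplifies to k ≳ n and the convergence rate is independent of the dimension.
- Randomly initialized SGD also converges quickly to a near-global optimum while staying close to its initialization, at a rate comparable to GD under suitable parameters.
- Numerical experiments show a phase transition in the success probability near the boundary n=kd, suggesting that the overparameterization needed in practice may be close to this threshold.
- The work connects the kernel-like random-features intuition (k ≲ n) with deeper optimization guarantees in the moderately overparameterized regime.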
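To get a feel for the scalings in the findings above, the two width requirements can be turned into rough arithmetic. This is only a back-of-the-envelope sketch: the helper names are hypothetical, the unspecified constants are collapsed into a single assumed factor `c` (default 1), and the data-dependent terms (B, μ_φ, δ, κ(X), σ_min) are ignored entirely, so the numbers indicate orders of magnitude rather than actual guarantees.

```python
import math

def min_hidden_units_smooth(n, d, c=1.0):
    # Smooth case: sqrt(k*d) >= c*n  is equivalent to  k >= (c*n)^2 / d.
    return math.ceil((c * n) ** 2 / d)

def min_hidden_units_relu(n, d, c=1.0):
    # ReLU case: sqrt(k*d) >= c*n^2/d  is equivalent to  k >= (c*n^2/d)^2 / d.
    return math.ceil((c * n ** 2 / d) ** 2 / d)

# Hypothetical dataset sizes, chosen only to show how the two scalings differ.
for n, d in [(1000, 100), (1000, 1000)]:
    print(f"n={n}, d={d}: smooth k >= {min_hidden_units_smooth(n, d)}, "
          f"ReLU k >= {min_hidden_units_relu(n, d)}")
```

Under these assumptions the two requirements coincide when n and d are comparable, and the ReLU requirement grows faster than the smooth one once n exceeds d.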
This explainer was generated by AI and reviewed by a human editor.