QUICK REVIEW

[Paper Review] SGD Learns Over-parameterized Networks that Provably Generalize on Linearly Separable Data

Alon Brutzkus, Amir Globerson|arXiv (Cornell University)|Oct 27, 2017

Neural Networks and Applications | References: 22 | Cited by: 37

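One-Sentence Summary

The paper proves that stochastic gradient descent (SGD) generalizes well on linearly separable data when training an over-parameterized two-layer neural network with Leaky ReLU activations, despite the model's high capacity. It establishes convergence to a global minimum and generalization bounds that are independent of network size, indicating that the inductive bias of SGD prevents overfitting even in the over-parameterized regime.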

ABSTRACT

Neural networks exhibit good generalization behavior in the over-parameterized regime, where the number of network parameters exceeds the number of observations. Nonetheless, current generalization bounds for neural networks fail to explain this phenomenon. In an attempt to bridge this gap, we study the problem of learning a two-layer over-parameterized neural network, when the data is generated by a linearly separable function. In the case where the network has Leaky ReLU activations, we provide both optimization and generalization guarantees for over-parameterized networks. Specifically, we prove convergence rates of SGD to a global minimum and provide generalization guarantees for this global minimum that are independent of the network size. Therefore, our result clearly shows that the use of SGD for optimization both finds a global minimum, and avoids overfitting despite the high capacity of the model. This is the first theoretical demonstration that SGD can avoid overfitting, when learning over-specified neural network classifiers.

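Motivation and Objectives

Explain why SGD still generalizes well on over-parameterized neural networks despite their high capacity.
Bridge the gap between empirical success and theoretical understanding of generalization in the over-parameterized setting.
Provide provable generalization and optimization guarantees for over-parameterized networks trained with SGD.
Show that SGD avoids overfitting even when the network has enough capacity to memorize the data.
Analyze the inductive bias of SGD in the setting of linearly separable data and Leaky ReLU activations.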

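Proposed Method

The study analyzes an over-parameterized two-layer neural network with Leaky ReLU activations whose second-layer weights are fixed to v = (1,…,1,−1,…,−1).
SGD is used to minimize the empirical hinge loss over i.i.d. samples of linearly separable data.
The rate at which SGD converges to a global minimum is analyzed under mild assumptions on the data and the initialization.
Generalization bounds are established that are independent of network width, showing robustness to over-parameterization.
The theoretical argument relies on constructing local minima and analyzing, as a function of network width and initialization, the probability that SGD converges to a non-global versus a global minimum.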
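The setup described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function names, the synthetic data generator, the Leaky ReLU slope `alpha`, the learning rate, the initialization scale, and the epoch cap are all assumptions chosen for the example. Only the first-layer weights `W` are trained; the second layer `v` stays fixed at (1,…,1,−1,…,−1), and each SGD step uses a single sample of the hinge loss max(0, 1 − y·N(x)).

```python
import numpy as np

def leaky_relu(z, alpha=0.2):
    # Leaky ReLU: identity for positive inputs, slope alpha otherwise.
    return np.where(z > 0, z, alpha * z)

def make_separable_data(n, d, margin=1.0, seed=0):
    # Hypothetical generator: labels come from the sign of the first
    # coordinate, and each point is pushed away from the separating
    # hyperplane so that y * x[0] >= margin.
    rng = np.random.default_rng(seed)
    w_star = np.zeros(d)
    w_star[0] = 1.0
    X = rng.normal(size=(n, d))
    y = np.where(X @ w_star >= 0, 1.0, -1.0)
    X += margin * y[:, None] * w_star
    return X, y

def predict(W, v, X, alpha=0.2):
    # Sign of the network output N(x) = v . leaky_relu(W x) for each row of X.
    return np.sign(leaky_relu(X @ W.T, alpha) @ v)

def train_sgd(X, y, k=20, lr=0.05, alpha=0.2, max_epochs=500, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.normal(scale=1e-3, size=(2 * k, d))      # trained first layer
    v = np.concatenate([np.ones(k), -np.ones(k)])    # fixed second layer
    for _ in range(max_epochs):
        updates = 0
        for i in rng.permutation(n):
            pre = W @ X[i]
            if y[i] * (v @ leaky_relu(pre, alpha)) < 1:   # hinge loss is positive
                slope = np.where(pre > 0, 1.0, alpha)
                # Descent step: the hinge subgradient w.r.t. row j of W is
                # -y * v_j * slope_j * x, so W moves by +lr times its negative.
                W += lr * y[i] * np.outer(v * slope, X[i])
                updates += 1
        if updates == 0:   # zero hinge loss on every training point
            break
    return W, v

if __name__ == "__main__":
    X, y = make_separable_data(n=60, d=5, seed=1)
    W, v = train_sgd(X, y, k=20)
    print("training error:", np.mean(predict(W, v, X) != y))
```

The loop stops as soon as a full pass over the data triggers no update, which is the point at which the empirical hinge loss is zero. Varying `k` in this sketch is one way to probe informally how width affects training, but it is no substitute for the paper's analysis.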

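Experimental Results

Research Questions

RQ1: Can SGD avoid overfitting on linearly separable data when training over-parameterized neural networks?
RQ2: Does the SGD optimization process induce an inductive bias toward low-complexity solutions?
RQ3: Under what conditions does SGD converge to a global minimum rather than a poor local minimum?
RQ4: How does network width affect the probability of converging to a global versus a non-global minimum?
RQ5: Can generalization bounds that are independent of network size be derived in the over-parameterized setting?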

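Key Findings

For over-parameterized networks with Leaky ReLU activations, SGD converges to a global minimum on linearly separable data.
The generalization error bound is independent of network width, demonstrating robustness to over-parameterization.
When the network is sufficiently wide (k ≥ log₂(2d/δ)), SGD converges to a global minimum with high probability (≥ 1−δ).
When the network is narrow (k ≤ log₂(d/(−ln δ))), SGD may converge to a non-global minimum with high probability.
The loss function contains arbitrarily bad local minima, but SGD avoids them when the network is sufficiently wide, which illustrates its inductive bias.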
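To get a feel for the two width thresholds quoted above, the short sketch below evaluates them for a given input dimension d and failure probability δ. The formulas are taken as stated in this review, with the second one read as log₂(d/(−ln δ)); the function names are hypothetical, and the conditions under which the thresholds hold are not checked here.

```python
import math

def wide_enough_width(d, delta):
    # Smallest integer k satisfying k >= log2(2d / delta).
    return math.ceil(math.log2(2 * d / delta))

def narrow_width_limit(d, delta):
    # Largest integer k satisfying k <= log2(d / (-ln delta)).
    return math.floor(math.log2(d / (-math.log(delta))))

if __name__ == "__main__":
    d, delta = 100, 0.01
    print("wide regime starts at k =", wide_enough_width(d, delta))
    print("narrow regime ends at k =", narrow_width_limit(d, delta))
```

For d = 100 and δ = 0.01, log₂(20000) ≈ 14.3 and log₂(100/4.605) ≈ 4.4, so the wide regime starts at k = 15 and the narrow regime covers k ≤ 4. Both thresholds grow only logarithmically in d.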

This review was generated by AI and reviewed by a human editor.