QUICK REVIEW

[Paper Digest] Gradient Descent Learns One-hidden-layer CNN: Don't be Afraid of Spurious Local Minima

Simon S. Du, Jason D. Lee|arXiv (Cornell University)|Dec 3, 2017

Adversarial Robustness in Machine Learning | Cited by 101

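One-Sentence Summary

With Gaussian inputs, gradient descent with weight normalization can learn a one-hidden-layer CNN with non-overlapping patches despite the presence of a spurious local minimum; multiple random restarts can boost the success probability toward 1.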

ABSTRACT

We consider the problem of learning a one-hidden-layer neural network with non-overlapping convolutional layer and ReLU activation, i.e., $f(\mathbf{Z}, \mathbf{w}, \mathbf{a}) = \sum_j a_j\sigma(\mathbf{w}^T\mathbf{Z}_j)$, in which both the convolutional weights $\mathbf{w}$ and the output weights $\mathbf{a}$ are parameters to be learned. When the labels are the outputs from a teacher network of the same architecture with fixed weights $(\mathbf{w}^*, \mathbf{a}^*)$, we prove that with Gaussian input $\mathbf{Z}$, there is a spurious local minimizer. Surprisingly, in the presence of the spurious local minimizer, gradient descent with weight normalization from randomly initialized weights can still be proven to recover the true parameters with constant probability, which can be boosted to probability $1$ with multiple restarts. We also show that with constant probability, the same procedure could also converge to the spurious local minimum, showing that the local minimum plays a non-trivial role in the dynamics of gradient descent. Furthermore, a quantitative analysis shows that the gradient descent dynamics has two phases: it starts off slow, but converges much faster after several iterations.

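Motivation and Objectives

Understand the learning dynamics of a one-hidden-layer CNN with a non-overlapping convolutional layer.
Characterize the optimization landscape, including the existence of a spurious local minimum.
Prove that, with Gaussian inputs, randomly initialized gradient descent can recover the true parameters.
Give conditions that guarantee convergence and quantify the phases of convergence.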

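Proposed Method

Model the network as $f(\mathbf{Z}, \mathbf{w}, \mathbf{a}) = \sum_j a_j\sigma(\mathbf{w}^T\mathbf{Z}_j)$, where the patches $\mathbf{Z}_j$ are non-overlapping and $\sigma$ is the ReLU activation.
Reparameterize the first layer with weight normalization, $\mathbf{w} = \mathbf{v} / \|\mathbf{v}\|$, and analyze the loss $\ell(\mathbf{v}, \mathbf{a})$.
Derive expressions for the population loss and its gradients under Gaussian $\mathbf{Z}$ (Theorems 3.1 and 3.2).
Prove two-phase convergence of gradient descent under guarantees on the initialization (Theorems 4.1 and 4.2).
Prove that a spurious local minimum exists and that some initializations converge to it (Theorem 4.3).
Provide a random initialization scheme that achieves global convergence with constant probability per run, and discuss how restarts raise this to high probability.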
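
The setup in the list above can be sketched numerically. This is a minimal illustration, not the authors' code: it minimizes an empirical squared loss on sampled Gaussian patches rather than the population loss analyzed in the paper, and the helper names (`forward`, `loss_and_grads`, `sample_data`, `train`), the step size, and the step count are assumptions chosen for illustration.

```python
import math
import random


def relu(x):
    return x if x > 0.0 else 0.0


def normalize(v):
    """Weight normalization: w = v / ||v||."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def forward(Z, v, a):
    """f(Z, w, a) = sum_j a_j * relu(w^T Z_j), with w = v / ||v||.

    Z is a list of k non-overlapping patches, each a list of p floats.
    """
    w = normalize(v)
    return sum(a_j * relu(sum(wi * zi for wi, zi in zip(w, Z_j)))
               for a_j, Z_j in zip(a, Z))


def loss_and_grads(data, v, a):
    """Empirical loss 0.5 * mean((f - y)^2) and its gradients in v and a."""
    p, k, n = len(v), len(a), len(data)
    norm = math.sqrt(sum(x * x for x in v))
    w = [x / norm for x in v]
    loss, grad_w, grad_a = 0.0, [0.0] * p, [0.0] * k
    for Z, y in data:
        pre = [sum(wi * zi for wi, zi in zip(w, Z_j)) for Z_j in Z]
        err = sum(a_j * relu(s) for a_j, s in zip(a, pre)) - y
        loss += 0.5 * err * err / n
        for j in range(k):
            grad_a[j] += err * relu(pre[j]) / n
            if pre[j] > 0.0:  # ReLU derivative is 1 on the active side
                for i in range(p):
                    grad_w[i] += err * a[j] * Z[j][i] / n
    # Chain rule through w = v / ||v||: grad_v = (I - w w^T) grad_w / ||v||
    dot = sum(wi * gi for wi, gi in zip(w, grad_w))
    grad_v = [(gi - dot * wi) / norm for wi, gi in zip(w, grad_w)]
    return loss, grad_v, grad_a


def sample_data(n, v_star, a_star, rng):
    """Draw standard Gaussian patches and label them with the teacher network."""
    p, k = len(v_star), len(a_star)
    data = []
    for _ in range(n):
        Z = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(k)]
        data.append((Z, forward(Z, v_star, a_star)))
    return data


def train(data, v, a, lr=0.05, steps=100):
    """Plain gradient descent on (v, a); returns final parameters and loss."""
    for _ in range(steps):
        _, grad_v, grad_a = loss_and_grads(data, v, a)
        v = [vi - lr * gi for vi, gi in zip(v, grad_v)]
        a = [ai - lr * gi for ai, gi in zip(a, grad_a)]
    return v, a, loss_and_grads(data, v, a)[0]
```

Running `train` from several random `v` and `a` and comparing the final losses is one way to explore the behaviour: some runs may stall at a higher loss than others, which would be in line with the spurious local minimum described above, though a finite-sample experiment like this is only suggestive.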

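Experimental Results

Research Questions

RQ1: With Gaussian inputs, can randomly initialized gradient descent learn the true weights of a one-hidden-layer CNN?
RQ2: Does the objective have a spurious local minimum, and can gradient descent still reach the global minimum?
RQ3: How do initialization and the two-phase dynamics affect the convergence rate and the probability of success?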

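Key Findings

There are regions of initializations from which gradient descent converges to the teacher parameters, and random initialization succeeds with constant probability; multiple restarts boost this probability to 1.
A spurious local minimum exists, and under the same random initialization scheme gradient descent also converges to it with constant probability.
The optimization dynamics have two phases: a slow start, followed by a faster linear convergence phase once enough progress has been made.
The analysis gives explicit forms of the population loss and its gradients, which depend on the angle between the learned weights and the true weights and on $\mathbf{a}^T\mathbf{a}^*$.
With Gaussian inputs, the results imply a polynomial-time convergence guarantee for randomly initialized gradient descent, given a suitable number of restarts.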
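
The effect of restarts in the findings above comes down to simple arithmetic: if each run succeeds independently with probability p, all m runs fail with probability (1 - p)^m, which shrinks toward 0 as m grows. The sketch below assumes independent restarts; the function names and the value 0.5 are placeholders for illustration, not constants from the paper.

```python
import math


def failure_probability(p_success, restarts):
    """Chance that all independent restarts fail: (1 - p)^m."""
    return (1.0 - p_success) ** restarts


def restarts_needed(p_success, delta):
    """Smallest m >= 1 with (1 - p_success)^m <= delta."""
    if not (0.0 < p_success < 1.0 and 0.0 < delta < 1.0):
        raise ValueError("p_success and delta must lie strictly in (0, 1)")
    m = max(1, math.ceil(math.log(delta) / math.log(1.0 - p_success)))
    # Guard against floating-point rounding in the logarithm ratio.
    while m > 1 and failure_probability(p_success, m - 1) <= delta:
        m -= 1
    while failure_probability(p_success, m) > delta:
        m += 1
    return m


# Hypothetical numbers: a per-run success probability of 0.5 is a placeholder.
print(restarts_needed(0.5, 0.01))  # prints 7
```

Because the required number of restarts grows only logarithmically in 1/delta, a constant per-run success probability is enough for a small failure probability after a modest number of runs.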


This digest was generated by AI and reviewed by human editors.