QUICK REVIEW

[Paper Review] Deep Learning without Poor Local Minima

Kenji Kawaguchi|arXiv (Cornell University)|May 23, 2016

Sparse and Compressive Sensing Techniques | References: 12 | Cited by: 23

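One-Sentence Summary

The paper proves that, for deep linear neural networks of any depth and width under the squared loss, every local minimum is a global minimum and every other critical point is a saddle point, which settles a conjecture published in 1989. Under an independence assumption adopted from prior work, it extends these results to deep nonlinear networks, suggesting that the absence of poor local minima makes training deep models theoretically tractable.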

ABSTRACT

In this paper, we prove a conjecture published in 1989 and also partially address an open problem announced at the Conference on Learning Theory (COLT) 2015. With no unrealistic assumption, we first prove the following statements for the squared loss function of deep linear neural networks with any depth and any widths: 1) the function is non-convex and non-concave, 2) every local minimum is a global minimum, 3) every critical point that is not a global minimum is a saddle point, and 4) there exist "bad" saddle points (where the Hessian has no negative eigenvalue) for the deeper networks (with more than three layers), whereas there is no bad saddle point for the shallow networks (with three layers). Moreover, for deep nonlinear neural networks, we prove the same four statements via a reduction to a deep linear model under the independence assumption adopted from recent work. As a result, we present an instance, for which we can answer the following question: how difficult is it to directly train a deep model in theory? It is more difficult than the classical machine learning models (because of the non-convexity), but not too difficult (because of the nonexistence of poor local minima). Furthermore, the mathematically proven existence of bad saddle points for deeper models would suggest a possible open problem. We note that even though we have advanced the theoretical foundations of deep learning and non-convex optimization, there is still a gap between theory and practice.

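Motivation and Objectives

Prove the conjecture, published in 1989, about the optimization landscape of deep linear neural networks.
Partially address the open problem announced at COLT 2015 on the absence of poor local minima in deep nonlinear networks.
Establish that in deep linear networks every local minimum is a global minimum and every critical point that is not a global minimum is a saddle point.
Extend these findings to deep nonlinear networks by reducing them to the linear case under an independence assumption.
Show that deep learning optimization remains theoretically tractable despite non-convexity.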

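Proposed Method

Analyze the squared loss function of deep linear neural networks with arbitrary depth and widths.
Characterize the loss landscape through matrix factorization and critical-point analysis, focusing on the Hessian and its eigenvalue structure.
Apply the independence assumption from prior work to reduce a nonlinear network to an equivalent linear model for theoretical analysis.
Use Lemmas 4.1 and 4.2 to derive exact expressions for the critical points, which involve the data matrix $\Sigma = YX^T(XX^T)^{-1}XY^T$.
Show that, because of parameterization effects, deep and shallow networks differ in their critical-point structure, which refutes the earlier intuitive "collapse" argument.
Show that bad saddle points (where the Hessian has no negative eigenvalue) exist only in deeper networks (more than three layers) and not in shallow networks (three layers).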
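To make the role of $\Sigma$ concrete, the sketch below computes it for random data and checks a standard reduced-rank regression fact: when the narrowest layer has width p, the smallest achievable squared loss is half of tr(YY^T) minus the sum of the p largest eigenvalues of $\Sigma$. The dimensions, the random data, and the names `U_p`, `W1`, and `W2` are illustrative assumptions, not values or notation taken from the paper, and the factorization shown is only one of many that attain the optimum.

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_y, p, m = 5, 4, 2, 50      # input dim, output dim, narrowest width, samples (all assumed)
X = rng.standard_normal((d_x, m))
Y = rng.standard_normal((d_y, m))

# Data matrix that appears in the critical-point expressions.
XXt_inv = np.linalg.inv(X @ X.T)
Sigma = Y @ X.T @ XXt_inv @ X @ Y.T

# Eigenvectors of Sigma, sorted by decreasing eigenvalue.
eigvals, eigvecs = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]
eigvals, U = eigvals[order], eigvecs[:, order]
U_p = U[:, :p]

# One factorization that attains the rank-p optimum (two weight matrices).
W1 = U_p.T @ Y @ X.T @ XXt_inv    # shape (p, d_x)
W2 = U_p                          # shape (d_y, p)

loss_at_solution = 0.5 * np.sum((W2 @ W1 @ X - Y) ** 2)
loss_from_formula = 0.5 * (np.trace(Y @ Y.T) - eigvals[:p].sum())
print(loss_at_solution, loss_from_formula)
```

The two printed values should agree up to floating-point error, which is why the eigenvalue structure of $\Sigma$ is enough to describe the global minimum value.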

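Results

Research Questions

RQ1: Are all local minima of the squared loss of a deep linear network also global minima?
RQ2: Does the Hessian at a non-global critical point of a deep linear network have a negative eigenvalue, or is the point a "bad" saddle point?
RQ3: Under the independence assumption, can the optimization landscape of deep nonlinear networks be analyzed through a reduction to deep linear networks?
RQ4: What role does network depth play in determining whether "bad" saddle points (no negative eigenvalue) exist on the loss surface?
RQ5: Why do earlier intuitive arguments based on model expressiveness fail to preserve the critical-point structure across different depths?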

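Key Findings

For deep linear networks of any depth and width, every local minimum is a global minimum, even though the loss function is non-convex and non-concave.
Every critical point of a deep linear network that is not a global minimum is a saddle point. In shallow networks (three layers), the Hessian at every saddle point has at least one negative eigenvalue.
Bad saddle points (where the Hessian has no negative eigenvalue) exist only in deeper networks (more than three layers), which indicates that optimization difficulty varies with depth.
The proof shows that earlier intuitive reasoning based on model expressiveness and rank equivalence fails because different parameterizations produce different critical-point structures.
For deep nonlinear networks under the independence assumption, the optimization landscape has the same favorable properties as in the linear case (no poor local minima; every other critical point is a saddle point).
The theoretical results indicate that training deep models is harder than training classical machine learning models because of non-convexity, but not too difficult because poor local minima do not exist; bad saddle points in deeper models may still pose a challenge.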
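The depth dependence of bad saddle points can be illustrated with the all-zero weight configuration, which is a critical point of the squared loss whenever the network has at least two weight matrices. The sketch below estimates the Hessian there by central finite differences for a three-layer network (two weight matrices) and a four-layer network (three weight matrices). The 2x2 layer shapes, the random data, the step size `h`, and the helper names are assumptions chosen for illustration; this is a numerical check on one example, not the paper's proof.

```python
import numpy as np

def loss(weights, X, Y):
    """Squared loss 0.5 * ||W_k ... W_1 X - Y||_F^2 of a deep linear network."""
    out = X
    for W in weights:  # weights[0] acts on the input first
        out = W @ out
    return 0.5 * np.sum((out - Y) ** 2)

def hessian_at_origin(shapes, X, Y, h=1e-2):
    """Central finite-difference Hessian of the loss at all-zero weights."""
    sizes = [rows * cols for rows, cols in shapes]
    n = sum(sizes)
    splits = np.cumsum(sizes)[:-1].tolist()

    def f(theta):
        parts = np.split(theta, splits)
        return loss([part.reshape(s) for part, s in zip(parts, shapes)], X, Y)

    step = h * np.eye(n)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            H[i, j] = (f(step[i] + step[j]) - f(step[i] - step[j])
                       - f(step[j] - step[i]) + f(-step[i] - step[j])) / (4 * h * h)
    return 0.5 * (H + H.T)  # symmetrize away rounding noise

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 20))
Y = rng.standard_normal((2, 20))

# Three layers = two weight matrices; four layers = three weight matrices.
min_eig_shallow = np.linalg.eigvalsh(hessian_at_origin([(2, 2)] * 2, X, Y)).min()
min_eig_deep = np.linalg.eigvalsh(hessian_at_origin([(2, 2)] * 3, X, Y)).min()

# The origin is not a global minimum: plain least squares reaches a lower loss.
W_ls = Y @ X.T @ np.linalg.inv(X @ X.T)
loss_origin = 0.5 * np.sum(Y ** 2)
loss_ls = 0.5 * np.sum((W_ls @ X - Y) ** 2)
print(min_eig_shallow, min_eig_deep, loss_origin, loss_ls)
```

In the three-layer case the smallest Hessian eigenvalue at the origin should come out clearly negative, so second-order information reveals an escape direction. In the four-layer case the Hessian at the origin vanishes because the loss changes only at third order in the weights, which makes the origin a bad saddle point even though its loss is above the least-squares value.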


This review was generated by AI and checked by human editors.