QUICK REVIEW

[Paper Review] Empirical Analysis of the Hessian of Over-Parametrized Neural Networks

Levent Sagun, Utku Evci|arXiv (Cornell University)|Jun 14, 2017

Stochastic Gradient Optimization Techniques|References: 28|Cited by: 168

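One-Sentence Summary

This paper studies the Hessian spectrum of over-parametrized neural networks and finds that most eigenvalues are close to zero while a few outliers depend on the data; it links these features to over-parametrization, flatness, and basins of attraction in high-dimensional non-convex optimization.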

ABSTRACT

We study the properties of common loss surfaces through their Hessian matrix. In particular, in the context of deep learning, we empirically show that the spectrum of the Hessian is composed of two parts: (1) the bulk centered near zero, (2) and outliers away from the bulk. We present numerical evidence and mathematical justifications to the following conjectures laid out by Sagun et al. (2016): Fixing data, increasing the number of parameters merely scales the bulk of the spectrum; fixing the dimension and changing the data (for instance adding more clusters or making the data less separable) only affects the outliers. We believe that our observations have striking implications for non-convex optimization in high dimensions. First, the flatness of such landscapes (which can be measured by the singularity of the Hessian) implies that classical notions of basins of attraction may be quite misleading. And that the discussion of wide/narrow basins may be in need of a new perspective around over-parametrization and redundancy that are able to create large connected components at the bottom of the landscape. Second, the dependence of small number of large eigenvalues to the data distribution can be linked to the spectrum of the covariance matrix of gradients of model outputs. With this in mind, we may reevaluate the connections within the data-architecture-algorithm framework of a model, hoping that it would shed light into the geometry of high-dimensional and non-convex spaces in modern applications. In particular, we present a case that links the two observations: small and large batch gradient descent appear to converge to different basins of attraction but we show that they are in fact connected through their flat region and so belong to the same basin.

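Research Motivation and Goals

Motivate and understand the geometry of deep learning loss surfaces through second-order (Hessian) analysis.
Characterize the spectrum of the Hessian and its decomposition into interpretable components.
Study how data complexity, model size, and the optimization algorithm affect the eigenvalues of the Hessian.
Draw implications for basins of attraction, flatness, and generalization in high-dimensional non-convex optimization.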

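Proposed Method

Compute the exact Hessian via Hessian-vector products, both at random initial points and after training.
Use the generalized Gauss-Newton decomposition to write the Hessian as the sum of a covariance-like term and a second term that contains second derivatives (Equation 4).
Show that near a local minimum the Hessian is dominated by a term of rank at most N, which implies a large number of near-zero eigenvalues (Equation 5).
Probe data complexity by building multi-cluster Gaussian datasets, training with SGD, and counting the outlier eigenvalues, whose number reflects the number of classes.
Study over-parametrization by growing the network with the data fixed and observing how the large eigenvalues change (or fail to change).
Compare the effect of batch size on the Hessian spectrum by training with small and large batches and analyzing the outlier eigenvalues.
Examine the negative eigenvalues at the bottom of the spectrum and how they scale with model size.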
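The first three steps above can be sketched on a toy problem. The snippet below is a minimal illustration under stated assumptions, not the paper's code: it assumes a scalar-output model f(x) = sum_j a_j * tanh(b_j * x), a mean squared loss, and central finite differences of the gradient as a stand-in for automatic differentiation. The names `model_and_grad`, `loss_grad`, and `hvp` are hypothetical. Labels are generated by the model itself so that the chosen point has zero loss; at such a point the second-derivative term vanishes and the Hessian should coincide with the covariance-like Gauss-Newton term, whose rank is bounded by the number of samples.

```python
import numpy as np

def model_and_grad(w, x):
    """Toy model f(x) = sum_j a_j * tanh(b_j * x); returns outputs and Jacobian df/dw."""
    h = w.size // 2
    a, b = w[:h], w[h:]
    t = np.tanh(np.outer(x, b))              # shape (N, H)
    f = t @ a                                # shape (N,)
    grad_a = t                               # df/da_j = tanh(b_j * x)
    grad_b = (1.0 - t**2) * a * x[:, None]   # df/db_j = a_j * (1 - tanh^2) * x
    return f, np.hstack([grad_a, grad_b])    # Jacobian, shape (N, 2H)

def loss_grad(w, x, y):
    """Gradient of L(w) = 1/(2N) * sum_i (f_i - y_i)^2."""
    f, jac = model_and_grad(w, x)
    return jac.T @ (f - y) / x.size

def hvp(w, v, x, y, eps=1e-5):
    """Hessian-vector product via central differences of the gradient (an assumption;
    an autodiff framework would normally be used instead)."""
    return (loss_grad(w + eps * v, x, y) - loss_grad(w - eps * v, x, y)) / (2 * eps)

rng = np.random.default_rng(0)
n_samples, hidden = 5, 20                    # 40 parameters, 5 samples
x = rng.normal(size=n_samples)
w = rng.normal(size=2 * hidden)
y, jac = model_and_grad(w, x)                # labels from the model: w has zero loss

# Assemble the full Hessian one column at a time from Hessian-vector products.
dim = w.size
hessian = np.column_stack([hvp(w, e, x, y) for e in np.eye(dim)])
hessian = 0.5 * (hessian + hessian.T)        # symmetrize away finite-difference noise

gauss_newton = jac.T @ jac / n_samples       # covariance-like term, rank <= N
eigvals = np.linalg.eigvalsh(hessian)
n_large = int(np.sum(np.abs(eigvals) > 1e-5))
print("parameters:", dim, "samples:", n_samples, "non-negligible eigenvalues:", n_large)
print("max |H - GN|:", np.max(np.abs(hessian - gauss_newton)))
```

Building the full matrix column by column is only practical for tiny models like this one; the threshold `1e-5` for "non-negligible" is an arbitrary choice that sits above the finite-difference error.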

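Experimental Results

Research Questions

RQ1: How does the Hessian spectrum of over-parametrized neural networks decompose into a bulk and outliers, and what governs each part?
RQ2: How do data complexity, model size, and the optimization algorithm affect the large eigenvalues and the overall geometry of the Hessian?
RQ3: Do small-batch and large-batch optimizers reach different basins, or do their solutions lie in the same flat region?
RQ4: What role does the generalized Gauss-Newton decomposition play in understanding the spectrum and flatness of the loss landscape?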

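Key Findings

The Hessian spectrum splits into a bulk concentrated near zero and a small number of outliers away from the bulk.
With the data fixed, increasing model size does not change the number of large eigenvalues, supporting the view that the bulk scales with model size while the outliers depend on the data.
More complex data (more clusters) increases the number of outliers, which in some experiments roughly matches the number of classes.
Large-batch methods tend to yield larger outlier eigenvalues than small-batch methods, indicating different local curvature along some directions.
Negative eigenvalues exist at the bottom of the spectrum but are far smaller in magnitude than the positive outliers, suggesting that some negative curvature remains even near the end of training.
The solutions found by large-batch and small-batch methods can lie in the same basin in a broad sense, connected through a flat region, which challenges the notion of separate basins.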
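The last finding, that distinct solutions can be joined through a flat region, has a simple analogue in over-parametrized linear least squares. The sketch below is a toy illustration only and does not reproduce the paper's small-batch versus large-batch experiment: the model is linear and the loss is convex, so it shows just how redundant parameters create connected sets of zero-loss points. The problem sizes, the shift of 3.0, and the name `loss` are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples, n_params = 10, 50                 # more parameters than samples
X = rng.normal(size=(n_samples, n_params))
y = rng.normal(size=n_samples)

def loss(w):
    """Mean squared error of the linear model X @ w."""
    return 0.5 * np.mean((X @ w - y) ** 2)

# Two different zero-loss solutions: the minimum-norm one, and one shifted
# along a direction that X maps to zero (a null-space direction).
w_a = np.linalg.pinv(X) @ y
null_basis = np.linalg.svd(X)[2][n_samples:]  # rows span the null space of X
w_b = w_a + 3.0 * null_basis[0]

# Evaluate the loss along the straight line between the two solutions.
path = [loss((1 - t) * w_a + t * w_b) for t in np.linspace(0.0, 1.0, 11)]
print("distance between solutions:", np.linalg.norm(w_b - w_a))
print("max loss on the path:", max(path))
```

In this toy setting the Hessian is X.T @ X / n_samples, which has at least n_params - n_samples zero eigenvalues, so the two solutions are far apart in parameter space yet sit on the same flat set.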


This review was generated by AI and reviewed by a human editor.