QUICK REVIEW

[Paper Review] Empirical Analysis of the Hessian of Over-Parametrized Neural Networks

Levent Sagun, Utku Evci|arXiv (Cornell University)|Jun 14, 2017

Stochastic Gradient Optimization Techniques | References: 28 | Cited by: 168

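One-Sentence Summary

tldr: Studies the Hessian spectrum of over-parameterized neural networks and shows a bulk of near-zero eigenvalues accompanied by a small set of data-dependent outliers. It links these features to over-parameterization, flatness, and basins of attraction in high-dimensional non-convex optimization.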

ABSTRACT

We study the properties of common loss surfaces through their Hessian matrix. In particular, in the context of deep learning, we empirically show that the spectrum of the Hessian is composed of two parts: (1) the bulk centered near zero, (2) and outliers away from the bulk. We present numerical evidence and mathematical justifications to the following conjectures laid out by Sagun et al. (2016): Fixing data, increasing the number of parameters merely scales the bulk of the spectrum; fixing the dimension and changing the data (for instance adding more clusters or making the data less separable) only affects the outliers. We believe that our observations have striking implications for non-convex optimization in high dimensions. First, the flatness of such landscapes (which can be measured by the singularity of the Hessian) implies that classical notions of basins of attraction may be quite misleading. And that the discussion of wide/narrow basins may be in need of a new perspective around over-parametrization and redundancy that are able to create large connected components at the bottom of the landscape. Second, the dependence of small number of large eigenvalues to the data distribution can be linked to the spectrum of the covariance matrix of gradients of model outputs. With this in mind, we may reevaluate the connections within the data-architecture-algorithm framework of a model, hoping that it would shed light into the geometry of high-dimensional and non-convex spaces in modern applications. In particular, we present a case that links the two observations: small and large batch gradient descent appear to converge to different basins of attraction but we show that they are in fact connected through their flat region and so belong to the same basin.

Motivation and Objectives

Understand the geometry of deep learning loss surfaces through second-order (Hessian) analysis.
Characterize the spectrum of the Hessian and its decomposition into interpretable components.
Investigate how data complexity, model size, and optimization algorithm affect Hessian eigenvalues.
Provide implications for basins of attraction, flatness, and generalization in high-dimensional non-convex optimization.

Proposed Method

Compute the exact Hessian via Hessian-vector products at random initial points and after training.
Use the generalized Gauss-Newton decomposition to express the Hessian as a sum of a covariance-like term and a second term involving second derivatives (Equation 4).
Show that, near local minima, the Hessian is dominated by a term of rank at most N, where N is the number of training samples, so a network with more than N parameters has many near-zero eigenvalues (Equation 5).
Vary data complexity by creating multi-cluster Gaussian datasets and training with SGD, then observe how the number of outlier eigenvalues tracks the number of clusters.
Investigate the effect of over-parameterization by increasing network size while keeping data fixed and observing changes (or lack thereof) in the large-eigenvalue spectrum.
Compare the impact of optimization batch size on Hessian spectra by training with small vs large batches and analyzing outlier eigenvalues.
Examine negative eigenvalues at the bottom of the spectrum and their scaling with model size.
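
The first and third steps above can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not the paper's code: it uses a linear model with squared loss instead of a neural network, and it approximates Hessian-vector products with a central difference of gradients, whereas a deep learning framework would normally use automatic differentiation. The names grad, hvp, full_hessian, X, Y and w are hypothetical.

```python
# Toy setting (assumption): linear model, squared loss, N = 2 samples, 4 parameters.

def grad(w, X, Y):
    """Gradient of L(w) = 1/(2N) * sum_i (w . x_i - y_i)^2."""
    n = len(X)
    g = [0.0] * len(w)
    for x, y in zip(X, Y):
        residual = sum(wj * xj for wj, xj in zip(w, x)) - y
        for j, xj in enumerate(x):
            g[j] += residual * xj / n
    return g

def hvp(w, v, X, Y, eps=1e-3):
    """Approximate H v by a central difference of gradients along v."""
    w_plus = [wj + eps * vj for wj, vj in zip(w, v)]
    w_minus = [wj - eps * vj for wj, vj in zip(w, v)]
    g_plus, g_minus = grad(w_plus, X, Y), grad(w_minus, X, Y)
    return [(a - b) / (2 * eps) for a, b in zip(g_plus, g_minus)]

def full_hessian(w, X, Y):
    """Assemble the Hessian from products with the standard basis vectors."""
    d = len(w)
    cols = []
    for j in range(d):
        e_j = [1.0 if k == j else 0.0 for k in range(d)]
        cols.append(hvp(w, e_j, X, Y))  # cols[j] is H e_j, i.e. column j
    return [[cols[j][i] for j in range(d)] for i in range(d)]

X = [[1.0, 0.0, 1.0, 0.0],
     [0.0, 1.0, 0.0, 1.0]]
Y = [1.0, -1.0]
w = [0.3, -0.2, 0.1, 0.4]

H = full_hessian(w, X, Y)

# For this model the Hessian is (1/N) * sum_i x_i x_i^T, so its rank is at most N = 2.
# Any direction orthogonal to every sample therefore has zero curvature.
flat_direction = [1.0, 0.0, -1.0, 0.0]
curvature = hvp(w, flat_direction, X, Y)
```

With four parameters and two samples, at least two eigenvalues of H are zero, which is the toy counterpart of the near-zero bulk. For a real network the full spectrum would come from an eigenvalue routine applied to the assembled matrix.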

Experimental Results

Research Questions

RQ1: How does the Hessian spectrum of over-parameterized neural networks decompose into bulk and outliers, and what governs each part?
RQ2: How do data complexity, model size, and optimization algorithm influence the large eigenvalues and overall Hessian geometry?
RQ3: Are small-batch and large-batch optimizers associated with different basins, or do they lie in the same flat region?
RQ4: What role does the generalized Gauss-Newton decomposition play in understanding the spectrum and flatness of the loss landscape?
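
For reference, the decomposition asked about in RQ4 has the following general form. This is a sketch in generic notation and may not match the paper's Equations 4 and 5 symbol for symbol. It assumes a loss L(w) averaged over N training samples, a scalar model output f_i(w) for sample i, and a per-sample loss \ell that is twice differentiable in that output.

```latex
\begin{aligned}
L(w) &= \frac{1}{N} \sum_{i=1}^{N} \ell\bigl(f_i(w)\bigr), \\
\nabla^2 L(w) &= \frac{1}{N} \sum_{i=1}^{N} \ell''\bigl(f_i(w)\bigr)\, \nabla f_i(w)\, \nabla f_i(w)^{\top}
  + \frac{1}{N} \sum_{i=1}^{N} \ell'\bigl(f_i(w)\bigr)\, \nabla^2 f_i(w).
\end{aligned}
```

Where \ell'(f_i(w)) is close to zero for every sample, as at a solution that fits the training data, the second sum drops out and the Hessian reduces to a sum of N rank-one matrices, so its rank is at most N. With more parameters than samples, the remaining eigenvalues are zero.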

Key Findings

The Hessian spectrum splits into a bulk near zero and a few outliers located away from the bulk.
Increasing model size with fixed data does not change the number of large eigenvalues, supporting the idea that the bulk scales while outliers are data-driven.
More complex data (more clusters) increases the number of outliers, approximately matching the number of classes in some experiments.
Solutions found by large-batch training tend to have larger outlier eigenvalues than those found by small-batch training, indicating sharper curvature along a few directions.
Negative eigenvalues exist at the bottom of the spectrum but are much smaller in magnitude than the positive outliers, suggesting that directions of slight negative curvature remain even near the end of training.
Two solutions found by large-batch and small-batch methods can lie in the same broad basin, connected by flat regions, challenging the notion of isolated basins.
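
The last finding can be illustrated with a toy check: evaluate the loss along the straight line between two solutions and see whether it stays flat. The sketch below is an assumption-laden stand-in, not the paper's experiment. It uses an over-parameterized linear model with one sample and two parameters, and the names loss, interpolate, w_a and w_b are hypothetical.

```python
def loss(w, X, Y):
    """Squared loss L(w) = 1/(2N) * sum_i (w . x_i - y_i)^2."""
    n = len(X)
    total = 0.0
    for x, y in zip(X, Y):
        prediction = sum(wj * xj for wj, xj in zip(w, x))
        total += (prediction - y) ** 2
    return total / (2 * n)

def interpolate(w_a, w_b, t):
    """Point at fraction t on the line segment from w_a to w_b."""
    return [(1 - t) * a + t * b for a, b in zip(w_a, w_b)]

# One sample, two parameters: every w with w[0] + w[1] == 2 fits the data exactly.
X = [[1.0, 1.0]]
Y = [2.0]
w_a = [2.0, 0.0]  # one zero-loss solution
w_b = [0.0, 2.0]  # a different zero-loss solution

# Loss at 11 evenly spaced points on the segment between the two solutions.
path_losses = [loss(interpolate(w_a, w_b, k / 10), X, Y) for k in range(11)]

# A point off the solution set, for contrast.
off_path_loss = loss([2.0, 2.0], X, Y)
```

Here the loss is zero (up to rounding) along the whole segment, so the two solutions sit in one connected flat region instead of two separate basins. For a real network the same probe would need trained weights from both batch sizes, and a straight line is only one of many possible paths.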


This review was generated by AI and checked by a human editor.