QUICK REVIEW

[Paper Review] Towards Understanding Generalization of Deep Learning: Perspective of Loss Landscapes

Lei Wu, Zhanxing Zhu | arXiv (Cornell University) | Jun 30, 2017

Stochastic Gradient Optimization Techniques | References: 19 | Cited by: 126

One-Sentence Summary

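The paper argues that generalization in deep learning stems mainly from the geometry of the loss landscape: good minima occupy larger basins of attraction, so random initialization tends to land in them. It provides theory for 2-layer networks and extensive empirical evidence for deeper networks.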

ABSTRACT

It is widely observed that deep learning models with learned parameters generalize well, even with many more model parameters than the number of training samples. We systematically investigate the underlying reasons why deep neural networks often generalize well, and reveal the difference between the minima (with the same training error) that generalize well and those that don't. We show that it is the characteristics of the landscape of the loss function that explain the good generalization capability. For the landscape of loss function for deep networks, the volume of basin of attraction of good minima dominates over that of poor minima, which guarantees optimization methods with random initialization to converge to good minima. We theoretically justify our findings through analyzing 2-layer neural networks; and show that the low-complexity solutions have a small norm of Hessian matrix with respect to model parameters. For deeper networks, extensive numerical evidence helps to support our arguments.

Research Motivation and Objectives

Explain why deep neural networks generalize well despite over-parameterization.
Differentiate good minima from poor minima with the same training error.
Explain why optimization from random initialization tends to find good minima.
Tie empirical observations to theoretical landscape properties of loss functions.

Proposed Method

Analyze loss landscapes using the basin-of-attraction concept from dynamical systems.
Develop a Hessian-based metric to quantify basin volume and solution complexity.
Theoretically analyze 2-layer networks to relate low-complexity solutions to small Hessian norms.
Provide empirical evidence on deeper nets via Hessian spectra and approximate Hessian norms.
Introduce an attack data setup to generate bad minima with the same training error but poor generalization.
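
The Hessian-based metric in the list above can be illustrated with a minimal sketch. This is not the paper's estimator: it assumes a tiny 2-layer tanh network with scalar input, a mean-squared-error loss, and a central finite-difference Hessian summarized by its Frobenius norm. The names `hessian_fd`, `frobenius_norm`, and `two_layer_loss`, the hidden width, and the toy data are hypothetical choices made only for illustration.

```python
import math


def hessian_fd(f, theta, eps=1e-4):
    """Central finite-difference Hessian of a scalar function f at theta."""
    n = len(theta)
    H = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            def shifted(di, dj):
                p = list(theta)
                p[i] += di
                p[j] += dj
                return f(p)
            h = (shifted(eps, eps) - shifted(eps, -eps)
                 - shifted(-eps, eps) + shifted(-eps, -eps)) / (4 * eps * eps)
            H[i][j] = H[j][i] = h  # the Hessian is symmetric
    return H


def frobenius_norm(H):
    """Square root of the sum of squared entries."""
    return math.sqrt(sum(v * v for row in H for v in row))


def two_layer_loss(theta, data, m=3):
    """MSE of f(x) = sum_k a_k * tanh(w_k * x + b_k); theta = w + b + a (assumed layout)."""
    w, b, a = theta[:m], theta[m:2 * m], theta[2 * m:3 * m]
    total = 0.0
    for x, y in data:
        pred = sum(a[k] * math.tanh(w[k] * x + b[k]) for k in range(m))
        total += (pred - y) ** 2
    return total / (2 * len(data))


if __name__ == "__main__":
    # Toy regression data and arbitrary parameters, both assumptions.
    data = [(x / 2.0, math.sin(x / 2.0)) for x in range(-2, 3)]
    theta = [0.5, -0.3, 0.8, 0.1, 0.0, -0.2, 0.7, 0.4, -0.6]
    H = hessian_fd(lambda t: two_layer_loss(t, data), theta)
    print("Hessian Frobenius norm:", frobenius_norm(H))
```

Finite differences need O(n^2) loss evaluations, so this only suits very small models; for deeper networks one would likely switch to automatic differentiation or a stochastic norm estimate.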

Experimental Results

Research Questions

RQ1: What properties distinguish good (well-generalizing) minima from bad minima with the same training error?
RQ2: Why do optimization methods with random initialization almost surely converge to good minima in deep networks?
RQ3: How does the loss landscape's geometry influence the prevalence of good basins over bad basins?
RQ4: To what extent do initialization, optimization dynamics, and landscape structure contribute to generalization?

Key Findings

Good minima occupy large basins of attraction; the volume of these basins dominates over bad minima.
Random initialization places parameters in the good basin with overwhelming probability, leading to convergence to well-generalizing solutions.
Low-complexity solutions in 2-layer networks have small Hessian norms, indicating flat regions with large basins.
SGD alone is not the sole cause of good generalization; landscape structure largely governs outcomes.
Empirical Hessian spectrum analyses show good minima lie in wide valleys with many near-zero eigenvalues, while bad minima have larger eigenvalues and tighter valleys.
Spectral estimates of the Hessian correlate with generalization performance in experiments on small and large networks.
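
The basin-volume argument in the findings above can be made concrete with a toy experiment. The sketch below is not taken from the paper: it assumes a hypothetical 1-D loss built as the pointwise minimum of three parabolas, all reaching loss 0, with two flat minima (curvature 1) at -3 and 3 and one sharp minimum (curvature 25) at 0. Under this construction the sharp parabola is the lowest only for |x| < 0.5, so its basin covers 1/12 of the initialization interval [-6, 6]. Plain gradient descent from uniform random starts should reproduce roughly that fraction.

```python
import random

# Hypothetical toy landscape: (center, curvature) of each parabola.
# All three minima have the same loss value (0), as in the
# "same training error" comparison, but different sharpness.
MINIMA = [(-3.0, 1.0), (0.0, 25.0), (3.0, 1.0)]


def active(x):
    """Return the (center, curvature) of the parabola that defines the loss at x."""
    return min(MINIMA, key=lambda m: 0.5 * m[1] * (x - m[0]) ** 2)


def grad(x):
    """Gradient of the piecewise-quadratic loss at x."""
    center, curvature = active(x)
    return curvature * (x - center)


def descend(x, lr=0.02, steps=300):
    """Plain gradient descent; lr is small enough to be stable at the sharp minimum."""
    for _ in range(steps):
        x -= lr * grad(x)
    return x


def basin_fractions(n_samples=2000, lo=-6.0, hi=6.0, seed=0):
    """Estimate, for each minimum, the fraction of uniform random starts that reach it."""
    rng = random.Random(seed)
    counts = {center: 0 for center, _ in MINIMA}
    for _ in range(n_samples):
        x_final = descend(rng.uniform(lo, hi))
        center, _ = active(x_final)
        counts[center] += 1
    return {center: n / n_samples for center, n in counts.items()}


if __name__ == "__main__":
    for center, frac in basin_fractions().items():
        print(f"minimum at {center:+.1f}: fraction of starts = {frac:.3f}")
```

In this toy the basin boundary sits where the two parabolas cross, which moves toward the sharp minimum as its curvature grows. That is one simple way to see why larger Hessian eigenvalues go with smaller basins; real networks are high-dimensional and need not behave like this 1-D construction.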


This review was generated by AI and reviewed by human editors.