QUICK REVIEW

[Paper Review] PyHessian: Neural Networks Through the Lens of the Hessian

Zhewei Yao, Amir Gholami | arXiv (Cornell University) | Dec 16, 2019
Stochastic Gradient Optimization Techniques | 54 references | 33 citations
TL;DR

PyHessian is a scalable, open-source framework to compute Hessian-based statistics (top eigenvalues, trace, and spectral density) for deep nets, enabling analysis of loss-landscape topology and architectural effects such as batch normalization and residual connections. The paper uses this tool to reveal nuanced, sometimes counterintuitive, impacts of BN and residuals on trainability across CIFAR-10/100.

ABSTRACT

We present PYHESSIAN, a new scalable framework that enables fast computation of Hessian (i.e., second-order derivative) information for deep neural networks. PYHESSIAN enables fast computations of the top Hessian eigenvalues, the Hessian trace, and the full Hessian eigenvalue/spectral density, and it supports distributed-memory execution on cloud/supercomputer systems and is available as open source. This general framework can be used to analyze neural network models, including the topology of the loss landscape (i.e., curvature information) to gain insight into the behavior of different models/optimizers. To illustrate this, we analyze the effect of residual connections and Batch Normalization layers on the trainability of neural networks. One recent claim, based on simpler first-order analysis, is that residual connections and Batch Normalization make the loss landscape smoother, thus making it easier for Stochastic Gradient Descent to converge to a good solution. Our extensive analysis shows new finer-scale insights, demonstrating that, while conventional wisdom is sometimes validated, in other cases it is simply incorrect. In particular, we find that Batch Normalization does not necessarily make the loss landscape smoother, especially for shallower networks.

Motivation & Objective

  • Provide a scalable tool to compute Hessian information for large neural networks without forming the full Hessian.
  • Use Hessian-based analysis to study how architectural components like Batch Normalization and residual connections affect trainability and loss landscapes.
  • Offer empirical insights into when BN smooths or sharpens the loss landscape across different model depths.
  • Demonstrate distributed-memory execution to enable analysis on cloud or supercomputer systems.

Proposed method

  • Compute Hessian information through Hessian-vector products (backpropagation-based matvecs), so the full Hessian is never formed explicitly.
  • Estimate Hessian trace with Hutchinson’s randomized method using Hessian matvecs.
  • Estimate the full Hessian empirical spectral density (ESD) via Stochastic Lanczos Quadrature (SLQ), which is built on Lanczos iterations.
  • Analyze top Hessian eigenvalues, trace, and ESD for ResNet variants with/without Batch Normalization and residual connections on CIFAR-10/100.
  • Provide stage-wise and parametric loss-landscape visualizations by perturbing parameters along Hessian eigenvectors.
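
The matrix-free idea behind the first two bullets can be illustrated with a small standard-library sketch. This is not PyHessian's code: the toy matrix `A`, the helper names (`grad`, `hvp`, `hutchinson_trace`, `top_eigenvalue`), and the use of a central finite difference of gradients in place of a backpropagation-based Hessian-vector product are all assumptions made to keep the example self-contained. It shows that both the trace (Hutchinson's estimator with Rademacher vectors) and the top eigenvalue (plain power iteration, used here as a simple stand-in) need only the ability to compute H·v.

```python
import math
import random

# Toy symmetric "Hessian" (an assumption for illustration only; for a real
# network this matrix is never materialised).
A = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 0.5],
     [0.0, 0.5, 2.0]]


def grad(w):
    """Gradient of the toy loss 0.5 * w^T A w, which equals A @ w."""
    return [sum(a * x for a, x in zip(row, w)) for row in A]


def hvp(w, v, eps=1e-4):
    """Hessian-vector product from two gradient evaluations.

    A central finite difference stands in for the backpropagation-based
    product described in the review.
    """
    w_plus = [wi + eps * vi for wi, vi in zip(w, v)]
    w_minus = [wi - eps * vi for wi, vi in zip(w, v)]
    g_plus, g_minus = grad(w_plus), grad(w_minus)
    return [(gp - gm) / (2 * eps) for gp, gm in zip(g_plus, g_minus)]


def hutchinson_trace(w, n_samples=2000, seed=0):
    """Hutchinson estimate of tr(H): the average of v^T H v over random v."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        v = [rng.choice((-1.0, 1.0)) for _ in w]  # Rademacher probe vector
        hv = hvp(w, v)
        total += sum(vi * hvi for vi, hvi in zip(v, hv))
    return total / n_samples


def top_eigenvalue(w, n_iter=100, seed=0):
    """Power iteration; converges to the eigenvalue of largest magnitude."""
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in w]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    lam = 0.0
    for _ in range(n_iter):
        hv = hvp(w, v)
        lam = sum(vi * hvi for vi, hvi in zip(v, hv))  # Rayleigh quotient
        norm = math.sqrt(sum(x * x for x in hv))
        v = [x / norm for x in hv]
    return lam, v


w0 = [0.3, -0.2, 0.1]
trace_est = hutchinson_trace(w0)      # exact trace of A is 4 + 3 + 2 = 9
lam_max, v_max = top_eigenvalue(w0)
```

In a deep-learning framework the `hvp` helper would typically be replaced by differentiating the gradient-vector inner product a second time, which costs roughly one extra backward pass per product.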

Experimental results

Research questions

  • RQ1: How do Batch Normalization and residual connections influence the Hessian spectrum (top eigenvalue, trace, and ESD) during training?
  • RQ2: Does removing BN or residual connections lead to smoother or sharper loss landscapes across different network depths?
  • RQ3: Can Hessian-based diagnostics reveal fine-grained, stage-wise effects of architectural components on trainability and generalization?
  • RQ4: Is PyHessian scalable to state-of-the-art deep nets using distributed memory on cloud or HPC systems?

Key findings

  • Removing BN can lead to rapid growth of the Hessian eigenvalues, especially in deeper models, and BN is more critical in the later stages of ResNet models.
  • Removing BN does not universally sharpen the loss landscape; shallower networks may exhibit flatter Hessian spectra when BN is removed, while deeper networks show sharper spectra.
  • Removing residual connections generally increases the top eigenvalue, the trace, and the support range of the ESD, indicating a sharper, less smooth loss landscape.
  • BN absence in deeper networks can cause convergence to sharp local minima with higher training loss and poorer generalization, whereas this is less pronounced in shallower models.
  • Stage-wise analysis shows that BN removal in later stages more strongly affects Hessian metrics and generalization, linking Hessian changes to accuracy drops.
  • PyHessian enables efficient, distributed Hessian analysis without forming the full Hessian, enabling insights into architecture-design questions about BN and residuals.
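
To make the "sharp" versus "flat" vocabulary in these findings concrete, here is a minimal sketch of a one-dimensional loss slice along a Hessian eigenvector. The two-parameter quadratic loss, its eigenvalues (10 and 0.1), and the helper names `loss` and `loss_along` are hypothetical and chosen only for illustration; they are not taken from the paper's experiments. Near a minimum, stepping a distance eps along an eigenvector with eigenvalue lambda raises the loss by about 0.5 * lambda * eps^2, so a large top eigenvalue corresponds to a steep slice.

```python
def loss(w, hessian_diag):
    """Toy quadratic loss whose Hessian is diagonal with the given entries."""
    return 0.5 * sum(h * x * x for h, x in zip(hessian_diag, w))


def loss_along(direction, hessian_diag, steps):
    """Loss values at the minimum (the origin) perturbed by eps * direction."""
    return [loss([eps * d for d in direction], hessian_diag) for eps in steps]


# Hypothetical curvatures: one sharp direction and one flat direction.
hessian_diag = [10.0, 0.1]
steps = [-0.5, 0.0, 0.5]

# For a diagonal Hessian the eigenvectors are the coordinate axes.
sharp = loss_along([1.0, 0.0], hessian_diag, steps)  # slice along the eigenvalue-10 axis
flat = loss_along([0.0, 1.0], hessian_diag, steps)   # slice along the eigenvalue-0.1 axis
```

For the same step size, the slice along the high-curvature direction rises a hundred times faster than the slice along the low-curvature one, which is the pattern a perturbation plot along the top eigenvector is meant to expose.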

This review was created by AI and reviewed by human editors.