QUICK REVIEW

[Paper Review] Path-SGD: Path-Normalized Optimization in Deep Neural Networks

Behnam Neyshabur, Ruslan Salakhutdinov|arXiv (Cornell University)|Jun 8, 2015
Stochastic Gradient Optimization Techniques|16 references|164 citations
TL;DR

This paper proposes Path-SGD, an optimization method for deep neural networks whose path-normalized gradient updates are rescaling invariant: the update behaves the same under any rescaling of the weights that leaves the network's function unchanged. By approximating steepest descent with respect to a path-wise regularizer related to max-norm regularization, Path-SGD outperforms SGD and AdaGrad in convergence speed and generalization, especially under unbalanced weight initialization.

ABSTRACT

We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.

Motivation & Objective

  • To address the limitations of standard SGD in deep learning by rethinking the geometry of weight optimization.
  • To develop an optimization method invariant to rescaling of weights, which does not affect the network’s output function.
  • To improve training efficiency and generalization by aligning optimization geometry with the inductive bias of ReLU networks.
  • To demonstrate that path-normalized optimization leads to better implicit regularization than standard $\ell_2$-norm regularization (weight decay).
  • To provide a practical, efficient alternative to SGD that can be easily integrated into existing training pipelines.

Proposed method

  • Proposes Path-SGD as an approximate steepest descent method with respect to a path regularizer derived from the minimum max-norm over all rescalings of the weights.
  • Defines rescaling invariance via transformations that, at any hidden unit, multiply its incoming weights by a constant factor $c > 0$ and divide its outgoing weights by the same $c$, leaving the output of a ReLU network unchanged.
  • Introduces a path regularizer that computes the minimum possible max-norm across all such rescalings, ensuring invariance to weight rescaling.
  • Uses this regularizer to define a Riemannian-like geometry on the weight space, enabling steepest descent updates that are invariant to rescaling.
  • Implements Path-SGD efficiently by computing the path regularizer using dynamic programming over paths in the network graph.
  • Is compatible with adaptive step sizes (e.g., AdaGrad) and momentum, although such combinations are suggested as an extension rather than evaluated.
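
As a rough illustration of the rescaling and path-regularizer bullets above, the following NumPy sketch builds a small two-layer, bias-free ReLU network, rescales one hidden unit, and evaluates the squared $\ell_2$ path regularizer (the sum, over all input-to-output paths, of the product of squared weights along the path). The function names, layer sizes, and the choice $p = 2$ are assumptions made for this example and are not taken from the paper's code.

```python
import itertools
import numpy as np

def forward(W1, W2, x):
    """Two-layer ReLU network without biases (an assumed toy architecture)."""
    return W2 @ np.maximum(W1 @ x, 0.0)

def rescale_unit(W1, W2, j, c):
    """Multiply the incoming weights of hidden unit j by c and divide its outgoing weights by c."""
    W1r, W2r = W1.copy(), W2.copy()
    W1r[j, :] *= c
    W2r[:, j] /= c
    return W1r, W2r

def path_reg_sq_dp(W1, W2):
    """Squared l2 path regularizer via one layer-by-layer pass.

    Propagating an all-ones input through the element-wise squared weights
    accumulates the sum over paths without enumerating them.
    """
    ones = np.ones(W1.shape[1])
    return float(np.sum((W2 ** 2) @ ((W1 ** 2) @ ones)))

def path_reg_sq_brute(W1, W2):
    """Same quantity by explicit enumeration of every input -> hidden -> output path."""
    n_out, n_hid = W2.shape
    n_in = W1.shape[1]
    total = 0.0
    for i, j, k in itertools.product(range(n_in), range(n_hid), range(n_out)):
        total += (W1[j, i] * W2[k, j]) ** 2
    return total

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))   # hidden x input
W2 = rng.normal(size=(2, 4))   # output x hidden
x = rng.normal(size=3)

W1r, W2r = rescale_unit(W1, W2, j=1, c=10.0)

# The rescaled network computes the same function ...
same_output = np.allclose(forward(W1, W2, x), forward(W1r, W2r, x))
# ... and has the same path regularizer, even though the weights themselves changed.
same_path_reg = np.isclose(path_reg_sq_dp(W1, W2), path_reg_sq_dp(W1r, W2r))
# The layer-by-layer pass agrees with brute-force path enumeration.
dp_matches_brute = np.isclose(path_reg_sq_dp(W1, W2), path_reg_sq_brute(W1, W2))

print(same_output, same_path_reg, dp_matches_brute)
```

The layer-by-layer pass costs about as much as one forward pass, whereas the brute-force loop grows with the number of paths, which is why the enumeration is only practical for tiny networks like this one.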

Experimental results

Research questions

  • RQ1: Can a geometry for optimization in deep networks be designed that is invariant to rescaling of weights, since such rescalings do not affect the network’s function?
  • RQ2: Does path-normalized optimization lead to faster convergence and better generalization compared to standard SGD and AdaGrad?
  • RQ3: Can a regularizer based on the minimum max-norm over rescalings be efficiently computed and used in practice for training deep networks?
  • RQ4: Does the implicit regularization induced by Path-SGD improve generalization, especially under poor or unbalanced weight initialization?
  • RQ5: How does Path-SGD perform in comparison to SGD and AdaGrad when training deep networks with and without dropout?

Key findings

  • Path-SGD achieves faster convergence than SGD and AdaGrad across multiple benchmark datasets, including MNIST, CIFAR-10, CIFAR-100, and SVHN.
  • Under unbalanced weight initialization, Path-SGD maintains performance while SGD and AdaGrad suffer significant degradation in training and test error.
  • Path-SGD generalizes better than SGD and AdaGrad, with lower test error even when training error is zero, suggesting improved implicit regularization.
  • The method is numerically stable and produces identical optimization trajectories regardless of whether the network is initialized in a balanced or unbalanced manner.
  • Path-SGD outperforms baseline methods in both training speed and final generalization error, particularly in settings with dropout.
  • The method is compatible with adaptive step sizes and momentum, suggesting potential for further performance gains when combined with such techniques.
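
The invariance to balanced versus unbalanced initialization can be checked on a toy problem. The sketch below assumes the per-edge scaling for $p = 2$, in which each gradient entry is divided by the sum, over all input-to-output paths through that edge, of the product of the squared weights of the other edges on the path. It is specialised to a two-layer, bias-free ReLU network with squared loss on a single example. The names `path_sgd_step` and `sgd_step`, the layer sizes, and the learning rate are hypothetical choices for illustration.

```python
import numpy as np

def forward(W1, W2, x):
    """Two-layer ReLU network without biases (an assumed toy architecture)."""
    return W2 @ np.maximum(W1 @ x, 0.0)

def grads(W1, W2, x, t):
    """Gradients of 0.5 * ||forward(x) - t||^2 with respect to W1 and W2."""
    z = W1 @ x
    h = np.maximum(z, 0.0)
    dy = W2 @ h - t
    gW2 = np.outer(dy, h)
    gW1 = np.outer((W2.T @ dy) * (z > 0), x)
    return gW1, gW2

def sgd_step(W1, W2, x, t, lr):
    gW1, gW2 = grads(W1, W2, x, t)
    return W1 - lr * gW1, W2 - lr * gW2

def path_sgd_step(W1, W2, x, t, lr):
    """One path-normalized step (p = 2) for the two-layer case.

    For an input->hidden edge into unit j, the scaling is the sum of squared
    outgoing weights of j; for a hidden->output edge out of unit j, it is the
    sum of squared incoming weights of j. No epsilon is added, so a hidden
    unit with all-zero weights would cause a division by zero.
    """
    gW1, gW2 = grads(W1, W2, x, t)
    gamma1 = np.sum(W2 ** 2, axis=0)[:, None]  # shape (n_hid, 1), broadcast over W1
    gamma2 = np.sum(W1 ** 2, axis=1)[None, :]  # shape (1, n_hid), broadcast over W2
    return W1 - lr * gW1 / gamma1, W2 - lr * gW2 / gamma2

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 3, 5, 2
W1 = rng.normal(size=(n_hid, n_in))
W2 = rng.normal(size=(n_out, n_hid))
x, t = rng.normal(size=n_in), rng.normal(size=n_out)
probe = rng.normal(size=n_in)  # a second input used to compare the updated networks

# "Unbalanced" copy: same function, but each hidden unit is rescaled by a different factor.
c = np.array([100.0, 0.01, 10.0, 1.0, 0.1])
W1u, W2u = W1 * c[:, None], W2 / c[None, :]

lr = 0.01
out_bal = forward(*path_sgd_step(W1, W2, x, t, lr), probe)
out_unb = forward(*path_sgd_step(W1u, W2u, x, t, lr), probe)
path_sgd_invariant = np.allclose(out_bal, out_unb)

# For comparison: the gap between the two networks after one plain SGD step.
sgd_gap = np.max(np.abs(forward(*sgd_step(W1, W2, x, t, lr), probe)
                        - forward(*sgd_step(W1u, W2u, x, t, lr), probe)))

print("Path-SGD step gives the same function:", path_sgd_invariant)
print("Max output gap after a plain SGD step:", sgd_gap)
```

In this sketch the two networks start as the same function and remain the same function after the path-normalized step, whereas plain SGD has no such guarantee; the printed gap shows how far the balanced and unbalanced copies drift apart after one step.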

This review was created by AI and reviewed by human editors.