QUICK REVIEW

[Paper Review] Stochastic Mirror Descent on Overparameterized Nonlinear Models: Convergence, Implicit Regularization, and Generalization

Navid Azizan, Sahin Lale|arXiv (Cornell University)|Jun 10, 2019
Domain Adaptation and Few-Shot Learning | References: 38 | Cited by: 29
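One-Sentence Summary

This paper studies stochastic mirror descent (SMD) on overparameterized nonlinear models and shows that, with a sufficiently small step size, SMD converges to a global minimum that is approximately the closest one to the initialization in the Bregman divergence induced by the mirror potential. Surprisingly, the experiments show that the potential $\psi(\cdot)=\|\cdot\|_q^q$ with $q=10$ generalizes better than $q=2$ (SGD) or $q=1$, even though it induces less sparsity, which underscores the role of implicit regularization in the generalization of deep networks.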

ABSTRACT

Most modern learning problems are highly overparameterized, meaning that there are many more parameters than the number of training data points, and as a result, the training loss may have infinitely many global minima (parameter vectors that perfectly interpolate the training data). Therefore, it is important to understand which interpolating solutions we converge to, how they depend on the initialization point and the learning algorithm, and whether they lead to different generalization performances. In this paper, we study these questions for the family of stochastic mirror descent (SMD) algorithms, of which the popular stochastic gradient descent (SGD) is a special case. Our contributions are both theoretical and experimental. On the theory side, we show that in the overparameterized nonlinear setting, if the initialization is close enough to the manifold of global minima (something that comes for free in the highly overparameterized case), SMD with sufficiently small step size converges to a global minimum that is approximately the closest one in Bregman divergence. On the experimental side, our extensive experiments on standard datasets and models, using various initializations, various mirror descents, and various Bregman divergences, consistently confirm that this phenomenon happens in deep learning. Our experiments further indicate that there is a clear difference in the generalization performance of the solutions obtained by different SMD algorithms. Experimenting on a standard image dataset and network architecture with SMD with different kinds of implicit regularization, $\ell_1$ to encourage sparsity, $\ell_2$ yielding SGD, and $\ell_{10}$ to discourage large components in the parameter vector, consistently and definitively shows that $\ell_{10}$-SMD has better generalization performance than SGD, which in turn has better generalization performance than $\ell_1$-SMD.
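
As a reading aid, the two objects the abstract relies on can be written out explicitly. This is a sketch under assumptions: the symbols $w_i$ (parameters after step $i$), $\eta$ (step size), and $L_i$ (loss on the sample used at step $i$) are notation chosen here, and the $1/q$ normalization of the potential below is an assumed convenience that only rescales the step size relative to $\|\cdot\|_q^q$.

$$
\nabla\psi(w_i) = \nabla\psi(w_{i-1}) - \eta\,\nabla L_i(w_{i-1}),
\qquad
D_\psi(w, w') = \psi(w) - \psi(w') - \nabla\psi(w')^\top (w - w').
$$

For $\psi(w) = \tfrac{1}{q}\|w\|_q^q$ with $q > 1$, the update acts coordinate by coordinate:

$$
|w_{i,j}|^{q-1}\,\operatorname{sign}(w_{i,j}) = |w_{i-1,j}|^{q-1}\,\operatorname{sign}(w_{i-1,j}) - \eta\,[\nabla L_i(w_{i-1})]_j .
$$

Setting $q = 2$ gives $w_i = w_{i-1} - \eta\,\nabla L_i(w_{i-1})$, which is SGD, and $D_\psi(w, w') = \tfrac{1}{2}\|w - w'\|_2^2$, so "closest in Bregman divergence" reduces to "closest in Euclidean distance" in that case.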

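Motivation and Objectives

  • Understand which global minimum stochastic mirror descent (SMD) converges to in overparameterized nonlinear models.
  • Study how the choice of mirror potential, which defines the Bregman divergence, affects implicit regularization and generalization.
  • Determine whether different SMD algorithms generalize differently even when they reach the same training loss.
  • Validate the theoretically predicted convergence behavior through systematic experiments on standard datasets and architectures.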

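Proposed Method

  • The theoretical analysis shows that, for a sufficiently small step size, SMD converges to a global minimum that is approximately the closest one to the initialization in the Bregman divergence induced by the mirror potential.
  • The analysis applies to overparameterized nonlinear models, in which the initialization is naturally close to the manifold of global minima because of the high dimensionality.
  • The experiments use MNIST and CIFAR-10 with a ResNet-18 architecture, training SMD with different mirror potentials ($\ell_1$, $\ell_2$, $\ell_3$, $\ell_{10}$) from different initializations until zero training error is reached.
  • The predicted closest-minimum behavior is checked by measuring the pairwise Bregman divergences between the final solutions and the initial points.
  • Weight-distribution histograms are analyzed to assess sparsity and the change in parameter magnitudes under each mirror.
  • Generalization is evaluated by test accuracy on CIFAR-10, comparing SMD variants that reach the same training loss.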
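
The update and the distance measurement described in the list above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' code: the function names (`mirror_map`, `inverse_mirror_map`, `smd_step`, `bregman_divergence`) are hypothetical, the potential is assumed to be $\psi(w) = \tfrac{1}{q}\sum_j |w_j|^q$ with $q > 1$, and the weights are treated as one flat vector.

```python
import numpy as np

def mirror_map(w, q):
    """Gradient of the assumed potential psi(w) = (1/q) * sum_j |w_j|**q."""
    return np.sign(w) * np.abs(w) ** (q - 1)

def inverse_mirror_map(z, q):
    """Inverse of mirror_map; requires q > 1 so the map is invertible."""
    return np.sign(z) * np.abs(z) ** (1.0 / (q - 1))

def smd_step(w, grad, lr, q):
    """One SMD update: take the gradient step in the mirror domain, then map back."""
    z = mirror_map(w, q) - lr * grad
    return inverse_mirror_map(z, q)

def bregman_divergence(w, w_ref, q):
    """D_psi(w, w_ref) = psi(w) - psi(w_ref) - <grad psi(w_ref), w - w_ref>."""
    psi = lambda v: np.sum(np.abs(v) ** q) / q
    return psi(w) - psi(w_ref) - mirror_map(w_ref, q) @ (w - w_ref)

# Small usage example with made-up numbers.
w = np.array([0.5, -0.2, 0.1])
g = np.array([1.0, -2.0, 0.5])
w_sgd = smd_step(w, g, lr=0.1, q=2)    # q = 2 is the plain SGD step
w_l10 = smd_step(w, g, lr=1e-4, q=10)  # q = 10 step on the same gradient
```

With $q = 2$ the step is the ordinary gradient step and the divergence is half the squared Euclidean distance, which gives a quick sanity check. The $\ell_1$ case is not covered by this sketch, because the mirror map is not invertible at $q = 1$.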

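Experimental Results

Research Questions

  • RQ1: Does stochastic mirror descent (SMD) converge to the global minimum closest to the initialization in the Bregman divergence defined by the mirror potential?
  • RQ2: How does the choice of mirror potential affect implicit regularization and generalization in deep neural networks?
  • RQ3: In practice, is SMD's convergence to the closest minimum consistent across different initializations and mirror types?
  • RQ4: Why does $\ell_{10}$-SMD generalize better than $\ell_2$-SMD (SGD) and $\ell_1$-SMD even though it induces less sparsity?
  • RQ5: Can the implicit regularization of SMD be exploited systematically to improve test performance in deep learning?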

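Key Findings

  • In all experiments, the final solution reached by each SMD algorithm is closest to its own initialization in the Bregman divergence defined by that algorithm's mirror potential, confirming the theoretical prediction.
  • $\ell_{10}$-SMD achieves the best generalization on CIFAR-10, ahead of $\ell_2$-SMD (SGD) and $\ell_1$-SMD.
  • $\ell_1$-SMD induces marked sparsity in the final weights, as the weight-magnitude histograms confirm.
  • $\ell_2$-SMD (SGD) preserves the initial weight distribution most closely, with the smallest change in the histogram.
  • $\ell_{10}$-SMD shifts the weight distribution markedly toward larger magnitudes, leaving almost all weights nonzero and concentrated between 0.005 and 0.04.
  • Although all variants interpolate the training data perfectly and reach the same training loss, their test accuracies differ markedly, with $\ell_{10}$-SMD consistently the most accurate on CIFAR-10.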
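
The "closest global minimum" finding can be illustrated on a toy problem where the answer is known in closed form. The sketch below is not the paper's experiment: it swaps the nonlinear network for an overparameterized linear least-squares model and uses only the $q = 2$ member of the SMD family (plain SGD), for which the closest interpolating solution is the Euclidean projection of the initialization onto $\{w : Xw = y\}$. The problem sizes, seed, and step size are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 20                        # 5 training points, 20 parameters
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
w0 = 0.1 * rng.standard_normal(d)   # initialization

# Closed-form reference: projection of w0 onto the set of interpolating solutions.
w_closest = w0 + np.linalg.pinv(X) @ (y - X @ w0)

# Plain SGD on the per-sample loss 0.5 * (x_i . w - y_i)**2, one sample at a time.
w = w0.copy()
lr = 0.01
for epoch in range(3000):
    for i in rng.permutation(n):
        residual = X[i] @ w - y[i]
        w -= lr * residual * X[i]

train_error = np.max(np.abs(X @ w - y))
gap = np.linalg.norm(w - w_closest)
print(f"max training residual: {train_error:.2e}")
print(f"distance to closest interpolating solution: {gap:.2e}")
```

In this linear setting every update moves `w` along a row of `X`, so the iterate never leaves the affine set `w0 + rowspace(X)`, and the only interpolating point in that set is the projection. The paper's result concerns the nonlinear case, where the corresponding statement holds only approximately and for initializations close to the manifold of global minima.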

This review was generated by AI and reviewed by human editors.