QUICK REVIEW

[Paper Review] Fast Convergence of Natural Gradient Descent for Overparameterized Neural Networks

Guodong Zhang, James Martens|arXiv (Cornell University)|May 27, 2019

Stochastic Gradient Optimization Techniques | References: 59 | Cited by: 41

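One-Sentence Summary

The paper proves that natural gradient descent (NGD) converges globally at a linear rate on nonlinear, overparameterized neural networks under two conditions on the Jacobian matrix, and extends the result to K-FAC and to general losses while retaining favorable generalization.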

ABSTRACT

Natural gradient descent has proven effective at mitigating the effects of pathological curvature in neural network optimization, but little is known theoretically about its convergence properties, especially for nonlinear networks. In this work, we analyze for the first time the speed of convergence of natural gradient descent on nonlinear neural networks with squared-error loss. We identify two conditions which guarantee efficient convergence from random initializations: (1) the Jacobian matrix (of network's output for all training cases with respect to the parameters) has full row rank, and (2) the Jacobian matrix is stable for small perturbations around the initialization. For two-layer ReLU neural networks, we prove that these two conditions do in fact hold throughout the training, under the assumptions of nondegenerate inputs and overparameterization. We further extend our analysis to more general loss functions. Lastly, we show that K-FAC, an approximate natural gradient descent method, also converges to global minima under the same assumptions, and we give a bound on the rate of this convergence.

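Research Motivation and Goals

Motivation: use natural gradient descent to address pathological curvature in neural network optimization.
Identify simple, generic conditions on the network Jacobian that guarantee efficient convergence from random initializations.
Extend the analysis to general loss functions and to approximate NGD methods such as K-FAC.
Demonstrate that NGD can achieve faster convergence without sacrificing generalization.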

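Proposed Method

Define the NGD update using the Fisher / Gauss-Newton matrix F, with its generalized inverse when F is singular.
Introduce two conditions on the Jacobian matrix: (i) full row rank at initialization, and (ii) stability of the Jacobian under small parameter perturbations.
Under these conditions, prove linear convergence of NGD together with a bound on the step size.
Apply the abstract analysis to a specific overparameterized two-layer ReLU network with random initialization and normalized inputs.
Show that, compared with GD, NGD improves the convergence rate by O(lambda_min(G∞)/n), and that K-FAC also converges linearly under similar assumptions.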
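
The update described in the list above can be written out for the squared-error case. This is a sketch under stated assumptions rather than the paper's exact notation: the 1/(2n) loss scaling and the symbols u (vector of network outputs on the n training inputs), J (Jacobian of u with respect to the parameters), and η (step size) are chosen here for illustration. The second equality in the middle line relies on condition (i), full row rank of J, and the last line is only a first-order approximation that holds when J changes little over the step, which is the role of condition (ii).

```latex
% Assumed notation: u(k) = outputs on the n training inputs at step k,
% J = Jacobian of u w.r.t. the parameters (n rows), \eta = step size.
\begin{align*}
L(\theta) &= \tfrac{1}{2n}\,\lVert u(\theta) - y \rVert_2^2, \qquad
\nabla L = \tfrac{1}{n}\, J^\top (u - y), \qquad
F = \tfrac{1}{n}\, J^\top J \\
\theta(k+1) &= \theta(k) - \eta\, F^{\dagger} \nabla L
             = \theta(k) - \eta\, J^\top (J J^\top)^{-1} \bigl(u(k) - y\bigr) \\
u(k+1) - y &\approx \bigl(u(k) - y\bigr) - \eta\, J J^\top (J J^\top)^{-1} \bigl(u(k) - y\bigr)
             = (1 - \eta)\,\bigl(u(k) - y\bigr)
\end{align*}
```

Read this way, the residual shrinks by roughly a factor of (1 - η) per step regardless of the spectrum of J Jᵀ, which is the intuition behind a rate that does not depend on the Gram matrix.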

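Experimental Results

Research Questions

RQ1: Under what conditions does natural gradient descent converge to a global minimum for nonlinear, overparameterized neural networks?
RQ2: For two-layer ReLU networks, how does NGD differ from gradient descent in convergence rate and learning-rate tolerance?
RQ3: Can NGD and K-FAC provide provable global convergence for nonlinear networks under general loss functions beyond squared error?
RQ4: How does NGD affect generalization relative to standard gradient descent?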

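Key Findings

When the Jacobian has full row rank at initialization and remains stable nearby, NGD converges linearly to a global minimum.
For overparameterized two-layer ReLU networks, NGD converges with a constant step size of order O(1); in the infinite-width limit it can converge in O(1) iterations.
In the specific two-layer setting, NGD improves the convergence rate over gradient descent by O(lambda_min(G∞)/n).
Under the same assumptions and sufficient overparameterization, K-FAC also converges linearly to a global minimum, at a rate that depends on the data Gram matrix.
The generalization bound for NGD matches the bound proved for gradient descent in the two-layer ReLU setting, indicating that faster convergence comes without a loss in generalization.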
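
The first three findings can be explored numerically with a small NumPy experiment. The following is a minimal sketch, not the authors' code: the function names (`make_problem`, `forward`, `jacobian`, `train`), the problem sizes, the random targets, and the choice to train only the first-layer weights with fixed ±1 output weights are all assumptions made for illustration. It runs NGD and plain GD with the same step size on a wide two-layer ReLU network and records the squared-error loss.

```python
import numpy as np

def make_problem(n=6, d=10, m=4096, seed=0):
    """Toy data and a wide two-layer ReLU net (all sizes are illustrative)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)   # normalized inputs
    y = rng.standard_normal(n)                      # arbitrary targets
    W = rng.standard_normal((m, d))                 # trained first layer
    a = rng.choice([-1.0, 1.0], size=m)             # fixed output weights
    return X, y, W, a

def forward(W, a, X):
    """f(x) = (1/sqrt(m)) * sum_r a_r * relu(w_r . x), for every row of X."""
    return np.maximum(X @ W.T, 0.0) @ a / np.sqrt(W.shape[0])

def jacobian(W, a, X):
    """Jacobian of the n outputs w.r.t. W, flattened to shape (n, m*d)."""
    m = W.shape[0]
    active = (X @ W.T > 0).astype(float)            # (n, m) ReLU gates
    J = (active * a)[:, :, None] * X[:, None, :] / np.sqrt(m)
    return J.reshape(X.shape[0], -1)

def train(method, steps, lr, **problem_kwargs):
    """Run 'ngd' or 'gd' and return the loss before each step and at the end."""
    X, y, W, a = make_problem(**problem_kwargs)
    n = len(y)
    losses = []
    for _ in range(steps):
        r = forward(W, a, X) - y                    # residual u - y
        losses.append(0.5 * np.mean(r ** 2))
        J = jacobian(W, a, X)
        if method == "ngd":
            # J^T (J J^T)^{-1} r: assumes J has full row rank.
            step = J.T @ np.linalg.solve(J @ J.T, r)
        else:
            step = J.T @ r / n                      # gradient of the loss
        W = W - lr * step.reshape(W.shape)
    r = forward(W, a, X) - y
    losses.append(0.5 * np.mean(r ** 2))
    return losses

if __name__ == "__main__":
    ngd = train("ngd", steps=30, lr=0.5)
    gd = train("gd", steps=30, lr=0.5)
    print("NGD first/last loss:", ngd[0], ngd[-1])
    print("GD  first/last loss:", gd[0], gd[-1])
```

The NGD branch solves an n-by-n system with J Jᵀ instead of forming the much larger parameter-space Fisher matrix, which is only practical here because n is tiny. A toy run like this illustrates the qualitative claim and says nothing about the paper's bounds or about K-FAC.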


This review was generated by AI and checked by a human editor.