QUICK REVIEW

[Paper Review] Practical Quasi-Newton Methods for Training Deep Neural Networks

Donald Goldfarb, Yi Ren|arXiv (Cornell University)|Jun 16, 2020

Stochastic Gradient Optimization Techniques | References: 43 | Cited by: 39

One-Sentence Summary

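This paper develops practical stochastic quasi-Newton methods for training deep neural networks, using Kronecker-factored block-diagonal BFGS/L-BFGS updates and a double damping strategy; the methods are competitive with, and in some cases outperform, KFAC and first-order methods.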

ABSTRACT

We consider the development of practical stochastic quasi-Newton, and in particular Kronecker-factored block-diagonal BFGS and L-BFGS methods, for training deep neural networks (DNNs). In DNN training, the number of variables and components of the gradient $n$ is often of the order of tens of millions and the Hessian has $n^2$ elements. Consequently, computing and storing a full $n \times n$ BFGS approximation or storing a modest number of (step, change in gradient) vector pairs for use in an L-BFGS implementation is out of the question. In our proposed methods, we approximate the Hessian by a block-diagonal matrix and use the structure of the gradient and Hessian to further approximate these blocks, each of which corresponds to a layer, as the Kronecker product of two much smaller matrices. This is analogous to the approach in KFAC, which computes a Kronecker-factored block-diagonal approximation to the Fisher matrix in a stochastic natural gradient method. Because of the indefinite and highly variable nature of the Hessian in a DNN, we also propose a new damping approach to keep the upper as well as the lower bounds of the BFGS and L-BFGS approximations bounded. In tests on autoencoder feed-forward neural network models with either nine or thirteen layers applied to three datasets, our methods outperformed or performed comparably to KFAC and state-of-the-art first-order stochastic methods.

Research Motivation and Objectives

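Motivate the use of second-order information in deep neural network (DNN) training despite the high dimensionality of the problem.
Propose scalable Kronecker-factored block-diagonal BFGS/L-BFGS updates to approximate the Hessian.
Develop a damping strategy to maintain positive definiteness and bound eigenvalue changes in nonconvex DNNs.
Introduce Hessian-action BFGS for layer-wise Hessian approximation, combined with Levenberg-Marquardt (LM) damping to handle singularity.
Provide convergence guarantees for the proposed stochastic quasi-Newton methods and demonstrate their empirical performance on DNNs.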
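The role of LM damping can be illustrated with a small sketch. It assumes, as in KFAC-style factorizations, that the factor A_l is a second-moment matrix of a layer's inputs estimated from a mini-batch; the helper name `lm_damp`, the batch size, and the value of `lam` are hypothetical. When the batch is smaller than the input dimension, the estimate is rank-deficient, and adding a multiple of the identity restores positive definiteness.

```python
import numpy as np


def lm_damp(A, lam):
    """Levenberg-Marquardt damping: return A + lam * I."""
    return A + lam * np.eye(A.shape[0])


rng = np.random.default_rng(0)
acts = rng.standard_normal((4, 10))   # 4 samples, 10 input features (assumed sizes)
A = acts.T @ acts / acts.shape[0]     # 10 x 10 second-moment estimate, rank <= 4
A_lm = lm_damp(A, lam=0.1)            # lam is an assumed damping value

print("rank of A:", np.linalg.matrix_rank(A))
print("smallest eigenvalue of damped A:", np.linalg.eigvalsh(A_lm).min())
```

Because A is positive semidefinite, every eigenvalue of the damped matrix is at least `lam`, so it can be inverted or used in a BFGS-type update safely.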

Proposed Method

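Represent the Hessian as a block-diagonal matrix, with each block corresponding to one layer, and approximate each block as the Kronecker product of two smaller matrices A_l and G_l.
Update the inverse Hessian block H_g^l with damped BFGS or L-BFGS updates based on the gradient with respect to h_l, which ensures positive definiteness.
Update the A_l block using Hessian-action BFGS with an LM damping term to handle potential singularity, i.e., A_l^{LM} = A_l + λ_A I.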
Combine the two updates to form the new weights W_l^+, such that vec(W_l^+) − vec(W_l) = −α (H_g^l ⊗ H_a^l) vec(Ẽ∇f_l), which amounts to applying a Kronecker-structured preconditioner to the layer gradient.
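Introduce a double damping (DD) scheme that bounds the ratios y^T H y / s^T y and s^T s / s^T y, ensuring stable BFGS updates in the stochastic setting.
Give a convergence analysis within a stochastic quasi-Newton framework, and discuss a non-loop L-BFGS implementation that improves GPU efficiency.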
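The Kronecker-preconditioned step in the list above never requires forming the Kronecker product. The sketch below, with randomly generated symmetric positive definite factors standing in for H_g^l and H_a^l, checks that the matrix form W − α H_g ∇f H_a matches the vectorized form. All names, sizes, and the step size are assumptions for illustration. Note that the identity holds in the order H_g ⊗ H_a when vec stacks rows (NumPy's default flattening); with column-stacking vec the two factors swap places.

```python
import numpy as np


def random_spd(n, rng):
    """Random symmetric positive definite matrix (stand-in for a BFGS factor)."""
    M = rng.standard_normal((n, n))
    return M @ M.T + n * np.eye(n)


def kron_step(W, grad, H_g, H_a, alpha):
    """Cheap form: W - alpha * H_g @ grad @ H_a, without building H_g (x) H_a."""
    return W - alpha * (H_g @ grad @ H_a)


def kron_step_explicit(W, grad, H_g, H_a, alpha):
    """Same step via the explicit Kronecker product and row-major vec.

    For checking only; this costs O((d_out * d_in)^2) memory.
    """
    step = np.kron(H_g, H_a) @ grad.reshape(-1)
    return W - alpha * step.reshape(W.shape)


rng = np.random.default_rng(0)
d_out, d_in = 3, 5                       # assumed layer sizes
W = rng.standard_normal((d_out, d_in))
grad = rng.standard_normal((d_out, d_in))
H_g, H_a = random_spd(d_out, rng), random_spd(d_in, rng)

W_fast = kron_step(W, grad, H_g, H_a, alpha=0.1)
W_slow = kron_step_explicit(W, grad, H_g, H_a, alpha=0.1)
print("forms agree:", np.allclose(W_fast, W_slow))
```

The cheap form uses two small matrix products per layer, which is what makes the layer-wise factorization practical at scale.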

Experimental Results

Research Questions

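RQ1: Can stochastic quasi-Newton methods be made practical for training large-scale DNNs by exploiting layer-wise Kronecker structure?
RQ2: Does the double damping scheme ensure positive definiteness and bounded eigenvalue changes in nonconvex, stochastic training?
RQ3: On standard autoencoder benchmarks, how do K-BFGS and K-BFGS(L) compare with KFAC and first-order methods in training efficiency and generalization?
RQ4: Under standard stochastic optimization assumptions, how do Kronecker-factored stochastic quasi-Newton methods behave in terms of convergence?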

Key Findings

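K-BFGS and K-BFGS(L) have storage and per-iteration costs comparable to those of first-order methods, while still providing second-order information through the layer-wise Kronecker factorization.
K-BFGS and K-BFGS(L) perform well against first-order methods in both training and test performance, and in many cases are competitive with or outperform KFAC.
Hessian-action BFGS on the A_l blocks, combined with LM damping, gives stable updates even when A_l is singular or ill-conditioned.
The double damping procedure maintains positive definiteness and bounds the eigenvalues, which contributes to robustness in stochastic nonconvex optimization.
Experiments on MNIST, FACES, and CURVES show training loss and test error that are better than or comparable to those of KFAC and first-order methods, with good generalization.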
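The double damping idea in the findings above can be sketched as two Powell-style damping steps applied in sequence: the first modifies s so that s^T y is not too small relative to y^T H y, and the second modifies y so that s^T y is not too small relative to s^T s. This is a minimal illustration under stated assumptions, not the paper's exact procedure: the function names, the thresholds `mu1` and `mu2`, and the test vectors are all hypothetical.

```python
import numpy as np


def double_damp(s, y, H, mu1=0.2, mu2=0.001):
    """Return a damped pair (s_t, y_t). Assumes H is positive definite, 0 < mu1, mu2 < 1."""
    # Step 1: Powell damping with H, enforcing s_t^T y >= mu1 * y^T H y.
    yHy = y @ H @ y
    sy = s @ y
    theta1 = (1.0 - mu1) * yHy / (yHy - sy) if sy < mu1 * yHy else 1.0
    s_t = theta1 * s + (1.0 - theta1) * (H @ y)

    # Step 2: Powell damping with the identity, enforcing s_t^T y_t >= mu2 * s_t^T s_t.
    ss = s_t @ s_t
    sy_t = s_t @ y
    theta2 = (1.0 - mu2) * ss / (ss - sy_t) if sy_t < mu2 * ss else 1.0
    y_t = theta2 * y + (1.0 - theta2) * s_t
    return s_t, y_t


def bfgs_inverse_update(H, s, y):
    """Standard BFGS update of an inverse-Hessian approximation H (requires s^T y > 0)."""
    rho = 1.0 / (s @ y)
    V = np.eye(len(s)) - rho * np.outer(s, y)
    return V @ H @ V.T + rho * np.outer(s, s)


rng = np.random.default_rng(0)
n = 5
H = np.eye(n)
s = rng.standard_normal(n)
y = -s + 0.1 * rng.standard_normal(n)    # negative curvature: s^T y < 0

s_t, y_t = double_damp(s, y, H)
H_new = bfgs_inverse_update(H, s_t, y_t)
print("raw s^T y:", s @ y)
print("damped s^T y:", s_t @ y_t)
print("min eigenvalue of updated H:", np.linalg.eigvalsh(H_new).min())
```

With the raw pair, the BFGS update would lose positive definiteness because s^T y is negative; after damping, the curvature condition holds and the updated matrix stays positive definite.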

This review was generated by AI and reviewed by human editors.