QUICK REVIEW

[Paper Review] Krylov Subspace Descent for Deep Learning

Oriol Vinyals, Daniel Povey|arXiv (Cornell University)|Nov 18, 2011

Neural Networks and Applications | References: 15 | Cited by: 78

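One-Sentence Summary

This paper proposes Krylov Subspace Descent (KSD), a second-order optimization method for deep learning. On each iteration, the method builds a Krylov subspace from the gradient and Hessian-vector products, then optimizes within that low-dimensional subspace using BFGS on a subset of the data. Without requiring a positive semidefinite Hessian approximation or a tuned damping parameter, KSD is reported to converge faster than Hessian Free (HF) and L-BFGS and to generally reach better cross-validation accuracy, in a comparison that also includes SGD.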

ABSTRACT

In this paper, we propose a second order optimization method to learn models where both the dimensionality of the parameter space and the number of training samples is high. In our method, we construct on each iteration a Krylov subspace formed by the gradient and an approximation to the Hessian matrix, and then use a subset of the training data samples to optimize over this subspace. As with the Hessian Free (HF) method of [7], the Hessian matrix is never explicitly constructed, and is computed using a subset of data. In practice, as in HF, we typically use a positive definite substitute for the Hessian matrix such as the Gauss-Newton matrix. We investigate the effectiveness of our proposed method on deep neural networks, and compare its performance to widely used methods such as stochastic gradient descent, conjugate gradient descent and L-BFGS, and also to HF. Our method leads to faster convergence than either L-BFGS or HF, and generally performs better than either of them in cross-validation accuracy. It is also simpler and more general than HF, as it does not require a positive semi-definite approximation of the Hessian matrix to work well nor the setting of a damping parameter. The chief drawback versus HF is the need for memory to store a basis for the Krylov subspace.

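Motivation and Objectives

Develop a more robust and more general second-order optimization method for high-dimensional, non-convex deep learning problems with large amounts of training data.
Remove the need in Hessian Free (HF) optimization for heuristic damping-parameter tuning and a positive semidefinite Hessian approximation.
Improve convergence speed and generalization in deep neural network training relative to existing methods such as SGD, L-BFGS, and HF.
Investigate whether advanced second-order methods such as KSD can remove the dependence of deep networks on pre-training.
Evaluate KSD with both the Hessian and the Gauss-Newton approximation across several deep learning tasks.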

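Proposed Method

KSD builds the Krylov subspace spanned by the gradient and successive Hessian-vector products, $\text{span}(\mathbf{g}_m, \mathbf{H}_m\mathbf{g}_m, \dots, \mathbf{H}_m^{K-1}\mathbf{g}_m)$, where $\mathbf{g}_m$ and $\mathbf{H}_m$ are the gradient and the Hessian (or its substitute) at iteration $m$, and $K$ is a fixed value (for example, 20 or 80).
On each iteration, the method evaluates the objective and its derivatives on a subset of the training data and runs BFGS optimization over the Krylov subspace.
Hessian-vector products are computed efficiently with Pearlmutter's technique, so the Hessian matrix is never constructed explicitly.
Where needed, the Gauss-Newton matrix is used as a positive definite substitute for the Hessian, which keeps the optimization stable even when the Hessian is indefinite.
Optimizing over the Krylov subspace implicitly selects the regularization, which removes the need for Levenberg-Marquardt damping.
All computations, including gradients and Hessian-vector products, run on the GPU using mini-batches of data to reduce memory and compute cost.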
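
The steps above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. Everything in it is an assumption made for illustration: the toy tanh regression objective, the function names (`hvp`, `krylov_basis`, `bfgs_minimize`, `ksd`), and the values of `K` and the iteration counts. It also simplifies the method in three ways: the Hessian-vector product uses a central finite difference of the gradient as a stand-in for Pearlmutter's exact technique, the full toy dataset is used everywhere instead of data subsets, and no preconditioning or GPU code is included.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))          # toy inputs (hypothetical data)
y = np.tanh(X @ rng.normal(size=30))    # toy targets in (-1, 1)

def loss(w):
    r = np.tanh(X @ w) - y
    return 0.5 * np.mean(r ** 2)

def grad(w):
    t = np.tanh(X @ w)
    return X.T @ ((t - y) * (1.0 - t ** 2)) / len(y)

def hvp(w, v, eps=1e-5):
    # Finite-difference stand-in for an exact Hessian-vector product;
    # the Hessian itself is never formed.
    return (grad(w + eps * v) - grad(w - eps * v)) / (2.0 * eps)

def krylov_basis(w, g, K):
    # Orthonormal basis of span(g, Hg, ..., H^{K-1} g), built Arnoldi-style.
    basis, v = [], g
    for _ in range(K):
        for b in basis:                 # modified Gram-Schmidt
            v = v - (b @ v) * b
        norm = np.linalg.norm(v)
        if norm < 1e-10:                # subspace stopped growing
            break
        v = v / norm
        basis.append(v)
        v = hvp(w, v)
    return np.stack(basis, axis=1)      # shape (dim, k) with k <= K

def bfgs_minimize(phi, dphi, a, iters=25):
    # Plain BFGS with Armijo backtracking on the small subspace problem.
    n = len(a)
    H_inv = np.eye(n)
    g = dphi(a)
    for _ in range(iters):
        d = -H_inv @ g
        if g @ d >= 0:                  # not a descent direction: reset
            H_inv, d = np.eye(n), -g
        f0, t = phi(a), 1.0
        while t > 1e-10 and phi(a + t * d) > f0 + 1e-4 * t * (g @ d):
            t *= 0.5
        if t <= 1e-10:                  # line search failed: stop early
            break
        s = t * d
        g_new = dphi(a + s)
        yv = g_new - g
        if s @ yv > 1e-12:              # update only when curvature is positive
            rho = 1.0 / (s @ yv)
            E = np.eye(n) - rho * np.outer(s, yv)
            H_inv = E @ H_inv @ E.T + rho * np.outer(s, s)
        a, g = a + s, g_new
    return a

def ksd(w, outer_iters=5, K=8):
    for _ in range(outer_iters):
        g = grad(w)
        if np.linalg.norm(g) < 1e-8:
            break
        V = krylov_basis(w, g, K)
        # Restrict the objective to w + V a and optimize the K coefficients a.
        phi = lambda a: loss(w + V @ a)
        dphi = lambda a: V.T @ grad(w + V @ a)
        a = bfgs_minimize(phi, dphi, np.zeros(V.shape[1]))
        w = w + V @ a
    return w

w0 = np.zeros(30)
w_final = ksd(w0)
print("initial loss:", loss(w0), "final loss:", loss(w_final))
```

Because the inner problem has only `K` unknowns, the BFGS inverse-Hessian estimate is a small `K × K` matrix, while the stored basis `V` costs `K` vectors of full parameter size. That matches the memory drawback relative to HF that the abstract mentions.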

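Experimental Results

Research Questions

RQ1: Can a second-order optimization method that avoids explicit Hessian inversion and damping-parameter tuning outperform Hessian Free (HF) and L-BFGS in deep learning training?
RQ2: Does Krylov Subspace Descent achieve faster convergence and better generalization than HF and L-BFGS on standard deep learning benchmarks?
RQ3: When overfitting is not the dominant problem, is pre-training still needed with advanced second-order methods such as KSD?
RQ4: When the Hessian is not positive semidefinite, how does KSD perform with the Gauss-Newton matrix compared with the actual Hessian?
RQ5: Can KSD be applied effectively to recurrent neural networks without the structural damping that HF requires?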

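Key Findings

KSD converges faster than Hessian Free (HF) and L-BFGS on all evaluated datasets (CURVES, MNIST, Aurora, and Starcraft).
On the MNIST classification task, KSD reaches a cross-validation error of 1.70%, compared with 2.01% for HF, with both methods at zero training error.
On the Aurora speech recognition task, KSD lowers the cross-validation error from HF's 8.7% to 8.1% and trains 3.3 times faster.
On the CURVES dataset, KSD lowers the cross-validation error from 0.25 to 0.19 and needs only 20% of HF's training time.
Using the Gauss-Newton matrix in place of the Hessian causes no loss in performance, and the method remains robust when the Hessian is not positive semidefinite.
KSD needs no pre-training on any task except MNIST, where pre-training gives only a slight improvement; this suggests that KSD may remove the dependence on pre-training in most deep learning settings.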


This review was generated by AI and reviewed by human editors.