QUICK REVIEW

[Paper Review] RMSProp and equilibrated adaptive learning rates for non-convex optimization.

Yann Dauphin, Harm de Vries | arXiv (Cornell University) | Feb 15, 2015

Stochastic Gradient Optimization Techniques | References: 7 | Cited by: 183

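One-Sentence Summary

This paper proposes ESGD, an unbiased stochastic estimator of the equilibration preconditioner, which improves adaptive learning rates for non-convex optimization by accounting for negative Hessian eigenvalues. RMSProp, by contrast, is a biased approximation of the same preconditioner. The two methods yield very similar step directions, but ESGD sometimes converges faster than RMSProp and consistently faster than plain SGD, at a small additional computational cost.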

ABSTRACT

Parameter-specific adaptive learning rate methods are computationally efficient ways to reduce the ill-conditioning problems encountered when training large deep networks. Following recent work that strongly suggests that most of the critical points encountered when training such networks are saddle points, we find how considering the presence of negative eigenvalues of the Hessian could help us design better suited adaptive learning rate schemes, i.e., diagonal preconditioners. We show that the optimal preconditioner is based on taking the absolute value of the Hessian's eigenvalues, which is not what Newton and classical preconditioners like Jacobi's do. In this paper, we propose a novel adaptive learning rate scheme based on the equilibration preconditioner and show that RMSProp approximates it, which may explain some of its success in the presence of saddle points. Whereas RMSProp is a biased estimator of the equilibration preconditioner, the proposed stochastic estimator, ESGD, is unbiased and only adds a small percentage to computing time. We find that both schemes yield very similar step directions but that ESGD sometimes surpasses RMSProp in terms of convergence speed, always clearly improving over plain stochastic gradient descent.

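Motivation and Objectives

Address the ill-conditioning problems that arise when training deep neural networks with adaptive learning rates.
Study how negative eigenvalues of the Hessian affect optimization dynamics in the non-convex setting.
Design better-suited adaptive learning rate schemes by working with the absolute values of the Hessian's eigenvalues, rather than relying on Newton's method or the classical Jacobi preconditioner.
Develop an unbiased, computationally efficient stochastic estimator suitable for large-scale deep learning.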
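
The role of negative eigenvalues in the objectives above can be illustrated with a minimal sketch. The toy function f(x, y) = x^2 - y^2, the starting point, and the helper names `newton_step` and `abs_eig_step` are assumptions chosen for illustration and do not come from the paper. The sketch compares a Newton step, which rescales by 1/lambda, with a step that rescales by 1/|lambda|.

```python
import numpy as np

# Toy saddle f(x, y) = x**2 - y**2 (an illustrative assumption, not from the paper).
# Its Hessian is constant, with eigenvalues +2 and -2.
H = np.array([[2.0, 0.0], [0.0, -2.0]])

def grad(p):
    # Gradient of the quadratic 0.5 * p^T H p, which equals x**2 - y**2.
    return H @ p

def newton_step(p):
    # Newton rescales each eigen-direction by 1/lambda, so the negative
    # eigenvalue flips the descent direction into an ascent direction.
    return p - np.linalg.solve(H, grad(p))

def abs_eig_step(p):
    # Rescale by 1/|lambda| instead, using |H| = V |Lambda| V^T.
    eigvals, eigvecs = np.linalg.eigh(H)
    H_abs = eigvecs @ np.diag(np.abs(eigvals)) @ eigvecs.T
    return p - np.linalg.solve(H_abs, grad(p))

p0 = np.array([1.0, 0.5])
p_newton = newton_step(p0)  # jumps onto the saddle point at the origin
p_abs = abs_eig_step(p0)    # zeroes x and moves y away from the saddle
```

On this toy problem the Newton step is attracted to the saddle, while the absolute-value step still decreases f along the negative-curvature direction.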

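Proposed Method

Proposes the equilibration preconditioner, which scales each parameter by the norm of the corresponding Hessian row and is motivated by the absolute values of the Hessian's eigenvalues, to keep optimization stable in the presence of saddle points.
Derives ESGD as an unbiased stochastic estimator of the equilibration preconditioner, improving on the biased estimate that RMSProp provides.
Estimates the preconditioner from a running average of squared Hessian-vector products (Hv)^2 with random Gaussian vectors v, analogous to RMSProp's running average of squared gradients but without its bias.
Introduces an update rule that keeps the cost close to that of RMSProp while giving an unbiased estimate of the preconditioner.
Uses a diagonal preconditioning strategy that adapts the learning rate per parameter according to the equilibration principle.
Analyzes the relationship between RMSProp and the equilibration preconditioner, showing that RMSProp is a biased approximation of the ideal scheme.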
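
The method described above can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the authors' implementation: the function name `esgd`, its parameters (`lr`, `damping`, `steps`, `seed`), and the toy indefinite Hessian are all hypothetical, and the Hessian-vector product is supplied by the caller as `hvp_fn`. The sketch also refreshes the estimate at every step for simplicity, whereas a practical implementation would likely amortize the extra Hessian-vector product over several steps.

```python
import numpy as np

def esgd(grad_fn, hvp_fn, theta, lr=0.01, damping=1e-4, steps=100, seed=0):
    """ESGD-style loop: divide the gradient by sqrt(mean((Hv)^2)).

    grad_fn(theta) returns the gradient; hvp_fn(theta, v) returns H(theta) @ v.
    All names and defaults here are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    d_sum = np.zeros_like(theta)  # running sum of squared Hessian-vector products
    for i in range(1, steps + 1):
        v = rng.standard_normal(theta.shape)     # v ~ N(0, I)
        d_sum += hvp_fn(theta, v) ** 2           # accumulate (Hv)^2 elementwise
        precond = np.sqrt(d_sum / i) + damping   # estimate of the Hessian row norms
        theta = theta - lr * grad_fn(theta) / precond
    return theta, np.sqrt(d_sum / steps)

# Usage on a toy quadratic with an indefinite Hessian (hypothetical example).
H = np.array([[2.0, 1.0], [1.0, -3.0]])
theta, d_est = esgd(lambda t: H @ t, lambda t, v: H @ v,
                    np.array([1.0, 1.0]), steps=50)
```

For a fixed Hessian, the returned estimate approaches the row norms of H as the number of steps grows, which is the quantity the equilibration preconditioner targets.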

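Experimental Results

Research Questions

RQ1: How do negative Hessian eigenvalues affect the performance of adaptive learning rate methods in non-convex optimization?
RQ2: Can a better-suited preconditioner be derived by considering the absolute values of the Hessian's eigenvalues rather than their signs?
RQ3: Why does RMSProp perform well around saddle points even though it is a biased estimator of the ideal preconditioner?
RQ4: Can an unbiased stochastic estimator of the equilibration preconditioner be designed at very low computational cost?
RQ5: In practice, does the proposed ESGD method converge faster than RMSProp and SGD?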

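Key Findings

ESGD is an unbiased estimator of the equilibration preconditioner, whereas RMSProp is a biased approximation of the same ideal scheme.
ESGD and RMSProp produce very similar step directions during optimization, suggesting that part of RMSProp's success comes from approximating the equilibration principle.
ESGD converges faster than plain stochastic gradient descent in all tested scenarios.
ESGD sometimes surpasses RMSProp in convergence speed, suggesting that the unbiased estimate can lead to better optimization dynamics.
ESGD has low computational overhead, adding only a small percentage to training time.
The equilibration preconditioner, based on the absolute values of the Hessian's eigenvalues, is theoretically better suited than Newton's method and the Jacobi preconditioner in non-convex settings with saddle points.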
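
The diagonal preconditioners compared in these findings can be written compactly as follows. The notation (D^J, D^E, D^RMS, H for the Hessian, g for the stochastic gradient, v for a standard Gaussian vector) is a convention assumed here for illustration. The last line shows why squared Hessian-vector products recover the squared row norms, using E[v_j v_k] = delta_jk and the symmetry of H.

```latex
% Jacobi preconditioner: absolute value of the Hessian diagonal
D^{J}_{ii} = \lvert H_{ii} \rvert

% Equilibration preconditioner: Euclidean norm of each Hessian row
D^{E}_{ii} = \lVert H_{i,\cdot} \rVert_2 = \sqrt{\left[\operatorname{diag}(H^2)\right]_i}

% RMSProp: root of a running average of squared gradients
D^{\mathrm{RMS}}_{ii} = \sqrt{\mathbb{E}\left[g_i^2\right]}

% Squared Hessian-vector products recover the squared row norms
\mathbb{E}_{v \sim \mathcal{N}(0, I)}\left[(Hv)_i^2\right]
  = \sum_{j,k} H_{ij} H_{ik}\, \mathbb{E}[v_j v_k]
  = \sum_{j} H_{ij}^2
  = \lVert H_{i,\cdot} \rVert_2^2
```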


This review was generated by AI and reviewed by a human editor.