QUICK REVIEW

[Paper Digest] An Inertial Newton Algorithm for Deep Learning

Camille Castera, Jérôme Bolte|arXiv (Cornell University)|May 29, 2019

Model Reduction and Neural Networks | References: 47 | Cited by: 19

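One-Sentence Summary

This paper proposes INNA, a new inertial Newton algorithm for deep learning that combines second-order Newton-like dynamics with momentum-like inertia while using only stochastic approximations of gradients and function values; the authors prove convergence for non-smooth, non-convex deep learning problems, establish a sublinear rate for the underlying continuous-time dynamics, and report empirical performance on the CIFAR and MNIST benchmarks that matches or exceeds state-of-the-art methods such as ADAM and SGD with very little hyperparameter tuning.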

ABSTRACT

We introduce a new second-order inertial optimization method for machine learning called INNA. It exploits the geometry of the loss function while only requiring stochastic approximations of the function values and the generalized gradients. This makes INNA fully implementable and adapted to large-scale optimization problems such as the training of deep neural networks. The algorithm combines both gradient-descent and Newton-like behaviors as well as inertia. We prove the convergence of INNA for most deep learning problems. To do so, we provide a well-suited framework to analyze deep learning loss functions involving tame optimization in which we study a continuous dynamical system together with its discrete stochastic approximations. We prove sublinear convergence for the continuous-time differential inclusion which underlies our algorithm. Additionally, we also show how standard optimization mini-batch methods applied to non-smooth non-convex problems can yield a certain type of spurious stationary points never discussed before. We address this issue by providing a theoretical framework around the new idea of $D$-criticality; we then give a simple asymptotic analysis of INNA. Our algorithm allows for using an aggressive learning rate of $o(1/\log k)$. From an empirical viewpoint, we show that INNA returns competitive results with respect to state of the art (stochastic gradient descent, ADAGRAD, ADAM) on popular deep learning benchmark problems.

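Research Motivation and Objectives

Design a second-order optimization algorithm for deep neural networks that combines Newton-type curvature information with inertial momentum while using only stochastic approximations of gradients and function values.
Establish theoretical convergence of INNA for non-smooth, non-convex deep learning loss functions, which are common in practice.
Address the problem of spurious stationary points in mini-batch stochastic optimization by introducing the notion of D-criticality.
Provide a robust and scalable algorithm that allows aggressive learning rates of order o(1/log k) without sacrificing stability.
Validate empirically, on standard deep learning benchmarks, that INNA is competitive with or better than state-of-the-art methods such as SGD, ADAM, and ADAGRAD in both convergence speed and final accuracy.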

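Proposed Method

The algorithm is derived from a continuous-time dynamical system (DIN) with inertial, damping, Newton, and gradient-descent terms, which is then discretized for practical use.
A phase-space lifting technique avoids computing the Hessian directly and relies instead on stochastic approximations of gradients and function values.
The method uses a generalized-gradient oracle and operates within the framework of tame optimization to handle the non-smoothness of deep learning loss functions.
A new theoretical framework based on D-critical points is proposed to analyze and mitigate the spurious stationary points caused by mini-batch subsampling.
Step sizes of the form γ₀·k^(−q) (with q ≤ 1/2) are used, giving a slow decay that improves convergence in practice.
Convergence is proved under weak assumptions that cover most deep learning problems, by analyzing the continuous differential inclusion together with its discrete stochastic approximations.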
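
The phase-space lifting idea can be sketched in a few lines of Python. The update below follows the INNA iteration as it is commonly stated, where an auxiliary variable ψ takes the place of the Hessian term of DIN. The exact form of the update, the initialization ψ₀ = (1 − αβ)θ₀, the toy quadratic loss, and the names `inna_step`, `noisy_grad`, and `run_inna` are assumptions made for illustration and should be checked against the paper.

```python
import random

# Assumed continuous-time model (DIN):
#   theta'' + alpha*theta' + beta*Hess J(theta)*theta' + grad J(theta) = 0
# Assumed lifted first-order form (no Hessian appears):
#   theta' = -(alpha - 1/beta)*theta - psi/beta - beta*grad J(theta)
#   psi'   = -(alpha - 1/beta)*theta - psi/beta


def inna_step(theta, psi, grad, gamma, alpha=0.5, beta=0.1):
    """One explicit-Euler step of the lifted system on lists of floats."""
    new_theta, new_psi = [], []
    for t, p, g in zip(theta, psi, grad):
        # Drift shared by both variables of the lifted system.
        drift = -(alpha - 1.0 / beta) * t - p / beta
        new_theta.append(t + gamma * (drift - beta * g))
        new_psi.append(p + gamma * drift)
    return new_theta, new_psi


def noisy_grad(theta, rng, noise):
    """Hypothetical stochastic oracle: gradient of 0.5*||theta||^2 plus Gaussian noise."""
    return [t + noise * rng.gauss(0.0, 1.0) for t in theta]


def run_inna(theta0, steps=2000, gamma0=0.1, q=0.5,
             alpha=0.5, beta=0.1, noise=0.1, seed=0):
    """Run the sketch with step sizes gamma0 * (k + 1) ** (-q)."""
    rng = random.Random(seed)
    theta = list(theta0)
    # Assumed initialization: the drift vanishes at step 0,
    # so the first move is a plain gradient step scaled by beta.
    psi = [(1.0 - alpha * beta) * t for t in theta]
    for k in range(steps):
        gamma = gamma0 * (k + 1) ** (-q)
        grad = noisy_grad(theta, rng, noise)
        theta, psi = inna_step(theta, psi, grad, gamma, alpha, beta)
    return theta
```

For instance, `run_inna([1.0, -2.0], steps=5000, q=0.25, noise=0.0)` drives both coordinates toward the minimizer at 0. Only gradient evaluations appear in the loop; in this sketch the Newton-like behavior comes from the coupling between θ and ψ rather than from any second-derivative computation.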

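Experimental Results

Research Questions

RQ1: Can a second-order inertial optimization method for deep learning be designed using only stochastic approximations of gradients and function values?
RQ2: How can inertia and Newton dynamics be combined in a computationally feasible and stable way in high-dimensional, non-smooth, non-convex settings?
RQ3: What theoretical convergence guarantees can such an algorithm offer in the context of deep neural networks?
RQ4: How does mini-batch subsampling give rise to spurious stationary points, and can they be formally characterized and avoided?
RQ5: Can the proposed algorithm outperform existing state-of-the-art methods such as ADAM and SGD in convergence speed and final accuracy?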

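Key Findings

Even under weak regularity assumptions on the loss function, sublinear convergence is established for the continuous-time differential inclusion underlying INNA.
The introduction of D-critical points provides a new theoretical framework for analyzing and avoiding the spurious stationary points that mini-batch subsampling creates in non-convex optimization.
Empirical results show that INNA performs on par with or better than ADAM and SGD on CIFAR-10, CIFAR-100, and MNIST, with the most notable gain in test accuracy on CIFAR-100.
INNA is robust to the choice of the hyperparameters α and β: (α, β) = (0.5, 0.1) serves as a stable default, and training speed is mainly influenced by these parameters.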
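With the slow step-size decay k^(−1/4), INNA outperforms ADAM in both training speed and final performance, which illustrates the benefit of aggressive learning-rate schedules.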
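The algorithm is highly tunable and reproducible, and it achieves strong results with minimal tuning, which supports its practical use in real deep learning workflows.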
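
To make the role of the decay exponent concrete, the following sketch evaluates the schedule γ₀·k^(−q) for two exponents and provides a helper for checking numerically that such polynomial decays shrink faster than 1/log k. The function names and the value γ₀ = 0.1 are illustrative assumptions, not settings taken from the paper.

```python
import math


def step_size(k, gamma0=0.1, q=0.25):
    """Return gamma_k = gamma0 * k**(-q) for an iteration index k >= 1."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return gamma0 * k ** (-q)


def log_ratio(k, q):
    """Return k**(-q) * log(k).

    For any q > 0 this quantity tends to 0 as k grows,
    which is what k**(-q) = o(1/log k) means.
    """
    return k ** (-q) * math.log(k)
```

After 10⁴ iterations, the exponent q = 1/2 has shrunk the initial step by a factor of 100, whereas q = 1/4 has shrunk it only by a factor of 10, so the slower decay keeps the steps an order of magnitude larger at that point.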


This digest was generated by AI and reviewed by human editors.