QUICK REVIEW

[Paper Review] Maximum Principle Based Algorithms for Deep Learning

Qianxiao Li, Long Chen | arXiv (Cornell University) | Oct 26, 2017

Model Reduction and Neural Networks | Cited by 82

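One-Sentence Summary

This paper formulates deep learning as a continuous-time optimal control problem and derives training algorithms based on Pontryagin's maximum principle (PMP), in particular the method of successive approximations (MSA) and an extended PMP/MSA with convergence guarantees. The resulting algorithms show favorable initial convergence, decouple the optimization across layers, and are potentially robust to flat loss landscapes.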

ABSTRACT

The continuous dynamical system approach to deep learning is explored in order to devise alternative frameworks for training algorithms. Training is recast as a control problem and this allows us to formulate necessary optimality conditions in continuous time using the Pontryagin's maximum principle (PMP). A modification of the method of successive approximations is then used to solve the PMP, giving rise to an alternative training algorithm for deep learning. This approach has the advantage that rigorous error estimates and convergence results can be established. We also show that it may avoid some pitfalls of gradient-based methods, such as slow convergence on flat landscapes near saddle points. Furthermore, we demonstrate that it obtains favorable initial convergence rate per-iteration, provided Hamiltonian maximization can be efficiently carried out - a step which is still in need of improvement. Overall, the approach opens up new avenues to attack problems associated with deep learning, such as trapping in slow manifolds and inapplicability of gradient-based methods for discrete trainable variables.

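Motivation and Objectives

Motivate and formalize deep learning as a continuous-time optimal control problem.
Derive the Pontryagin maximum principle (PMP) conditions for optimal training.
Develop a numerical algorithm (MSA) for solving the PMP, with error and convergence analysis.
Introduce an extended PMP/MSA that improves convergence and handles feasibility of the dynamics.
Connect the framework to deep residual networks and discuss discretization and mini-batch training considerations.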
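
The link to residual networks named in the last objective can be made concrete with a forward-Euler discretization of the controlled dynamics. The step size \delta and the layer count N are notation introduced here for illustration only.

```latex
\dot{X}_t = f(t, X_t, \theta_t), \quad t \in [0, T]
\qquad \longrightarrow \qquad
X_{n+1} = X_n + \delta \, f(t_n, X_n, \theta_n),
\quad n = 0, \dots, N-1, \quad \delta = T/N .
```

Each update has the form of a residual block, an identity skip connection plus a learned increment, with the layer index n playing the role of time.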

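Proposed Method

Define the dynamical system Ẋ_t = f(t, X_t, θ_t) with loss Φ(X_T) + ∫_0^T L(θ_t) dt.
Introduce the Hamiltonian H(t, x, p, θ) = p·f(t, x, θ) − L(θ) and state the PMP conditions, equations (3)-(5) of the paper.
Propose the basic MSA: in each iteration, propagate X forward, solve for P backward, then update θ at every t by maximizing the Hamiltonian.
Modify this into an extended PMP with an augmented Hamiltonian H̃ that penalizes violations of the Hamiltonian dynamics (feasibility errors), and derive the Extended MSA (E-MSA) with convergence guarantees.
Give a discrete-time formulation that shows the relation to residual networks and back-propagation.
Discuss mini-batch extensions and practical considerations for Hamiltonian maximization.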
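
The difference between the basic and extended updates can be illustrated on a toy problem that is not taken from the paper. The sketch below assumes a control that enters the dynamics directly, f(t, x, θ) = θ, a terminal loss Φ(x) = ½‖x − y‖², a running cost L(θ) = ½λ‖θ‖², and an augmented Hamiltonian of the quadratic-penalty form H̃ = H − ½ρ‖v − f‖² − ½ρ‖q + ∇_x H‖², where v and q are the state and costate derivatives of the previous iterate. Under these assumptions both Hamiltonian maximizations have closed forms. All function names are illustrative.

```python
import numpy as np

def costate(theta, x0, y, T):
    """Costate for dX/dt = theta with a time-constant control.

    X_T = x0 + T * theta, and dP/dt = -grad_x H = 0, so P_t stays at its
    terminal value -grad Phi(X_T) = -(X_T - y) for all t.
    """
    return -((x0 + T * theta) - y)

def basic_msa_step(theta, x0, y, T, lam):
    """Basic MSA: theta <- argmax_th [ p.th - 0.5*lam*|th|^2 ] = p / lam."""
    p = costate(theta, x0, y, T)
    return p / lam

def extended_msa_step(theta, x0, y, T, lam, rho):
    """E-MSA under the assumed quadratic-penalty augmented Hamiltonian.

    Here grad_x H = 0 and q = dP/dt = 0, so only the state penalty
    0.5*rho*|v - th|^2 with v = theta (previous iterate) is active:
    argmax_th [ p.th - 0.5*lam*|th|^2 - 0.5*rho*|th - theta|^2 ].
    """
    p = costate(theta, x0, y, T)
    return (p + rho * theta) / (lam + rho)

def pmp_fixed_point(x0, y, T, lam):
    """Control satisfying theta = p(theta) / lam, i.e. the PMP solution."""
    return -(x0 - y) / (lam + T)

def iterate(step, theta0, n_iter, **params):
    """Apply an update rule repeatedly and record every iterate."""
    theta = theta0
    history = [theta]
    for _ in range(n_iter):
        theta = step(theta, **params)
        history.append(theta)
    return np.array(history)

if __name__ == "__main__":
    x0, y = np.array([0.0]), np.array([3.0])
    common = dict(x0=x0, y=y, T=2.0, lam=1.0)
    print("fixed point:", pmp_fixed_point(**common))
    print("basic MSA:  ", iterate(basic_msa_step, np.zeros(1), 6, **common).ravel())
    print("E-MSA:      ", iterate(extended_msa_step, np.zeros(1), 6, rho=5.0, **common).ravel())
```

In this toy setting the basic update multiplies the error by −T/λ on each iteration, so it diverges whenever T > λ, while the extended update has contraction factor (ρ − T)/(λ + ρ), which is below one in magnitude once ρ is large enough. Both rules share the same fixed point.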

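Experimental Results

Research Questions

RQ1: Can the PMP provide a viable and convergent alternative to gradient-based training for deep learning?
RQ2: Does the extended PMP/MSA guarantee convergence by penalizing feasibility errors in the Hamiltonian dynamics?
RQ3: How does PMP-based training compare with SGD/Adam in convergence speed and sensitivity to saddle points?
RQ4: How can the PMP framework be discretized and related to residual networks and back-propagation?
RQ5: What are the practical considerations for mini-batch training, and how efficient is Hamiltonian maximization?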

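Key Findings

PMP-based training yields forward/backward Hamiltonian dynamics with Hamiltonian maximization decoupled across layers, which makes parallelization possible in principle.
The basic MSA may diverge; the extended MSA with an augmented Hamiltonian comes with convergence guarantees for the extended PMP when ρ is sufficiently large.
The extended framework provides explicit error control through the feasibility terms and achieves descent in the objective J(θ).
Numerical experiments indicate that, when Hamiltonian maximization is efficient, E-MSA has a favorable initial per-iteration convergence rate and can mitigate slow convergence on flat landscapes or near saddle points.
The discrete-time formulation recovers the structure of conventional residual-network training; relaxing the maximization step to a gradient step relates it to back-propagation.
Mini-batch extensions are discussed; under suitable conditions, heuristic convergence arguments are supported by standard law-of-large-numbers reasoning.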
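
The relation between a relaxed maximization step and back-propagation can be checked numerically. The following is a minimal sketch, not the paper's implementation: it assumes a hypothetical layer function f(x, θ) = tanh(θx), a squared-error terminal loss, an L2 running cost with weight lam, and step size delta. With those choices, the gradient of the per-layer Hamiltonian H_n = p_{n+1}·f(x_n, θ_n) − L(θ_n) equals −(1/delta) times the gradient of the objective with respect to θ_n, so one gradient-ascent sweep on the Hamiltonians is a gradient-descent step on J.

```python
import numpy as np

def forward(thetas, x0, delta):
    """Residual (forward-Euler) pass: x_{n+1} = x_n + delta * tanh(theta_n @ x_n)."""
    xs = [x0]
    for theta in thetas:
        xs.append(xs[-1] + delta * np.tanh(theta @ xs[-1]))
    return xs

def objective(thetas, x0, y, delta, lam):
    """J = 0.5*|x_N - y|^2 + delta * sum_n 0.5*lam*|theta_n|^2."""
    x_final = forward(thetas, x0, delta)[-1]
    penalty = sum(0.5 * lam * np.sum(th ** 2) for th in thetas)
    return 0.5 * np.sum((x_final - y) ** 2) + delta * penalty

def costates(thetas, xs, y, delta):
    """Backward pass: p_N = -(x_N - y), p_n = p_{n+1} + delta * (df/dx)^T p_{n+1}."""
    ps = [None] * (len(thetas) + 1)
    ps[-1] = -(xs[-1] - y)
    for n in reversed(range(len(thetas))):
        s = 1.0 - np.tanh(thetas[n] @ xs[n]) ** 2  # derivative of tanh
        ps[n] = ps[n + 1] + delta * thetas[n].T @ (s * ps[n + 1])
    return ps

def hamiltonian_grads(thetas, xs, ps, lam):
    """Gradient in theta of H_n = p_{n+1} . tanh(theta @ x_n) - 0.5*lam*|theta|^2."""
    grads = []
    for n, theta in enumerate(thetas):
        s = 1.0 - np.tanh(theta @ xs[n]) ** 2
        grads.append(np.outer(s * ps[n + 1], xs[n]) - lam * theta)
    return grads

def relaxed_msa_sweep(thetas, x0, y, delta, lam, step):
    """One sweep: forward pass, backward costate pass, gradient ascent on each H_n."""
    xs = forward(thetas, x0, delta)
    ps = costates(thetas, xs, y, delta)
    grads = hamiltonian_grads(thetas, xs, ps, lam)
    return [th + step * g for th, g in zip(thetas, grads)]
```

Each layer's update uses only x_n and p_{n+1}, which is the layer-wise decoupling described above; replacing the single ascent step with an exact or approximate argmax of H_n would give an MSA-style update instead.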


This review was generated by AI and reviewed by a human editor.