QUICK REVIEW

[Paper Review] Distributed optimization of deeply nested systems

Miguel Á. Carreira-Perpiñán, Weiran Wang|arXiv (Cornell University)|Dec 24, 2012

Sparse and Compressive Sensing Techniques|References: 28|Cited by: 104

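ONE-SENTENCE SUMMARY

This paper proposes the method of auxiliary coordinates (MAC), a new optimization framework for deeply nested systems such as deep neural networks. By introducing auxiliary variables, MAC recasts the nonconvex nested problem as a constrained problem in an augmented space; this yields provably convergent, massively parallel optimization that avoids vanishing gradients, reuses existing single-layer algorithms, and remains competitive with state-of-the-art methods even in the serial setting.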

ABSTRACT

In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.

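RESEARCH MOTIVATION AND GOALS

Address the difficulty of jointly optimizing deeply nested, nonconvex systems such as deep neural networks.
Overcome limitations of backpropagation, including vanishing gradients, poor parallelizability, and reliance on parameter derivatives.
Develop a general optimization strategy that reuses existing single-layer algorithms and supports distributed computation.
Learn the parameters and, to some extent, the architecture of hierarchical systems with provable convergence.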

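PROPOSED METHOD

Introduces auxiliary coordinates $Z$ that stand for the hidden-unit activations, turning the deeply nested function into a constrained optimization problem in an augmented space.
Replaces the nested objective $E_1(W)$ with a constrained objective $E(W, Z)$ subject to the equality constraints $z_{n,k,h} = f_{k,h}(z_{n,k-1}; W_k)$, one for every data point $n$ and every layer $k$ ($h$ indexes the units of layer $k$, and $z_{n,0} = x_n$ is the input).
Solves the constrained problem with the quadratic-penalty method: for an increasing penalty parameter $\mu$, it minimizes the penalty function $E_Q(W, Z; \mu) = E(W, Z) + \frac{\mu}{2} \sum_{n,k} \lVert z_{n,k} - f_k(z_{n,k-1}; W_k) \rVert^2$. This is a quadratic-penalty function, not an augmented Lagrangian, since it carries no multiplier terms.
Alternates optimization over the parameters $W$ and the auxiliary coordinates $Z$, which parallelizes simply and massively across data points and layers.
Proves convergence to KKT points of the constrained problem, which under mild regularity conditions correspond to stationary points of the original nested problem.
Permits derivative-free optimizers and handles nondifferentiable mappings, since each layer is fitted separately in the auxiliary-coordinate reformulation.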
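
The alternation between a W step and a Z step can be sketched on a toy problem. The code below is a minimal illustration, not the authors' implementation. It assumes a network with one tanh hidden layer and a linear output layer, a squared-error loss, and a fixed penalty schedule. All function names (`hidden`, `w_step`, `z_step`, `mac_train`) are hypothetical. The hidden-layer fit uses plain gradient descent with step halving as a stand-in for whatever single-layer optimizer one would reuse in practice.

```python
import numpy as np

def hidden(W1, X):
    """Hidden-layer mapping f_1(x; W1) = tanh(x W1), applied row-wise."""
    return np.tanh(X @ W1)

def nested_error(W1, W2, X, Y):
    """Nested objective E_1(W) = 1/2 sum_n ||y_n - f_2(f_1(x_n; W1); W2)||^2."""
    return 0.5 * np.sum((Y - hidden(W1, X) @ W2) ** 2)

def penalty_objective(W1, W2, Z, X, Y, mu):
    """Quadratic-penalty function E_Q(W, Z; mu) for one hidden layer."""
    fit = 0.5 * np.sum((Y - Z @ W2) ** 2)
    violation = 0.5 * mu * np.sum((Z - hidden(W1, X)) ** 2)
    return fit + violation

def w_step(W1, Z, X, Y, n_grad=20):
    """Fit each layer separately with Z held fixed (two single-layer problems)."""
    # Output layer: linear least squares from Z to Y, solved exactly.
    W2 = np.linalg.lstsq(Z, Y, rcond=None)[0]

    # Hidden layer: fit tanh(X W1) to Z by gradient descent with step halving
    # (an assumed stand-in optimizer; a step is accepted only if it does not
    # increase the loss).
    def loss(W):
        return 0.5 * np.sum((Z - np.tanh(X @ W)) ** 2)

    for _ in range(n_grad):
        H = np.tanh(X @ W1)
        grad = -X.T @ ((Z - H) * (1.0 - H ** 2))
        current, t = loss(W1), 1.0
        while loss(W1 - t * grad) > current and t > 1e-12:
            t *= 0.5
        if loss(W1 - t * grad) <= current:
            W1 = W1 - t * grad
    return W1, W2

def z_step(W1, W2, X, Y, mu):
    """Minimize E_Q over Z with W fixed; every row (data point) is independent.

    With a linear output layer the minimizer is closed-form:
    Z (W2 W2^T + mu I) = Y W2^T + mu * hidden(W1, X).
    """
    A = W2 @ W2.T + mu * np.eye(W2.shape[0])
    B = Y @ W2.T + mu * hidden(W1, X)
    return np.linalg.solve(A, B.T).T

def mac_train(X, Y, n_hidden=5, mus=(1.0, 10.0, 100.0), iters_per_mu=10, seed=0):
    """Alternate W and Z steps while increasing the penalty parameter mu."""
    rng = np.random.default_rng(seed)
    W1 = 0.5 * rng.standard_normal((X.shape[1], n_hidden))
    W2 = 0.5 * rng.standard_normal((n_hidden, Y.shape[1]))
    Z = hidden(W1, X)  # start on the constraint set
    history = []       # (mu, E_Q) after each W/Z sweep
    for mu in mus:
        for _ in range(iters_per_mu):
            W1, W2 = w_step(W1, Z, X, Y)
            Z = z_step(W1, W2, X, Y, mu)
            history.append((mu, penalty_objective(W1, W2, Z, X, Y, mu)))
    return W1, W2, Z, history

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.standard_normal((100, 3))
    Y = np.tanh(X @ rng.standard_normal((3, 4))) @ rng.standard_normal((4, 2))
    W1, W2, Z, history = mac_train(X, Y, n_hidden=4)
    print("nested error after MAC:", nested_error(W1, W2, X, Y))
```

In this sketch each sub-step either solves its subproblem exactly or accepts only non-increasing steps, so the penalty objective cannot increase while $\mu$ is held fixed. It may jump when $\mu$ is raised. With deeper networks or nonlinear output layers the Z step would generally need an iterative solver per data point rather than a linear solve.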

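EXPERIMENTAL RESULTS

Research questions

RQ1: Can a general optimization method for deeply nested systems avoid the vanishing-gradient problem inherent in backpropagation?
RQ2: How can parameters and architecture be learned jointly in hierarchical systems with provable convergence and scalability?
RQ3: Can existing single-layer optimization algorithms be reused in a distributed, massively parallel environment to train nested systems end to end?
RQ4: Under what conditions do stationary points of the auxiliary-coordinate reformulation correspond to meaningful solutions of the original nested problem?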

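Key findings

MAC provably converges to KKT points of the constrained problem, which under mild regularity conditions correspond to stationary points of the original nested problem.
The method parallelizes naturally across data points and layers, which suits distributed computation, for example on cloud architectures.
MAC makes fast initial progress: even with simple local optimizers, it often yields a reasonable model within a few iterations.
The method still applies when mappings are nondifferentiable and can be combined with derivative-free optimizers.
Theoretical analysis establishes a one-to-one correspondence between the minimizers, maximizers, and saddle points of the original nested problem and those of the MAC-constrained problem.
Experiments show that, even in the serial setting, MAC is competitive with state-of-the-art nonlinear optimizers.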
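
The parallelism over data points can be checked directly on a small case. The sketch below assumes a network with one tanh hidden layer and a linear output layer, so that the update of each data point's auxiliary coordinates is a small linear solve. It dispatches those solves to a thread pool and compares the result with a single batched solve. The function names are hypothetical, and local threads only stand in for the distributed workers the paper has in mind.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def z_step_single(x_n, y_n, W1, W2, mu):
    """Update the auxiliary coordinates of ONE data point.

    Minimizes 1/2 ||y_n - W2^T z||^2 + mu/2 ||z - tanh(W1^T x_n)||^2 over z,
    which needs only (x_n, y_n) and the current weights.
    """
    h_n = np.tanh(x_n @ W1)
    A = W2 @ W2.T + mu * np.eye(W2.shape[0])
    return np.linalg.solve(A, W2 @ y_n + mu * h_n)

def z_step_parallel(X, Y, W1, W2, mu, max_workers=4):
    """Solve every data point's subproblem as an independent task."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        rows = pool.map(
            lambda pair: z_step_single(pair[0], pair[1], W1, W2, mu),
            zip(X, Y),
        )
        return np.vstack(list(rows))

def z_step_batch(X, Y, W1, W2, mu):
    """Same update written as one batched linear solve, for comparison."""
    A = W2 @ W2.T + mu * np.eye(W2.shape[0])
    B = Y @ W2.T + mu * np.tanh(X @ W1)
    return np.linalg.solve(A, B.T).T

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X, Y = rng.standard_normal((50, 3)), rng.standard_normal((50, 2))
    W1, W2 = rng.standard_normal((3, 4)), rng.standard_normal((4, 2))
    Z_par = z_step_parallel(X, Y, W1, W2, mu=10.0)
    Z_bat = z_step_batch(X, Y, W1, W2, mu=10.0)
    print("per-point and batched updates agree:", np.allclose(Z_par, Z_bat))
```

Because no task reads another data point's coordinates, the same decomposition would carry over to separate machines that each hold a shard of the data, with only the weights shared between steps.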

This review was generated by AI and reviewed by human editors.