QUICK REVIEW

[Paper Review] A block coordinate descent optimizer for classification problems exploiting convexity

Ravi G. Patel, Nathaniel Trask|arXiv (Cornell University)|Jan 1, 2020

3D Shape Modeling and Analysis | References: 25 | Citations: 3

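One-Sentence Summary

This paper proposes a hybrid Newton/gradient descent (NGD) optimizer for deep learning classification that exploits the convexity of the cross-entropy loss in the weights of the linear layer. By alternating Newton steps on the linear layer, which yield globally optimal weights for the current hidden layers, with gradient descent on the hidden layers, NGD accelerates convergence and improves test accuracy: on CIFAR-10 it converges up to 4x faster and improves final test accuracy by 1.76% with a convolutional architecture.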

ABSTRACT

Second-order optimizers hold intriguing potential for deep learning, but suffer from increased cost and sensitivity to the non-convexity of the loss surface as compared to gradient-based approaches. We introduce a coordinate descent method to train deep neural networks for classification tasks that exploits global convexity of the cross-entropy loss in the weights of the linear layer. Our hybrid Newton/Gradient Descent (NGD) method is consistent with the interpretation of hidden layers as providing an adaptive basis and the linear layer as providing an optimal fit of the basis to data. By alternating between a second-order method to find globally optimal parameters for the linear layer and gradient descent to train the hidden layers, we ensure an optimal fit of the adaptive basis to data throughout training. The size of the Hessian in the second-order step scales only with the number of weights in the linear layer and not the depth and width of the hidden layers; furthermore, the approach is applicable to arbitrary hidden layer architecture. Previous work applying this adaptive basis perspective to regression problems demonstrated significant improvements in accuracy at reduced training cost, and this work can be viewed as an extension of this approach to classification problems. We first prove that the resulting Hessian matrix is symmetric semi-definite, and that the Newton step realizes a global minimizer. By studying classification of manufactured two-dimensional point cloud data, we demonstrate both an improvement in validation error and a striking qualitative difference in the basis functions encoded in the hidden layer when trained using NGD. Application to image classification benchmarks for both dense and convolutional architectures reveals improved training accuracy, suggesting possible gains of second-order methods over gradient descent.

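Motivation and Objectives

Develop a second-order optimization method for classification that exploits convexity in the linear layer of a deep neural network.
Reduce training cost and speed up convergence by decoupling the optimization of the linear and nonlinear weights.
Investigate whether second-order methods can outperform stochastic gradient descent in accuracy and convergence on classification tasks.
Study how the choice of optimizer affects the basis functions learned in the hidden layers.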

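Proposed Method

The method uses block coordinate descent, alternating between Newton steps on the linear-layer weights W and gradient descent on the hidden-layer weights ξ.
With ξ held fixed, the loss is convex in W, so Newton's method with a line search can find a global minimizer.
The size of the Hessian depends only on the number of linear-layer weights, not on the depth or width of the hidden layers.
Newton steps are performed on mini-batches to keep the computation efficient and stable.
The algorithm is implemented in TensorFlow and open-sourced at github.com/rgp62/.
The method interprets the hidden layers as a data-driven adaptive basis, with the linear-layer weights providing an optimal fit of that basis to the data.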
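
The alternation described above can be illustrated with a minimal NumPy sketch. It is not the authors' TensorFlow implementation: the single tanh hidden layer, the function names (`newton_step`, `gd_step`, `train`), the small ridge term added to the Hessian, the full-batch updates, and the use of one Newton step per outer iteration are all assumptions made to keep the example short.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def loss(X, Y, xi, W):
    # Mean cross-entropy; hidden features tanh(X @ xi) act as the adaptive basis.
    P = softmax(np.tanh(X @ xi) @ W)
    return -np.mean(np.sum(Y * np.log(P + 1e-12), axis=1))

def newton_step(X, Y, xi, W, ridge=1e-6):
    """One damped Newton step in W with xi fixed (loss is convex in W)."""
    Phi = np.tanh(X @ xi)                      # (N, m) basis functions
    N, m = Phi.shape
    K = W.shape[1]
    P = softmax(Phi @ W)                       # (N, K) class probabilities
    g = (Phi.T @ (P - Y) / N).ravel()          # gradient, W flattened row by row
    # Per-sample softmax curvature: diag(p_n) - p_n p_n^T, shape (N, K, K).
    S = P[:, :, None] * np.eye(K)[None] - P[:, :, None] * P[:, None, :]
    # Hessian is (m*K) x (m*K): it depends only on the linear-layer size.
    H = np.einsum('ni,nj,nkl->ikjl', Phi, Phi, S).reshape(m * K, m * K) / N
    # Assumption: a small ridge makes the semi-definite Hessian invertible.
    d = -np.linalg.solve(H + ridge * np.eye(m * K), g).reshape(m, K)
    # Backtracking (Armijo) line search along the Newton direction.
    f0, slope, t = loss(X, Y, xi, W), g @ d.ravel(), 1.0
    while loss(X, Y, xi, W + t * d) > f0 + 1e-4 * t * slope and t > 1e-8:
        t *= 0.5
    return W + t * d

def gd_step(X, Y, xi, W, lr=0.1):
    """One gradient-descent step in the hidden weights xi with W fixed."""
    Phi = np.tanh(X @ xi)
    P = softmax(Phi @ W)
    dpre = ((P - Y) / len(X)) @ W.T * (1.0 - Phi ** 2)  # backprop through tanh
    return xi - lr * (X.T @ dpre)

def train(X, Y, m=8, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    xi = rng.normal(size=(X.shape[1], m))
    W = np.zeros((m, Y.shape[1]))
    for _ in range(iters):
        W = newton_step(X, Y, xi, W)   # block 1: second-order update of W
        xi = gd_step(X, Y, xi, W)      # block 2: first-order update of xi
    return xi, newton_step(X, Y, xi, W)  # re-fit W to the final basis
```

Calling `train(X, Y)` with an `(N, d)` input array and an `(N, K)` one-hot label array returns the hidden weights and the linear-layer weights. The Hessian here has `m * K` rows regardless of how the features `Phi` are produced, which is the scaling property the method relies on.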

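Experimental Results

Research Questions

RQ1: Does exploiting convexity in the linear-layer weights lead to faster and more accurate training of deep neural network classifiers?
RQ2: How does the NGD optimizer compare with standard stochastic gradient descent in convergence speed and final accuracy?
RQ3: How do the basis functions encoded in the hidden layers differ qualitatively when trained with NGD versus GD?
RQ4: How does the tolerance used in the Newton step affect the model's generalization and robustness?
RQ5: Can second-order optimization be applied to deep networks efficiently, at an acceptable computational cost?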

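Key Findings

On the CIFAR-10 benchmark, NGD reaches its maximum validation accuracy in roughly one quarter of the iterations required by GD.
With the ConvNet architecture on CIFAR-10, NGD improves final test accuracy by 1.76% over GD.
On the MNIST, Fashion MNIST, and peaks benchmarks, NGD reaches higher validation accuracy faster than GD.
The basis functions learned with NGD show markedly more regular, structured patterns, indicating a qualitative difference in how parameter space is explored.
The Hessian with respect to the linear-layer weights is proven to be symmetric positive semi-definite, which confirms that Newton's method reaches a global minimizer.
The method delivers consistent gains across architectures, including fully connected and convolutional networks, without requiring any change to the network structure.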
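
The positive semi-definiteness result listed above can be checked with a short derivation. The notation is an assumption made for this sketch and is not taken from the paper: \phi_n is the hidden-layer output for sample n (fixed once ξ is fixed), y_n is the one-hot label, p_n is the predicted class distribution, and W is flattened row by row.

```latex
\begin{aligned}
L(W) &= -\frac{1}{N}\sum_{n=1}^{N}\sum_{k=1}^{K} y_{nk}\,\log p_{nk},
\qquad p_n = \operatorname{softmax}\!\left(W^{\top}\phi_n\right),\\
\nabla_W L &= \frac{1}{N}\sum_{n=1}^{N} \phi_n\,(p_n - y_n)^{\top},\\
H &= \frac{1}{N}\sum_{n=1}^{N} \left(\phi_n\phi_n^{\top}\right)\otimes
\left(\operatorname{diag}(p_n) - p_n p_n^{\top}\right),\\
v^{\top}\!\left(\operatorname{diag}(p_n) - p_n p_n^{\top}\right)v
&= \sum_{k} p_{nk}\,v_k^{2} - \Big(\sum_{k} p_{nk}\,v_k\Big)^{2} \;\ge\; 0 .
\end{aligned}
```

The last line is the variance of v_k under the distribution p_n, so each softmax factor is positive semi-definite. Each \phi_n\phi_n^{\top} is also positive semi-definite, the Kronecker product of two such matrices is positive semi-definite, and so is their average H. The variance vanishes when v is constant across classes, which is why the Hessian is only semi-definite rather than strictly positive definite.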

This review was generated by AI and checked by a human editor.