QUICK REVIEW

[Paper Review] A block coordinate descent optimizer for classification problems exploiting convexity

Ravi G. Patel, Nathaniel Trask|arXiv (Cornell University)|Jan 1, 2020

3D Shape Modeling and Analysis | References: 25 | Cited by: 3

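Summary in brief

This paper proposes a hybrid Newton/gradient descent (NGD) optimizer for deep-learning classification that exploits the convexity of the cross-entropy loss in the weights of the linear layer. It alternates Newton steps on the linear layer (which guarantee global optimality for that sub-problem) with gradient descent on the hidden layers, which accelerates convergence and improves test accuracy. On CIFAR-10 it converges up to 4x faster, and with a ConvNet architecture the final test accuracy improves by 1.76%.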

ABSTRACT

Second-order optimizers hold intriguing potential for deep learning, but suffer from increased cost and sensitivity to the non-convexity of the loss surface as compared to gradient-based approaches. We introduce a coordinate descent method to train deep neural networks for classification tasks that exploits global convexity of the cross-entropy loss in the weights of the linear layer. Our hybrid Newton/Gradient Descent (NGD) method is consistent with the interpretation of hidden layers as providing an adaptive basis and the linear layer as providing an optimal fit of the basis to data. By alternating between a second-order method to find globally optimal parameters for the linear layer and gradient descent to train the hidden layers, we ensure an optimal fit of the adaptive basis to data throughout training. The size of the Hessian in the second-order step scales only with the number of weights in the linear layer and not the depth and width of the hidden layers; furthermore, the approach is applicable to arbitrary hidden layer architecture. Previous work applying this adaptive basis perspective to regression problems demonstrated significant improvements in accuracy at reduced training cost, and this work can be viewed as an extension of this approach to classification problems. We first prove that the resulting Hessian matrix is symmetric semi-definite, and that the Newton step realizes a global minimizer. By studying classification of manufactured two-dimensional point cloud data, we demonstrate both an improvement in validation error and a striking qualitative difference in the basis functions encoded in the hidden layer when trained using NGD. Application to image classification benchmarks for both dense and convolutional architectures reveals improved training accuracy, suggesting possible gains of second-order methods over gradient descent.

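Motivation and objectives

Develop a second-order optimization method that exploits convexity in the linear layer of deep neural networks.
Reduce training cost and speed up convergence by separating the optimization of the linear and nonlinear weights.
Test whether a second-order method can outperform stochastic gradient descent (SGD) in accuracy and convergence on classification tasks.
Investigate how the choice of optimization scheme affects the basis functions learned by the hidden layers.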

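Proposed method

Uses block coordinate descent, alternating Newton steps on the linear-layer weights W with gradient descent on the hidden-layer weights ξ.
With ξ fixed, the loss is convex in W, so Newton's method with a line search can reach a global minimum.
Computing the Hessian depends only on the number of linear-layer weights, not on the depth or width of the hidden layers.
The Newton step is applied on mini-batches to maintain computational efficiency and stability.
Implemented in TensorFlow and released as open source at github.com/rgp62/.
The hidden layers are interpreted as a data-driven adaptive basis, with the linear-layer weights providing the optimal fit of that basis to the data.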
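The alternation described above can be illustrated with a minimal NumPy sketch. This is not the authors' TensorFlow implementation. It assumes a single tanh hidden layer, full-batch updates instead of mini-batches, one line-searched Newton step per outer iteration, and a small ridge term added to the Hessian so the linear solve stays well posed. All function names and parameters (`hidden`, `newton_step`, `gd_step`, `train_ngd`, `ridge`, `lr`) are hypothetical and chosen for illustration only.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)        # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def hidden(X, xi):
    """Hidden layer acting as an adaptive basis: Phi = tanh(X @ A + b)."""
    A, b = xi
    return np.tanh(X @ A + b)

def loss(W, Phi, Y):
    """Mean cross-entropy for one-hot labels Y and logits Phi @ W."""
    P = softmax(Phi @ W)
    return -np.mean(np.sum(Y * np.log(P + 1e-12), axis=1))

def newton_step(W, Phi, Y, ridge=1e-6):
    """One line-searched Newton step on W with the basis Phi held fixed."""
    n, m = Phi.shape
    k = Y.shape[1]
    P = softmax(Phi @ W)
    g = (Phi.T @ (P - Y) / n).reshape(-1)       # gradient, flattened to (m*k,)
    # Per-sample softmax curvature: S[i] = diag(P[i]) - outer(P[i], P[i])
    S = np.einsum('ic,cd->icd', P, np.eye(k)) - np.einsum('ic,id->icd', P, P)
    # Hessian only has (m*k)^2 entries, independent of the hidden-layer depth.
    H = np.einsum('ia,ib,icd->acbd', Phi, Phi, S).reshape(m * k, m * k) / n
    # H is only positive semi-definite; the ridge (an assumption) regularises it.
    d = -np.linalg.solve(H + ridge * np.eye(m * k), g).reshape(m, k)
    t, f0, slope = 1.0, loss(W, Phi, Y), g @ d.reshape(-1)
    while loss(W + t * d, Phi, Y) > f0 + 1e-4 * t * slope and t > 1e-8:
        t *= 0.5                                # Armijo backtracking line search
    return W + t * d

def gd_step(W, xi, X, Y, lr=0.1):
    """One gradient-descent step on the hidden parameters with W held fixed."""
    A, b = xi
    Phi = hidden(X, xi)
    P = softmax(Phi @ W)
    d_pre = ((P - Y) @ W.T) * (1.0 - Phi**2) / len(X)   # backprop through tanh
    return A - lr * (X.T @ d_pre), b - lr * d_pre.sum(axis=0)

def train_ngd(X, Y, width=8, iters=50, seed=0):
    """Alternate a Newton step on W with a gradient step on xi = (A, b)."""
    rng = np.random.default_rng(seed)
    xi = (rng.normal(size=(X.shape[1], width)), np.zeros(width))
    W = np.zeros((width, Y.shape[1]))
    history = []
    for _ in range(iters):
        W = newton_step(W, hidden(X, xi), Y)    # convex sub-problem in W
        xi = gd_step(W, xi, X, Y)               # non-convex sub-problem in xi
        history.append(loss(W, hidden(X, xi), Y))
    return W, xi, history
```

In this sketch the Hessian is assembled explicitly, which is only reasonable because it involves the linear-layer weights alone. A closer analogue of the method as summarized here would draw a mini-batch for each Newton step and could repeat the step until a tolerance is met.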

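Experimental results

Research questions

RQ1: Does exploiting convexity in the linear-layer weights enable faster and more accurate training of deep neural networks on classification tasks?
RQ2: Does the NGD optimizer outperform standard stochastic gradient descent (SGD) in convergence speed and final accuracy?
RQ3: What qualitative differences arise in the basis functions encoded by the hidden layers when training with NGD versus GD?
RQ4: How does the tolerance of the Newton step affect the model's generalization and robustness?
RQ5: Can second-order optimization be applied efficiently to deep networks without a significant increase in computational cost?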

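Key findings

On the CIFAR-10 benchmark, NGD reached its maximum validation accuracy in roughly one quarter of the iterations required by GD.
With a ConvNet architecture on CIFAR-10, NGD improved final test accuracy by 1.76% over GD.
On the MNIST, Fashion MNIST, and peaks benchmarks, NGD also reached high validation accuracy faster than GD.
The basis functions learned with NGD show far more regular and structured patterns than those produced by GD, suggesting a qualitative difference in how the parameter space is explored.
The Hessian with respect to the linear-layer weights was proven to be symmetric and positive semi-definite, which supports the claim that Newton's method attains a global minimizer.
The method showed consistent improvements across architectures, including dense and convolutional networks, without requiring any architectural changes.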
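The positive semi-definiteness finding above can be checked with a short, standard derivation for softmax cross-entropy. The notation is an assumption made for this sketch and may differ from the paper's: $\phi_i \in \mathbb{R}^m$ is the hidden-layer output for sample $i$ with ξ fixed, $W \in \mathbb{R}^{m \times k}$ holds the linear-layer weights, $y_i$ is a one-hot label, and $p_i = \mathrm{softmax}(W^\top \phi_i)$.

```latex
\begin{align}
L(W) &= -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{k} y_{ic} \log p_{ic},
\qquad
\nabla_W L = \frac{1}{N} \sum_{i=1}^{N} \phi_i \, (p_i - y_i)^\top . \\
\intertext{For any direction $V \in \mathbb{R}^{m \times k}$, write $u_i = V^\top \phi_i \in \mathbb{R}^k$. The second directional derivative is}
\langle V, \nabla_W^2 L \, V \rangle
  &= \frac{1}{N} \sum_{i=1}^{N} u_i^\top \left( \operatorname{diag}(p_i) - p_i p_i^\top \right) u_i \\
  &= \frac{1}{N} \sum_{i=1}^{N} \left[ \sum_{c=1}^{k} p_{ic} u_{ic}^2 - \Big( \sum_{c=1}^{k} p_{ic} u_{ic} \Big)^{2} \right] \;\ge\; 0 .
\end{align}
```

Each bracketed term is the variance of $u_i$ under the distribution $p_i$, so it is non-negative and the loss is convex in $W$. The Hessian is only semi-definite rather than definite: shifting every column of $W$ by the same vector leaves the softmax unchanged, so that direction has zero curvature.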

This review was generated by AI and checked by a human editor.