QUICK REVIEW

[Paper Review] Practical Quasi-Newton Methods for Training Deep Neural Networks

Donald Goldfarb, Yi Ren|arXiv (Cornell University)|Jun 16, 2020

Stochastic Gradient Optimization Techniques | References: 43 | Citations: 39

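One-line summary

This paper develops practical stochastic quasi-Newton methods for training deep neural networks, using Kronecker-factored block-diagonal BFGS/L-BFGS updates and a double damping scheme, and achieves performance competitive with or better than KFAC and first-order methods.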

ABSTRACT

We consider the development of practical stochastic quasi-Newton, and in particular Kronecker-factored block-diagonal BFGS and L-BFGS methods, for training deep neural networks (DNNs). In DNN training, the number of variables and components of the gradient $n$ is often of the order of tens of millions and the Hessian has $n^2$ elements. Consequently, computing and storing a full $n \times n$ BFGS approximation or storing a modest number of (step, change in gradient) vector pairs for use in an L-BFGS implementation is out of the question. In our proposed methods, we approximate the Hessian by a block-diagonal matrix and use the structure of the gradient and Hessian to further approximate these blocks, each of which corresponds to a layer, as the Kronecker product of two much smaller matrices. This is analogous to the approach in KFAC, which computes a Kronecker-factored block-diagonal approximation to the Fisher matrix in a stochastic natural gradient method. Because of the indefinite and highly variable nature of the Hessian in a DNN, we also propose a new damping approach to keep the upper as well as the lower bounds of the BFGS and L-BFGS approximations bounded. In tests on autoencoder feed-forward neural network models with either nine or thirteen layers applied to three datasets, our methods outperformed or performed comparably to KFAC and state-of-the-art first-order stochastic methods.
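The storage argument in the abstract can be made concrete with a small count of matrix entries. The sketch below compares a dense n x n matrix, a block-diagonal matrix with one block per layer, and a pair of Kronecker factors per layer. The layer widths are hypothetical and chosen only for illustration; they are not taken from the paper, and biases are ignored.

```python
# Hypothetical (d_in, d_out) layer widths for illustration only; not from the paper.
layer_dims = [(784, 1000), (1000, 500), (500, 250), (250, 30)]

def storage_counts(dims):
    """Count stored entries for three curvature representations."""
    # Total number of weights across all layers.
    n = sum(d_in * d_out for d_in, d_out in dims)
    # Dense n x n curvature matrix.
    full = n * n
    # One dense block per layer on the diagonal.
    block_diag = sum((d_in * d_out) ** 2 for d_in, d_out in dims)
    # Two small factors per layer: d_in x d_in and d_out x d_out.
    kron = sum(d_in ** 2 + d_out ** 2 for d_in, d_out in dims)
    return n, full, block_diag, kron

n, full, block_diag, kron = storage_counts(layer_dims)
print(f"weights n          : {n:,}")
print(f"dense n x n        : {full:,}")
print(f"block-diagonal     : {block_diag:,}")
print(f"Kronecker factors  : {kron:,}")
```

For these assumed widths the Kronecker factors need only a small multiple of the storage used by the weights themselves, which is the property that makes a per-layer quasi-Newton approximation affordable.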

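Motivation and objectives

Motivate the use of second-order information in training deep neural networks (DNNs), given their high dimensionality.
Propose scalable Kronecker-factored block-diagonal BFGS/L-BFGS updates for approximating the Hessian.
Develop a damping strategy that maintains positive definiteness and keeps the eigenvalues of the approximations bounded in nonconvex DNNs.
Introduce Hessian-action BFGS for layer-wise Hessian approximation, together with Levenberg-Marquardt (LM) damping to handle singularity.
Provide convergence guarantees for the proposed stochastic quasi-Newton methods and demonstrate their empirical performance on DNNs.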

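Proposed method

Represent the Hessian as a block-diagonal matrix in which each block corresponds to one layer, and approximate each block as the Kronecker product of two small matrices (A_l and G_l).
Update H_g^l, the inverse-Hessian block associated with the gradient with respect to h_l, using damped BFGS or L-BFGS so that positive definiteness is guaranteed.
Update the A_l block using Hessian-action BFGS with an LM damping term to handle singularity, i.e., A_l^{LM} = A_l + λ_A I.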
Combine the updates to form the step W_l^+, using vec(W_l^+) − vec(W_l) = −α (H_g^l ⊗ H_a^l) vec(\tilde{∇}f_l), which amounts to Kronecker-structured preconditioning of the layer's gradient estimate.
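Introduce a double damping (DD) scheme that bounds both ratios y^T H y / s^T y and s^T s / s^T y, keeping the BFGS updates stable in the stochastic setting.
Provide a convergence analysis within the stochastic quasi-Newton framework, and discuss a loop-free L-BFGS implementation for efficiency on GPUs.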
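The step formula in the list above never requires forming the Kronecker product explicitly. The following NumPy sketch checks that applying (H_g ⊗ H_a) to the vectorized gradient equals the two small matrix products H_g · grad · H_a. Several things here are assumptions for illustration: vec is taken as row-major flattening (NumPy's default), the layer sizes and random data are hypothetical, A is built as an average of outer products of made-up activation vectors, and H_a is obtained by explicitly inverting the LM-damped factor A + λ_A I as a stand-in for the Hessian-action BFGS update described in the review.

```python
import numpy as np

def lm_damped_inverse(A, lam):
    """Inverse of the LM-damped factor A + lam * I (explicit inverse as a stand-in)."""
    d = A.shape[0]
    H = np.linalg.inv(A + lam * np.eye(d))
    return 0.5 * (H + H.T)  # symmetrize against round-off

def kron_preconditioned_step(W, grad, H_g, H_a, alpha):
    """Return W+ = W - alpha * H_g @ grad @ H_a without forming H_g kron H_a."""
    return W - alpha * (H_g @ grad @ H_a)

rng = np.random.default_rng(0)
d_out, d_in = 4, 3  # hypothetical layer sizes
W = rng.standard_normal((d_out, d_in))
grad = rng.standard_normal((d_out, d_in))

# A built from only two activation samples is rank-deficient, hence singular.
acts = rng.standard_normal((2, d_in))
A = acts.T @ acts / acts.shape[0]
H_a = lm_damped_inverse(A, lam=0.1)

# Any symmetric positive definite matrix serves as H_g for this check.
M = rng.standard_normal((d_out, d_out))
H_g = np.linalg.inv(M @ M.T + np.eye(d_out))
H_g = 0.5 * (H_g + H_g.T)

alpha = 0.1
W_plus = kron_preconditioned_step(W, grad, H_g, H_a, alpha)

# Explicit Kronecker form with row-major vec, for comparison only.
explicit = -alpha * np.kron(H_g, H_a) @ grad.ravel()
print(np.allclose((W_plus - W).ravel(), explicit))
```

With column-major vec the factor order in the Kronecker product would be reversed, so the convention should be matched to the paper before reusing this identity.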

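Experimental results

Research questions

RQ1: Can exploiting layer-wise Kronecker structure make stochastic quasi-Newton methods practical for training large-scale DNNs?
RQ2: In nonconvex, stochastic training settings, does the double damping scheme guarantee positive definiteness and bounded eigenvalues?
RQ3: How do K-BFGS and K-BFGS(L) compare with KFAC and first-order methods in training efficiency and generalization on standard autoencoder benchmarks?
RQ4: Under standard stochastic optimization assumptions, what is the convergence behavior of the Kronecker-factored stochastic quasi-Newton methods?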

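Key findings

Thanks to the layer-wise Kronecker factorization, K-BFGS and K-BFGS(L) provide second-order information while keeping storage and per-iteration cost comparable to those of first-order methods.
Compared with first-order methods, K-BFGS and K-BFGS(L) show favorable training and test performance, and in many cases match or exceed KFAC.
Combining Hessian-action BFGS with LM damping for the A_l blocks yields stable updates even when A_l is singular or ill-conditioned.
The double damping procedure maintains positive definiteness and eigenvalue bounds, improving robustness in stochastic nonconvex optimization.
In experiments on MNIST, FACES, and CURVES, the proposed methods are better than or comparable to KFAC and first-order methods in training loss and test error, and generalize well.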
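The role of double damping in keeping the BFGS update positive definite can be illustrated with a short sketch. It assumes the scheme consists of two consecutive Powell-style damping steps: the first modifies s so that s^T y is at least mu1 · y^T H y, the second modifies y so that s^T y is at least mu2 · s^T s. The function names, the thresholds mu1 and mu2, and the exact formulas are assumptions for illustration and may differ from the paper's algorithm.

```python
import numpy as np

def double_damping(s, y, H, mu1=0.2, mu2=0.001):
    """Two Powell-style damping steps (illustrative form, not the paper's exact algorithm)."""
    # Step 1: damp s toward H @ y so that s^T y >= mu1 * y^T H y.
    Hy = H @ y
    yHy = y @ Hy
    sy = s @ y
    theta1 = (1 - mu1) * yHy / (yHy - sy) if sy < mu1 * yHy else 1.0
    s_t = theta1 * s + (1 - theta1) * Hy
    # Step 2: damp y toward s_t so that s_t^T y >= mu2 * s_t^T s_t.
    ss = s_t @ s_t
    sy = s_t @ y
    theta2 = (1 - mu2) * ss / (ss - sy) if sy < mu2 * ss else 1.0
    y_t = theta2 * y + (1 - theta2) * s_t
    return s_t, y_t

def bfgs_inverse_update(H, s, y):
    """Standard BFGS update of the inverse approximation H; requires s^T y > 0."""
    rho = 1.0 / (s @ y)
    V = np.eye(len(s)) - rho * np.outer(s, y)
    return V @ H @ V.T + rho * np.outer(s, s)

# A pair with negative curvature (s^T y < 0), as can occur in nonconvex training.
H = np.eye(3)
s = np.array([1.0, 0.0, 0.0])
y = np.array([-1.0, 0.5, 0.0])

s_t, y_t = double_damping(s, y, H)
H_new = bfgs_inverse_update(H, s_t, y_t)
print("s^T y before:", s @ y, "after:", s_t @ y_t)
print("min eigenvalue of updated H:", np.linalg.eigvalsh(H_new).min())
```

Without damping, the raw pair above has s^T y < 0, for which the BFGS update does not preserve positive definiteness; the damped pair restores a positive curvature condition before the update is applied.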


This review was generated by AI and checked by a human editor.