QUICK REVIEW

[Paper Review] Improving Generalization Performance by Switching from Adam to SGD

Nitish Shirish Keskar, Richard Socher|arXiv (Cornell University)|Dec 20, 2017

Neural Networks and Applications|References: 3|Citations: 403

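One-line summary

This paper proposes SWATS, an automatic hybrid optimization method that starts training with Adam and switches to SGD once a criterion based on projecting Adam steps onto the gradient subspace is met, improving generalization on multiple tasks.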

ABSTRACT

Despite superior training outcomes, adaptive optimization methods such as Adam, Adagrad or RMSprop have been found to generalize poorly compared to Stochastic gradient descent (SGD). These methods tend to perform well in the initial portion of training but are outperformed by SGD at later stages of training. We investigate a hybrid strategy that begins training with an adaptive method and switches to SGD when appropriate. Concretely, we propose SWATS, a simple strategy which switches from Adam to SGD when a triggering condition is satisfied. The condition we propose relates to the projection of Adam steps on the gradient subspace. By design, the monitoring process for this condition adds very little overhead and does not increase the number of hyperparameters in the optimizer. We report experiments on several standard benchmarks such as: ResNet, SENet, DenseNet and PyramidNet for the CIFAR-10 and CIFAR-100 data sets, ResNet on the tiny-ImageNet data set and language modeling with recurrent networks on the PTB and WT2 data sets. The results show that our strategy is capable of closing the generalization gap between SGD and Adam on a majority of the tasks.

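Motivation and objectives

Motivate the work with the generalization gap between adaptive methods (Adam) and SGD.
Propose a hybrid training strategy that combines Adam's fast initial progress with SGD's generalization.
Develop an automatic switching mechanism that requires no additional hyperparameters.
Demonstrate the method on image classification and language modeling benchmarks.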

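Proposed method

Define SWATS as a two-phase optimization method that starts with Adam and switches to SGD when a projection-based criterion fires.
Compute the Adam step p_k and the gradient g_k, and derive an SGD learning rate gamma_k from a non-orthogonal projection so that the SGD direction agrees with the Adam step.
Maintain an exponential moving average lambda_k of gamma_k to estimate the SGD rate to use after the switch.
Trigger the switch when |lambda_k/(1-beta2^k) - gamma_k| < epsilon, which gives the SGD learning rate Lambda = lambda_k/(1-beta2^k).
Introduce no hyperparameters beyond those already in Adam, and use the bias-corrected, momentum-based Adam update before the switch.
Evaluate SWATS against SGD and Adam on ResNet, DenseNet, PyramidNet, and SENet for CIFAR-10/100, on Tiny-ImageNet, and on language models for PTB and WT2.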
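
The Adam phase and the switch check described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the dictionary-based state, the textbook form of the Adam bias correction, the default values, and the toy quadratic objective are all hypothetical choices made for this example. The `k > 1` guard is added because at k = 1 the bias-corrected average equals gamma_1 exactly, so the criterion would fire trivially.

```python
import numpy as np


def projected_sgd_rate(p, grad):
    """Return gamma such that projecting -gamma * grad onto p gives p itself."""
    pg = float(p @ grad)
    if pg == 0.0:  # step orthogonal to the gradient: no rate can be derived
        return None
    return float(p @ p) / (-pg)


def init_state(dim):
    # Hypothetical state layout: step count, Adam moments, and the average of gamma.
    return {"k": 0, "m": np.zeros(dim), "a": np.zeros(dim), "lam": 0.0}


def adam_step_with_monitor(w, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-9):
    """One bias-corrected Adam step plus the projection-based switch check."""
    state["k"] += 1
    k = state["k"]
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["a"] = beta2 * state["a"] + (1 - beta2) * grad ** 2
    m_hat = state["m"] / (1 - beta1 ** k)
    a_hat = state["a"] / (1 - beta2 ** k)
    p = -lr * m_hat / (np.sqrt(a_hat) + eps)  # Adam step p_k

    gamma = projected_sgd_rate(p, grad)
    switch, big_lambda = False, None
    if gamma is not None:
        # Exponential moving average lambda_k of gamma_k, reusing beta2.
        state["lam"] = beta2 * state["lam"] + (1 - beta2) * gamma
        corrected = state["lam"] / (1 - beta2 ** k)
        if k > 1 and abs(corrected - gamma) < eps:
            switch, big_lambda = True, corrected  # Lambda for the SGD phase
    return w + p, switch, big_lambda


if __name__ == "__main__":
    # Toy objective (assumption): f(w) = 0.5 * sum(scales * w**2), gradient scales * w.
    scales = np.array([1.0, 10.0])
    w = np.array([1.0, -2.0])
    state = init_state(w.size)
    for _ in range(200):
        w, switch, big_lambda = adam_step_with_monitor(w, scales * w, state)
        if switch:
            print(f"switch at step {state['k']} with Lambda = {big_lambda:.6g}")
            break
    else:
        print("criterion not met within 200 steps")
```

Once `switch` is true, training would continue with SGD at learning rate `big_lambda`. The SGD phase is left out here because the summary above does not spell out its update rule.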

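Experimental results

Research questions

RQ1: Can a hybrid optimization method that combines Adam and SGD keep Adam's fast initial progress while achieving generalization close to that of SGD?
RQ2: What automatic switching criterion determines a suitable switchover point without adding hyperparameters?
RQ3: How does SWATS perform compared with pure Adam and pure SGD across diverse image classification and language modeling tasks?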

Key findings

Model	Dataset	SGDM LR	Adam LR	SWATS LR	Lambda (estimated SGD LR)	Switchover point (epochs)
ResNet-32	CIFAR-10	0.1	0.001	0.001	0.52	1.37
DenseNet	CIFAR-10	0.1	0.001	0.001	0.79	11.54
PyramidNet	CIFAR-10	0.1	0.001	0.0007	0.85	4.94
SENet	CIFAR-10	0.1	0.001	0.001	0.54	24.19
ResNet-32	CIFAR-100	0.3	0.002	0.002	1.22	10.42
DenseNet	CIFAR-100	0.1	0.001	0.001	0.51	11.81
PyramidNet	CIFAR-100	0.1	0.001	0.001	0.76	18.54
SENet	CIFAR-100	0.1	0.001	0.001	1.39	2.04
LSTM	PTB	55†	0.003	0.003	7.52	186.03
QRNN	PTB	35†	0.002	0.002	4.61	184.14
LSTM	WT-2	60†	0.003	0.003	1.11	259.47
QRNN	WT-2	60†	0.003	0.004	14.46	295.71

Across multiple architectures and data sets, SWATS generally performs on par with the better of SGD and Adam.
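The switch often happens within the first 20 epochs on the CIFAR data sets and at around epoch 49 on Tiny-ImageNet; performance degrades temporarily at the switch and then recovers.
The post-switch learning rate Lambda is in line with the tuned SGD rates across tasks (as shown in Table 1).
Adam makes strong initial progress but generalizes worse than SGD; SWATS closes this gap by switching at an informed point.
On language modeling tasks, SWATS achieves generalization comparable to Adam and may need fewer training epochs to reach peak performance.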
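
The statement about early switching on CIFAR can be checked directly against the table. The short script below is only a convenience sketch: the row tuples are copied from the table above, and the helper name and the 20-epoch threshold parameter are illustrative assumptions.

```python
# (model, dataset, switchover point in epochs), copied from the table above
ROWS = [
    ("ResNet-32", "CIFAR-10", 1.37),
    ("DenseNet", "CIFAR-10", 11.54),
    ("PyramidNet", "CIFAR-10", 4.94),
    ("SENet", "CIFAR-10", 24.19),
    ("ResNet-32", "CIFAR-100", 10.42),
    ("DenseNet", "CIFAR-100", 11.81),
    ("PyramidNet", "CIFAR-100", 18.54),
    ("SENet", "CIFAR-100", 2.04),
    ("LSTM", "PTB", 186.03),
    ("QRNN", "PTB", 184.14),
    ("LSTM", "WT-2", 259.47),
    ("QRNN", "WT-2", 295.71),
]


def early_switch_count(rows, dataset_prefix="CIFAR", max_epoch=20.0):
    """Count runs on matching data sets whose switch happens by max_epoch."""
    selected = [r for r in rows if r[1].startswith(dataset_prefix)]
    early = [r for r in selected if r[2] <= max_epoch]
    return len(early), len(selected)


if __name__ == "__main__":
    early, total = early_switch_count(ROWS)
    print(f"{early} of {total} CIFAR runs switch within the first 20 epochs")
```

The only CIFAR run in the table that switches later than epoch 20 is SENet on CIFAR-10 (24.19).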

This review was generated by AI and checked by a human editor.