QUICK REVIEW

[Paper Review] Descending through a Crowded Valley - Benchmarking Deep Learning Optimizers

Robin M. Schmidt, Frank Schneider|arXiv (Cornell University)|Jul 3, 2020

Mobile Crowdsensing and Crowdsourcing|Cited by 48

One-Line Summary

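This paper benchmarks 15 popular deep learning optimizers across 8 tasks, 4 tuning budgets, and 4 learning rate schedules. It finds that optimizer performance is task-dependent and that evaluating multiple optimizers with default hyperparameters can work about as well as tuning a single optimizer. Adam remains a strong baseline, but no single method dominates across all tasks.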

ABSTRACT

Choosing the optimizer is considered to be among the most crucial design decisions in deep learning, and it is not an easy one. The growing literature now lists hundreds of optimization methods. In the absence of clear theoretical guidance and conclusive empirical evidence, the decision is often made based on anecdotes. In this work, we aim to replace these anecdotes, if not with a conclusive ranking, then at least with evidence-backed heuristics. To do so, we perform an extensive, standardized benchmark of fifteen particularly popular deep learning optimizers while giving a concise overview of the wide range of possible choices. Analyzing more than $50,000$ individual runs, we contribute the following three points: (i) Optimizer performance varies greatly across tasks. (ii) We observe that evaluating multiple optimizers with default parameters works approximately as well as tuning the hyperparameters of a single, fixed optimizer. (iii) While we cannot discern an optimization method clearly dominating across all tested tasks, we identify a significantly reduced subset of specific optimizers and parameter choices that generally lead to competitive results in our experiments: Adam remains a strong contender, with newer methods failing to significantly and consistently outperform it. Our open-sourced results are available as challenging and well-tuned baselines for more meaningful evaluations of novel optimization methods without requiring any further computational efforts.

Motivation and Objectives

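Evaluate how optimizer choice and hyperparameter tuning affect training performance in deep learning.
Provide empirical, evidence-backed guidelines for choosing an optimizer in practice.
Provide an open, extensible set of baseline results for evaluating future optimizers and hyperparameter strategies.
Highlight the comparison between default parameters and tuned settings across problems.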

Proposed Method

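Benchmark 15 popular first-order optimizers on 8 problems from DeepOBS.
Evaluate four tuning budgets (one-shot, small, medium, large) using random hyperparameter search.
Apply four learning rate schedules (constant, cosine, cosine with warm restarts, trapezoidal).
Collect 53,760 training curves using multiple seeds and performance metrics.
Provide open-access results and baseline curves for future benchmarking.
Analyze how performance depends on the problem, the budget, and the schedule.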
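The four schedule families named in the list above can be sketched as plain functions of the training step. This is a minimal illustration, not the benchmark's implementation: the function names, the uniform `(step, total_steps, base_lr)` signature, the fixed restart cycle length, and the warm-up and cool-down fractions are all assumptions chosen for readability.

```python
import math


def constant(step, total_steps, base_lr):
    """Constant schedule: the learning rate never changes."""
    return base_lr


def cosine_decay(step, total_steps, base_lr, min_lr=0.0):
    """Cosine decay from base_lr at step 0 down to min_lr at total_steps."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def cosine_warm_restarts(step, total_steps, base_lr, min_lr=0.0, cycle_len=10):
    """Cosine decay that jumps back to base_lr every cycle_len steps.

    Assumption: fixed-length cycles. total_steps is unused and kept only
    so that all four schedules share the same signature.
    """
    progress = (step % cycle_len) / cycle_len
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def trapezoidal(step, total_steps, base_lr, warmup_frac=0.1, cooldown_frac=0.1):
    """Linear warm-up, constant plateau, then linear cool-down to zero.

    Assumption: the 10% warm-up and cool-down fractions are illustrative.
    """
    warmup_end = warmup_frac * total_steps
    cooldown_start = (1.0 - cooldown_frac) * total_steps
    if step < warmup_end:
        return base_lr * step / warmup_end
    if step > cooldown_start:
        return base_lr * max(0.0, total_steps - step) / (total_steps - cooldown_start)
    return base_lr


if __name__ == "__main__":
    schedules = [constant, cosine_decay, cosine_warm_restarts, trapezoidal]
    for schedule in schedules:
        values = [round(schedule(s, 100, 0.1), 4) for s in (0, 25, 50, 75, 100)]
        print(schedule.__name__, values)
```

Keeping one signature for all four schedules makes it easy to swap them inside a training loop while holding the optimizer and its base learning rate fixed, which is the kind of comparison the benchmark describes.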

Experimental Results

Research Questions

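RQ1: Does optimizer performance generalize across different deep learning tasks, or is it highly problem-dependent?
RQ2: How does the tuning budget affect the relative performance of optimizers compared with using their defaults?
RQ3: Is there a single optimizer that dominates on all tested tasks, or does the winner differ from problem to problem?
RQ4: Do untuned defaults remain competitive with tuned hyperparameters and learning rate schedules?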

Key Findings

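Optimizer performance varies substantially across tasks, and no optimizer wins universally on all 8 problems.
Evaluating multiple optimizers with default hyperparameters is often competitive with tuning a single optimizer.
Adding an untuned learning rate schedule helps on average, but the effect varies by optimizer and problem.
Adam (and its variants) generally remains a strong baseline, and newer methods do not consistently outperform it.
Some optimizers perform well on specific problems, but the results do not transfer uniformly across tasks.
The open-sourced results provide challenging, well-tuned baselines for future optimizer research.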
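To make the contrast between a one-shot run with defaults and tuning under a budget more concrete, here is a toy sketch of random search over a single learning rate. Everything in it is an assumption for illustration: the budget sizes in `BUDGETS`, the default learning rate, the log-uniform search range, and the `toy_final_loss` objective (plain gradient descent on f(w) = w^2) are stand-ins, not the benchmark's settings or problems.

```python
import math
import random

# Assumed budget sizes, for illustration only; this review does not state them.
BUDGETS = {"small": 25, "medium": 50, "large": 75}

# Hypothetical "default" learning rate used for the one-shot run.
DEFAULT_LR = 0.01


def sample_learning_rate(rng, low=1e-4, high=1.0):
    """Draw a learning rate log-uniformly from [low, high]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def toy_final_loss(lr, steps=50):
    """Hypothetical stand-in for a training run: gradient descent on f(w) = w^2."""
    w = 1.0
    for _ in range(steps):
        w -= lr * 2.0 * w  # gradient of w^2 is 2w
    return w * w


def random_search(num_trials, seed=0):
    """Return (best_loss, best_lr) over num_trials randomly sampled learning rates."""
    rng = random.Random(seed)
    trials = []
    for _ in range(num_trials):
        lr = sample_learning_rate(rng)
        trials.append((toy_final_loss(lr), lr))
    return min(trials)


if __name__ == "__main__":
    print("one-shot", toy_final_loss(DEFAULT_LR), DEFAULT_LR)
    for name, num_trials in BUDGETS.items():
        best_loss, best_lr = random_search(num_trials, seed=0)
        print(name, best_loss, best_lr)
```

Because every budget reuses the same seed, the smaller budgets are prefixes of the larger ones, so the best loss found can only stay the same or improve as the budget grows. A toy objective like this says nothing about which optimizer wins in practice; it only shows how the budget bookkeeping works.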

This review was generated by AI and checked by a human editor.