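[Paper Summary] Descending through a Crowded Valley - Benchmarking Deep Learning Optimizers
This paper benchmarks 15 popular deep learning optimizers on eight tasks, under four tuning budgets and four learning rate schedules. The results show that optimizer performance depends on the task, and that trying several optimizers with their default hyperparameters often works about as well as tuning a single optimizer. Adam remains a strong baseline, but no single method dominates across all tasks.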
Choosing the optimizer is considered to be among the most crucial design decisions in deep learning, and it is not an easy one. The growing literature now lists hundreds of optimization methods. In the absence of clear theoretical guidance and conclusive empirical evidence, the decision is often made based on anecdotes. In this work, we aim to replace these anecdotes, if not with a conclusive ranking, then at least with evidence-backed heuristics. To do so, we perform an extensive, standardized benchmark of fifteen particularly popular deep learning optimizers while giving a concise overview of the wide range of possible choices. Analyzing more than 50,000 individual runs, we contribute the following three points: (i) Optimizer performance varies greatly across tasks. (ii) We observe that evaluating multiple optimizers with default parameters works approximately as well as tuning the hyperparameters of a single, fixed optimizer. (iii) While we cannot discern an optimization method clearly dominating across all tested tasks, we identify a significantly reduced subset of specific optimizers and parameter choices that generally lead to competitive results in our experiments: Adam remains a strong contender, with newer methods failing to significantly and consistently outperform it. Our open-sourced results are available as challenging and well-tuned baselines for more meaningful evaluations of novel optimization methods without requiring any further computational efforts.
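Motivation and Goals
- Assess how the choice of optimizer and hyperparameter tuning affect deep learning training performance.
- Provide evidence-based guidance for choosing an optimizer in practice.
- Provide an open, extensible set of baseline results for evaluating future optimizers and hyperparameter strategies.
- Show how default parameters compare with tuned configurations across different problems.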
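Proposed Method
- Benchmark 15 popular first-order optimizers on eight DeepOBS problems.
- Evaluate four tuning budgets (one-shot, small, medium, large) using random hyperparameter search.
- Apply four learning rate schedules (constant, cosine decay, cosine with warm restarts, trapezoidal).
- Collect 53,760 training curves, covering multiple seeds and performance metrics.
- Release open-access results and baseline curves for future benchmarks.
- Analyze how performance depends on the problem, the budget, and the schedule.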
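As a rough illustration of the four schedule families named above, the following Python sketch defines each one as a plain function of the training step. The exact parametrization used in the paper is not given in this summary, so the function names and the `min_lr`, `cycle_len`, `warmup_frac`, and `cooldown_frac` parameters are assumptions chosen for illustration only. The last lines count the combinations of optimizer, problem, budget, and schedule implied by the numbers in the list.

```python
import math


def constant(step, total_steps, base_lr):
    """Constant schedule: the learning rate never changes."""
    return base_lr


def cosine_decay(step, total_steps, base_lr, min_lr=0.0):
    """Anneal from base_lr down to min_lr along half a cosine period."""
    progress = step / max(1, total_steps - 1)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def cosine_warm_restarts(step, total_steps, base_lr, cycle_len=10, min_lr=0.0):
    """Cosine decay that jumps back to base_lr every cycle_len steps.

    cycle_len is a hypothetical fixed cycle length, not a value from the paper.
    """
    position = (step % cycle_len) / cycle_len
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * position))


def trapezoidal(step, total_steps, base_lr, warmup_frac=0.1, cooldown_frac=0.1):
    """Linear warm-up, flat plateau, then linear cool-down.

    The warm-up and cool-down fractions are illustrative assumptions.
    """
    warmup = max(1, int(warmup_frac * total_steps))
    cooldown = max(1, int(cooldown_frac * total_steps))
    if step < warmup:
        return base_lr * (step + 1) / warmup
    if step >= total_steps - cooldown:
        return base_lr * (total_steps - step) / cooldown
    return base_lr


SCHEDULES = {
    "constant": constant,
    "cosine": cosine_decay,
    "cosine_warm_restarts": cosine_warm_restarts,
    "trapezoidal": trapezoidal,
}
BUDGETS = ("one-shot", "small", "medium", "large")
N_OPTIMIZERS, N_PROBLEMS = 15, 8

# Number of (optimizer, problem, budget, schedule) combinations.
# This is smaller than the 53,760 collected curves, since each combination
# is covered by more than one run (search trials and repeated seeds).
n_configs = N_OPTIMIZERS * N_PROBLEMS * len(BUDGETS) * len(SCHEDULES)

if __name__ == "__main__":
    print("combinations:", n_configs)
    for name, schedule in SCHEDULES.items():
        rates = [round(schedule(s, 100, 0.1), 4) for s in (0, 25, 50, 75, 99)]
        print(f"{name:>22}: {rates}")
```

Writing each schedule as a function of `(step, total_steps, base_lr)` keeps the sketch independent of any training framework; in practice a framework's built-in scheduler classes would normally be used instead.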
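Experimental Results
Research Questions
- RQ1: Does optimizer performance generalize across deep learning tasks, or is it highly problem-dependent?
- RQ2: How does the tuning budget affect the relative performance of optimizers, compared with using default settings?
- RQ3: Is there a single optimizer that dominates on all tested tasks, or does the winner change from problem to problem?
- RQ4: Do untuned default parameters remain competitive against tuned hyperparameters or schedules?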
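Key Findings
- Optimizer performance varies by task; there is no universal winner across all eight problems.
- Evaluating multiple optimizers with default hyperparameters is often as competitive as tuning a single optimizer.
- Adding an untuned learning rate schedule helps on average, but the effect varies by optimizer and problem.
- Adam (and its variants) generally remains a strong baseline, and newer methods do not consistently outperform it.
- Some optimizers perform well on specific problems, but those results do not transfer uniformly across tasks.
- The open-sourced results provide challenging, well-tuned baselines for future optimizer research.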
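The comparison behind the second finding can be made concrete with a small helper. The sketch below is only an illustration of the bookkeeping: the function names, the `results` layout, and the placeholder optimizers `opt_a`, `opt_b`, `opt_c` are hypothetical, and the scores are made-up values, not results from the paper.

```python
def best_of_defaults(results):
    """Best score from trying every optimizer once at its default setting."""
    return max(runs["default"] for runs in results.values())


def best_tuned_single(results, optimizer):
    """Best score from one fixed optimizer over all settings tried for it."""
    runs = results[optimizer]
    return max([runs["default"], *runs["tuned"]])


# Made-up placeholder scores (higher is better); NOT numbers from the paper.
toy_results = {
    "opt_a": {"default": 0.90, "tuned": [0.88, 0.92, 0.91]},
    "opt_b": {"default": 0.93, "tuned": [0.90, 0.94]},
    "opt_c": {"default": 0.85, "tuned": [0.89, 0.87]},
}

if __name__ == "__main__":
    baseline = best_of_defaults(toy_results)
    for name in toy_results:
        gap = best_tuned_single(toy_results, name) - baseline
        print(f"{name}: tuned-single minus best-default = {gap:+.2f}")
```

With real benchmark data, the same two quantities would be computed per problem from the released results, so that the gap between "tune one optimizer" and "try many defaults" can be inspected task by task.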
This summary was generated by AI and reviewed by a human editor.