QUICK REVIEW

[Paper Review] Generalization and Regularization in DQN

Jesse Farebrother, Marlos C. Machado|arXiv (Cornell University)|Sep 29, 2018

Reinforcement Learning in Robotics | References: 25 | Cited by: 100

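One-Sentence Summary

This paper uses different flavors (modes and difficulties) of Atari 2600 games to evaluate how well DQN generalizes. The results show that DQN overfits to the flavor it was trained on, and that regularization combined with fine-tuning yields more general, reusable representations that improve sample efficiency.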

ABSTRACT

Deep reinforcement learning algorithms have shown an impressive ability to learn complex control policies in high-dimensional tasks. However, despite the ever-increasing performance on popular benchmarks, policies learned by deep reinforcement learning algorithms can struggle to generalize when evaluated in remarkably similar environments. In this paper we propose a protocol to evaluate generalization in reinforcement learning through different modes of Atari 2600 games. With that protocol we assess the generalization capabilities of DQN, one of the most traditional deep reinforcement learning algorithms, and we provide evidence suggesting that DQN overspecializes to the training environment. We then comprehensively evaluate the impact of dropout and $\ell_2$ regularization, as well as the impact of reusing learned representations to improve the generalization capabilities of DQN. Despite regularization being largely underutilized in deep reinforcement learning, we show that it can, in fact, help DQN learn more general features. These features can be reused and fine-tuned on similar tasks, considerably improving DQN's sample efficiency.

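Motivation and Objectives

Evaluate how well DQN generalizes across subtly different flavors (modes and difficulties) of Atari 2600 games.
Quantify DQN's tendency to overfit to a single training flavor.
Evaluate how regularization techniques (dropout and L2) affect DQN's performance across flavors.
Investigate whether regularized representations can be reused and fine-tuned on related tasks to improve sample efficiency.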

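Proposed Method

Introduce a protocol in the ALE (Arcade Learning Environment) that tests generalization using Atari 2600 flavors (modes and difficulties).
Train DQN for 50M frames in the default flavor (m0d0) and evaluate it on the other flavors.
During training, apply dropout to the first four layers together with L2 weight regularization; select hyperparameters by grid search.
Compare the regularized variants against the unregularized baseline in each flavor.
After regularized pretraining, explore two transfer learning strategies: (i) fine-tuning the entire network, and (ii) fine-tuning only the first few layers.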
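
The two regularizers named above can be illustrated with a minimal, framework-free sketch. It is not the authors' implementation: the function names, the squared-error form of the TD loss, and any numeric values passed in are assumptions made for illustration, and the summary does not state the dropout rates or L2 coefficient chosen by the grid search.

```python
import random


def inverted_dropout(activations, drop_prob, rng, training=True):
    """Zero each unit with probability drop_prob and rescale the survivors.

    Rescaling by 1 / keep_prob keeps the expected activation unchanged,
    so nothing needs to be adjusted at evaluation time.
    """
    if not training or drop_prob == 0.0:
        return list(activations)
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


def l2_penalty(weights, l2_coeff):
    """L2 weight regularization: l2_coeff times the sum of squared weights."""
    return l2_coeff * sum(w * w for w in weights)


def regularized_td_loss(q_pred, td_target, weights, l2_coeff):
    """Squared TD error plus the L2 penalty (a simplified, assumed loss form)."""
    td_error = td_target - q_pred
    return td_error ** 2 + l2_penalty(weights, l2_coeff)
```

In a deep learning framework the same two ideas usually appear as a dropout layer after each regularized layer and a weight-decay term in the optimizer or loss; the plain-Python version is only meant to make the arithmetic explicit.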

Experimental Results

Research Questions

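RQ1: Do DQN policies trained on a single Atari flavor generalize to flavors that are visually or dynamically similar?
RQ2: Do traditional regularization techniques improve generalization across flavors, or make DQN's representations more reusable?
RQ3: Does pretraining with regularization lead to better fine-tuning performance on a new flavor than training from scratch?
RQ4: To what extent do regularized representations reduce sample complexity when transferring to a related task?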
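
The evaluation behind RQ1 can be sketched as a loop over flavors: train once in the default flavor, then score the same policy in every other mode and difficulty combination. The sketch below is an assumption-laden illustration, not the paper's code. The helper names are hypothetical, `evaluate_fn` stands in for whatever routine runs the trained policy in a given flavor, and the modes and difficulties actually available differ from game to game.

```python
from itertools import product


def flavor_name(mode, difficulty):
    """Build a flavor label such as 'm0d0' from a mode and a difficulty."""
    return f"m{mode}d{difficulty}"


def evaluation_flavors(modes, difficulties, train_flavor="m0d0"):
    """List every mode/difficulty combination except the training flavor."""
    names = (flavor_name(m, d) for m, d in product(modes, difficulties))
    return [name for name in names if name != train_flavor]


def evaluate_across_flavors(evaluate_fn, modes, difficulties,
                            train_flavor="m0d0"):
    """Score one trained policy in each held-out flavor.

    evaluate_fn is a caller-supplied (hypothetical) function that takes a
    flavor label and returns a score for the policy in that flavor.
    """
    return {flavor: evaluate_fn(flavor)
            for flavor in evaluation_flavors(modes, difficulties, train_flavor)}
```

Keeping the training flavor out of the returned dictionary makes it harder to mix in-distribution scores with the held-out scores that RQ1 is about.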

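Key Findings

DQN policies generalize poorly across flavors and overfit to the training flavor in several games (for example, Freeway).
Regularization during training (dropout + L2) improved cross-flavor evaluation in several cases and can improve sample efficiency, although on its own it does not guarantee generalization across flavors.
Regularized representations can serve as a better initialization for fine-tuning on a new flavor, often outperforming training from scratch at an equal or lower total number of training frames.
Fine-tuning the entire network after regularized pretraining yields significant gains in several games (notably HERO and Space Invaders), suggesting that general features were learned.
Fine-tuning only the first few layers after regularized pretraining also helps, suggesting that the features transfer to some degree on a layer-by-layer basis.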
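
A rough sketch of how a network might be initialized for the two fine-tuning strategies follows. The summary does not say what happens to the layers that are not carried over in the second strategy; this sketch assumes they start from a fresh random initialization, which is only one possible reading. The function name and the layer names used in the usage note are hypothetical.

```python
import copy


def init_for_finetuning(pretrained, fresh, layers_to_transfer):
    """Build the starting parameters for training on a new flavor.

    pretrained and fresh map layer names to parameter lists. Layers named
    in layers_to_transfer are copied from the pretrained network; every
    other layer keeps its fresh initialization (an assumption of this sketch).
    """
    params = {}
    for name in fresh:
        source = pretrained if name in layers_to_transfer else fresh
        # Deep copy so later updates never modify the pretrained weights.
        params[name] = copy.deepcopy(source[name])
    return params
```

Under these assumptions, passing every layer name (say `{"conv1", "conv2", "conv3", "fc1", "out"}`) corresponds to reusing the whole network, while passing only the early ones (say `{"conv1", "conv2", "conv3"}`) reuses just the first few layers.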


This review was generated by AI and reviewed by a human editor.