QUICK REVIEW

[Paper Review] Benchmarking Batch Deep Reinforcement Learning Algorithms

Scott Fujimoto, Edoardo Conti|arXiv (Cornell University)|Oct 3, 2019

Reinforcement Learning in Robotics | References: 42 | Cited by: 160

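ONE-SENTENCE SUMMARY

This paper benchmarks off-policy and batch DRL algorithms in a fixed-data Atari batch setting and introduces a discrete-action variant of BCQ that outperforms prior methods and typically matches or exceeds the behavioral policy.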

ABSTRACT

Widely-used deep reinforcement learning algorithms have been shown to fail in the batch setting--learning from a fixed data set without interaction with the environment. Following this result, there have been several papers showing reasonable performances under a variety of environments and batch settings. In this paper, we benchmark the performance of recent off-policy and batch reinforcement learning algorithms under unified settings on the Atari domain, with data generated by a single partially-trained behavioral policy. We find that under these conditions, many of these algorithms underperform DQN trained online with the same amount of data, as well as the partially-trained behavioral policy. To introduce a strong baseline, we adapt the Batch-Constrained Q-learning algorithm to a discrete-action setting, and show it outperforms all existing algorithms at this task.

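Motivation and Objectives

Evaluate how current off-policy and batch DRL algorithms perform under a unified Atari batch setting.
Assess extrapolation error and stability in discrete-action environments.
Establish a strong, simple baseline for discrete batch DRL in the fixed-data setting.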

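Proposed Method

Review and implement several batch DRL algorithms (QR-DQN, REM, BCQ, KL-Control, SPIBB-DQN) in a unified Atari setting, using a single batch of 10M transitions.
Adapt BCQ to discrete actions to serve as a strong baseline.
Diagnose extrapolation error through value estimates and an analysis of stability across games.
Compare against online DQN and the behavioral policy that generated the batch, across 9 Atari games.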
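
The discrete-action adaptation of BCQ can be illustrated with a small sketch of its action-selection rule: actions that a behavior-cloning model considers unlikely, relative to the most likely action, are excluded before the greedy choice over Q-values. The function name, the array-based inputs, and the default threshold below are assumptions made for illustration; in practice the Q-values and action probabilities would come from trained networks, and this is not the authors' implementation.

```python
import numpy as np

def discrete_bcq_action(q_values, behavior_probs, threshold=0.3):
    """Pick the highest-valued action among those the behavior model finds likely.

    q_values:       estimated Q(s, a) for each action (hypothetical input).
    behavior_probs: estimated probability of each action under the
                    behavioral policy (hypothetical input).
    threshold:      illustrative cutoff on relative likelihood.
    """
    q_values = np.asarray(q_values, dtype=float)
    behavior_probs = np.asarray(behavior_probs, dtype=float)
    # Likelihood of each action relative to the most likely action.
    relative = behavior_probs / behavior_probs.max()
    allowed = relative > threshold
    # Disallowed actions get -inf so they can never win the argmax.
    masked_q = np.where(allowed, q_values, -np.inf)
    return int(np.argmax(masked_q))

# Action 1 has the highest Q-value but is rare in the data, so it is filtered out.
action = discrete_bcq_action([1.0, 5.0, 2.0], [0.6, 0.05, 0.35])
```

With a threshold of 0 the rule reduces to ordinary greedy Q-learning action selection, while a threshold close to 1 approaches pure imitation of the most likely behavioral action.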

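Experimental Results

Research Questions

RQ1: Do standard off-policy DRL methods perform well on Atari in a batch setting with a single behavioral policy?
RQ2: Can batch-constrained methods such as BCQ provide robust performance in discrete-action batch RL?
RQ3: How does extrapolation error manifest in discrete batch RL, and can distributional or constrained methods mitigate it?
RQ4: How does discrete-action BCQ perform relative to existing batch RL algorithms in this setting?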

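Key Findings

In the single-behavioral-policy batch setting, standard off-policy DRL algorithms fall behind both online DQN and the behavioral policy.
QR-DQN tends to outperform DQN but usually still falls short of the noisy behavioral policy.
Batch RL methods such as BCQ outperform the other methods and often match or exceed the noise-free behavioral policy.
KL-Control shows strong initial performance but is not robust across games; diverging value estimates cause it to fail in several cases.
The discrete-action BCQ variant achieves state-of-the-art results among the batch DRL methods tested in this setting.
Stable value estimates correlate with better batch learning performance, which underscores the importance of mitigating extrapolation error.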
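
One common way to check whether value estimates are stable is to compare them with the discounted returns actually realized along an evaluation episode. The sketch below shows that comparison under stated assumptions: the function names and inputs are hypothetical, and this is not necessarily the diagnostic procedure used in the paper.

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Discounted return from each time step to the end of the episode."""
    rewards = np.asarray(rewards, dtype=float)
    returns = np.zeros_like(rewards)
    running = 0.0
    # Accumulate backwards so each step reuses the return of the next step.
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def mean_overestimation(q_estimates, rewards, gamma=0.99):
    """Average gap between predicted values and realized discounted returns.

    q_estimates: value predicted for the action taken at each step
                 (hypothetical input logged during evaluation).
    """
    returns = discounted_returns(rewards, gamma)
    return float(np.mean(np.asarray(q_estimates, dtype=float) - returns))
```

A gap that is large and growing over training suggests the kind of diverging value estimates associated with extrapolation error, while a gap near zero suggests the estimates track what the policy actually achieves.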


This review was generated by AI and reviewed by a human editor.