QUICK REVIEW

[Paper Digest] Sample-Efficient Deep RL with Generative Adversarial Tree Search.

Kamyar Azizzadenesheli, Brandon Yang | arXiv (Cornell University) | Jun 15, 2018

Reinforcement Learning in Robotics | Cited by 11

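One-Sentence Summary

This paper proposes Generative Adversarial Tree Search (GATS), a deep reinforcement learning method designed for sample efficiency that combines a learned environment model, depth-limited Monte Carlo tree search (MCTS), and a deep Q-network (DQN) for value estimation. Despite theoretical advantages in the bias-variance trade-off and in robustness, GATS fails to outperform standard DQN on Atari environments, exposing the limitations of depth-limited MCTS planning on a learned model.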

ABSTRACT

While many recent advances in deep reinforcement learning (RL) rely on model-free methods, model-based approaches remain an alluring prospect for their potential to exploit unsupervised data to learn an environment model. In this work, we provide an extensive study on the design of deep generative models for RL environments and propose a sample-efficient and robust method to learn the model of Atari environments. We deploy this model and propose generative adversarial tree search (GATS), a deep RL algorithm that learns the environment model and implements Monte Carlo tree search (MCTS) on the learned model for planning. While MCTS on the learned model is computationally expensive, similar to AlphaGo, GATS follows depth-limited MCTS. GATS employs a deep Q-network (DQN) and learns a Q-function to assign values to the leaves of the tree in MCTS. We theoretically analyze GATS vis-a-vis the bias-variance trade-off and show GATS is able to mitigate the worst-case error in the Q-estimate. While we were expecting GATS to enjoy better sample complexity and faster convergence to better policies, surprisingly, GATS fails to outperform DQN. We provide a study in which we show why depth-limited MCTS fails to perform desirably.

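Research Motivation and Objectives

Develop a sample-efficient deep RL algorithm that exploits unsupervised data to learn an environment model.
Investigate how effective it is to combine model-based planning with a deep Q-network inside a tree-search framework.
Analyze the bias-variance trade-off of Q-value estimates in a planning system built on a learned model.
Understand why depth-limited MCTS on a learned model fails to outperform model-free DQN on Atari environments.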

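Proposed Method

Learn a deep generative model of Atari environments from unsupervised data to capture the environment dynamics.
Plan with depth-limited Monte Carlo tree search (MCTS) on the learned model.
Use a deep Q-network (DQN) to estimate Q-values at the leaves of the MCTS tree, guiding exploration and planning.
Apply a generative adversarial training objective to improve the quality and generalization of the environment model.
Analyze theoretically the bias-variance trade-off of Q-value estimates under model uncertainty.
Run rollouts during MCTS on the learned model, with no interaction with the real environment.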
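
The planning steps above can be illustrated with a minimal Python sketch. This is not the authors' implementation: it swaps Monte Carlo sampling for an exhaustive depth-limited search, and `toy_model`, `zero_q`, `lookahead_value`, and `plan_action` are hypothetical names introduced only for illustration. The "learned model" is a hand-written chain of six states with a reward for arriving at state 3, and the leaf Q-function is deliberately uninformative.

```python
from typing import Callable, Sequence, Tuple

State = int
Action = int
Model = Callable[[State, Action], Tuple[State, float]]
QFn = Callable[[State, Action], float]


def lookahead_value(state: State, depth: int, actions: Sequence[Action],
                    model: Model, q_fn: QFn, gamma: float) -> float:
    """Best discounted return from `state` using `depth` imagined model steps."""
    if depth == 0:
        # Leaf node: use the Q-function instead of searching deeper.
        return max(q_fn(state, a) for a in actions)
    best = float("-inf")
    for a in actions:
        # Imagined transition from the model; no real environment step.
        next_state, reward = model(state, a)
        tail = lookahead_value(next_state, depth - 1, actions, model, q_fn, gamma)
        best = max(best, reward + gamma * tail)
    return best


def plan_action(state: State, depth: int, actions: Sequence[Action],
                model: Model, q_fn: QFn, gamma: float) -> Action:
    """First action of the best depth-limited path (requires depth >= 1)."""
    def score(a: Action) -> float:
        next_state, reward = model(state, a)
        return reward + gamma * lookahead_value(
            next_state, depth - 1, actions, model, q_fn, gamma)
    return max(actions, key=score)


def toy_model(state: State, action: Action) -> Tuple[State, float]:
    """Hypothetical stand-in for a learned model: chain of states 0..5."""
    step = 1 if action == 1 else -1
    next_state = max(0, min(5, state + step))
    reward = 1.0 if (next_state == 3 and state != 3) else 0.0
    return next_state, reward


def zero_q(state: State, action: Action) -> float:
    """Hypothetical uninformative leaf Q-function."""
    return 0.0


ACTIONS = (0, 1)  # 0 = move left, 1 = move right
deep_choice = plan_action(0, 3, ACTIONS, toy_model, zero_q, gamma=0.99)
shallow_choice = plan_action(0, 2, ACTIONS, toy_model, zero_q, gamma=0.99)
```

In this toy setting, a depth of 3 is just enough to see the reward, so `deep_choice` moves right. At depth 2 every path scores zero and the tie falls to the first listed action, so `shallow_choice` moves left. When the search cannot reach the reward, the decision rests entirely on the quality of the leaf Q-values.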

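Experimental Results

Research Questions

RQ1: Does combining a learned environment model, MCTS, and DQN yield better sample efficiency and faster convergence than model-free DQN?
RQ2: How does the bias-variance trade-off in Q-value estimation affect the performance of a model-based planning system?
RQ3: Why does depth-limited MCTS on a learned model fail to outperform standard DQN on Atari environments?
RQ4: What are the key failure modes when depth-limited MCTS is applied to a learned generative model in continuous-control settings?
RQ5: To what extent does adversarial training of the environment model improve planning performance in GATS?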

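Key Findings

Despite theoretical advantages in controlling the bias-variance trade-off, GATS fails to outperform standard DQN on Atari environments.
The depth-limited MCTS component produces suboptimal plans because model errors propagate over longer horizons.
Q-value estimates at the MCTS leaves are highly sensitive to model inaccuracies, which erodes the benefit of model-based planning.
The generative adversarial training objective improves model quality but cannot compensate for the structural limitations of depth-limited search.
In the Atari environments tested, planning on a learned model is not inherently more sample-efficient than model-free DQN.
The failure stems from the inability of depth-limited MCTS to propagate the value of long-horizon trajectories accurately through an imperfect model.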
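
A short worked bound may help show why a shallow search leaves the estimate so dependent on the leaf Q-values. This is a simplified sketch, not the paper's own proposition: it assumes deterministic dynamics, exactly predicted states, a reward-prediction error of at most \epsilon_r per step, and a leaf Q-error of at most \epsilon_Q. The symbols H (search depth), \epsilon_r, and \epsilon_Q are introduced here for illustration.

```latex
% Depth-H lookahead estimate along the imagined path (\hat{s}_0 = s, a_0 = a):
\hat{Q}_H(s,a) = \max_{a_1,\dots,a_{H-1}}
  \left[ \sum_{h=0}^{H-1} \gamma^h \, \hat{r}(\hat{s}_h, a_h)
       + \gamma^H \max_{a'} Q_\theta(\hat{s}_H, a') \right]

% Assuming |\hat{r} - r| \le \epsilon_r and |Q_\theta - Q^*| \le \epsilon_Q everywhere:
\left| \hat{Q}_H(s,a) - Q^*(s,a) \right|
  \le \frac{1-\gamma^H}{1-\gamma} \, \epsilon_r + \gamma^H \epsilon_Q

% Example with \gamma = 0.99 and H = 4:
\gamma^H \approx 0.961, \qquad \frac{1-\gamma^H}{1-\gamma} \approx 3.94
```

Under these assumptions, a depth-4 tree with \gamma = 0.99 discounts the leaf Q-error by only about 4 percent while adding up to roughly 3.94 \epsilon_r of reward-model error. Errors in the predicted states, which this sketch ignores, would come on top of that.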


This digest was generated by AI and reviewed by a human editor.