QUICK REVIEW

[Paper Digest] Efficient Exploration with Self-Imitation Learning via Trajectory-Conditioned Policy.

Yijie Guo, Jongwook Choi|arXiv (Cornell University)|Jul 24, 2019

Reinforcement Learning in Robotics|Cited by 10

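One-Sentence Summary

This paper proposes a trajectory-conditioned policy that uses a memory buffer of past successful trajectories to explore efficiently in sparse-reward reinforcement learning. By updating the policy on diverse trajectories and encouraging it to expand beyond them, the method achieves state-of-the-art performance on challenging Atari games such as Montezuma's Revenge and Pitfall, without expert demonstrations or resetting to arbitrary states.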

ABSTRACT

Reinforcement learning with sparse rewards is challenging because an agent can rarely obtain non-zero rewards and hence, gradient-based optimization of parameterized policies can be incremental and slow. Recent work demonstrated that using a memory buffer of previous successful trajectories can result in more effective policies. However, existing methods may overly exploit past successful experiences, which can encourage the agent to adopt sub-optimal and myopic behaviors. In this work, instead of focusing on good experiences with limited diversity, we propose to learn a trajectory-conditioned policy to follow and expand diverse past trajectories from a memory buffer. Our method allows the agent to reach diverse regions in the state space and improve upon the past trajectories to reach new states. We empirically show that our approach significantly outperforms count-based exploration methods (parametric approach) and self-imitation learning (parametric approach with non-parametric memory) on various complex tasks with local optima. In particular, without using expert demonstrations or resetting to arbitrary states, we achieve the state-of-the-art scores under five billion number of frames, on challenging Atari games such as Montezuma's Revenge and Pitfall.

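Research Motivation and Objectives

Address the challenge of sparse rewards in reinforcement learning, where an agent rarely obtains non-zero rewards and policy optimization is therefore slow.
Overcome the tendency of existing self-imitation learning methods to over-exploit past successful trajectories, which leads to sub-optimal and myopic behavior.
Enable the agent to explore diverse regions of the state space by expanding on stored trajectories rather than simply imitating them.
Improve sample efficiency and performance in complex environments with local optima (such as challenging Atari games) without relying on expert demonstrations or resetting to arbitrary states.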

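Proposed Method

The method uses a memory buffer of past successful trajectories to guide exploration.
A trajectory-conditioned policy is trained to follow diverse past trajectories and generalize from them, which promotes exploration beyond the original paths.
The policy is optimized by combining an imitation loss on stored trajectories with intrinsic curiosity or intrinsic shaping, which encourages exploration of new states.
Conditioning policy updates dynamically on the trajectory context balances exploiting past successful experience against exploring new regions of the state space.
The method avoids relying on count-based intrinsic rewards or external reset mechanisms, and instead uses the memory buffer as a source of diverse behavioral priors.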
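
The buffer-and-follow idea described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Entry`, `TrajectoryBuffer`, and `imitation_reward`, the uniform sampling over end states, and the `window` and `bonus` parameters are all hypothetical. The sketch also replaces the imitation loss with a simple imitation-style reward for matching the demonstration's states in order, which is easier to show without a neural network.

```python
import random
from dataclasses import dataclass


@dataclass
class Entry:
    trajectory: list  # states visited on the way to the end state
    score: float      # cumulative reward collected along the trajectory


class TrajectoryBuffer:
    """Stores the best known trajectory for each distinct end state."""

    def __init__(self):
        self.entries = {}

    def add(self, end_state, trajectory, score):
        old = self.entries.get(end_state)
        # Prefer a higher score; break ties in favour of the shorter path.
        new_key = (score, -len(trajectory))
        if old is None or new_key > (old.score, -len(old.trajectory)):
            self.entries[end_state] = Entry(list(trajectory), score)

    def sample_demo(self, rng):
        # Assumption: sample uniformly over end states, so a single
        # high-scoring trajectory cannot dominate the demonstrations.
        end_state = rng.choice(list(self.entries))
        return self.entries[end_state].trajectory


def imitation_reward(demo, pointer, state, window=3, bonus=0.1):
    """Reward the agent for reaching one of the next `window` demo states.

    `pointer` is the index of the last matched demo state (-1 at the start).
    Returns (reward, new_pointer).
    """
    lookahead = demo[pointer + 1 : pointer + 1 + window]
    if state in lookahead:
        return bonus, pointer + 1 + lookahead.index(state)
    return 0.0, pointer


# Usage: store two trajectories, pick one as a demonstration, and replay it.
buffer = TrajectoryBuffer()
buffer.add(end_state=(0, 2), trajectory=[(0, 0), (0, 1), (0, 2)], score=1.0)
buffer.add(end_state=(1, 0), trajectory=[(0, 0), (1, 0)], score=0.0)
demo = buffer.sample_demo(random.Random(0))

pointer, total = -1, 0.0
for state in demo:
    reward, pointer = imitation_reward(demo, pointer, state)
    total += reward
```

Once the pointer reaches the end of the demonstration, the imitation reward stops and the agent is free to act beyond the stored path. In this sketch, that is the point at which expansion past known trajectories would take place.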

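Experimental Results

Research Questions

RQ1: Does a trajectory-conditioned policy explore sparse-reward environments more efficiently than standard self-imitation learning?
RQ2: In environments with local optima, does expanding beyond stored trajectories lead to better generalization and performance?
RQ3: Can the method achieve state-of-the-art results on challenging Atari games without expert demonstrations or random state resets?
RQ4: To what extent does the diversity of past trajectories affect the agent's ability to discover new high-reward states?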

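Key Findings

The proposed method significantly outperforms count-based exploration methods on complex tasks with sparse rewards and local optima.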
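Within five billion frames, the method achieves state-of-the-art performance on Montezuma's Revenge and Pitfall without using expert demonstrations or resetting to arbitrary states.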
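By expanding on past trajectories rather than simply imitating them, the method lets the agent reach diverse regions of the state space.
By avoiding over-reliance on past experience, the method reduces myopic behavior and improves long-term learning efficiency.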


This digest was generated by AI and reviewed by human editors.