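[Paper Digest] Reinforcement Learning with Sparse Rewards using Guidance from Offline Demonstration
LOGO uses offline demonstration data to guide online TRPO learning, reaching near-optimal performance in sparse-reward reinforcement learning and remaining effective when the demonstrations contain only incomplete state observations.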
A major challenge in real-world reinforcement learning (RL) is the sparsity of reward feedback. Often, what is available is an intuitive but sparse reward function that only indicates whether the task is completed partially or fully. However, the lack of carefully designed, fine grain feedback implies that most existing RL algorithms fail to learn an acceptable policy in a reasonable time frame. This is because of the large number of exploration actions that the policy has to perform before it gets any useful feedback that it can learn from. In this work, we address this challenging problem by developing an algorithm that exploits the offline demonstration data generated by a sub-optimal behavior policy for faster and efficient online RL in such sparse reward settings. The proposed algorithm, which we call the Learning Online with Guidance Offline (LOGO) algorithm, merges a policy improvement step with an additional policy guidance step by using the offline demonstration data. The key idea is that by obtaining guidance from - not imitating - the offline data, LOGO orients its policy in the manner of the sub-optimal policy, while yet being able to learn beyond and approach optimality. We provide a theoretical analysis of our algorithm, and provide a lower bound on the performance improvement in each learning episode. We also extend our algorithm to the even more challenging incomplete observation setting, where the demonstration data contains only a censored version of the true state observation. We demonstrate the superior performance of our algorithm over state-of-the-art approaches on a number of benchmark environments with sparse rewards and censored state. Further, we demonstrate the value of our approach via implementing LOGO on a mobile robot for trajectory tracking and obstacle avoidance, where it shows excellent performance.
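Motivation and Objectives
- Address the difficulty of learning from sparse reward signals in RL.
- Use offline demonstration data from a sub-optimal policy to guide and bootstrap online learning.
- Develop a two-step LOGO framework that combines a policy improvement step with a demonstration-based policy guidance step.
- Provide theoretical guarantees on performance improvement and extend the method to the incomplete observation setting.
- Demonstrate effectiveness on MuJoCo benchmarks and in robot experiments with a TurtleBot.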
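Proposed Method
- Use TRPO for the policy improvement step, which produces a candidate policy.
- Add a policy guidance step that searches, within a trust region around the candidate policy, for a policy close to the offline behavior policy.
- Introduce a surrogate objective that approximates the policy-dependent KL divergence using samples from the intermediate policy.
- Derive an extension of the performance difference lemma to policy-dependent rewards to justify the surrogate objective.
- Give an implementable update based on Taylor expansions, which results in two TRPO-like updates.
- Extend LOGO to incomplete observations by projecting states and training a discriminator to estimate the policy-dependent reward from the partial data.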
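The policy guidance step described above can be illustrated on a toy problem. The following is a minimal sketch, assuming a single state and strictly positive categorical policies. It is not the paper's update: instead of the Taylor-expansion-based TRPO-like step, it solves the small constrained problem exactly, using the fact that the minimizer of KL(q || behavior) subject to KL(q || candidate) <= delta lies on the geometric path between the two distributions. The names `kl`, `geometric_mix`, `guidance_step`, and `delta` are hypothetical and chosen for illustration only.

```python
import numpy as np


def kl(p, q):
    """KL(p || q) for strictly positive categorical distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * (np.log(p) - np.log(q))))


def geometric_mix(candidate, behavior, t):
    """Normalized candidate^(1-t) * behavior^t, computed in log space."""
    logits = (1.0 - t) * np.log(candidate) + t * np.log(behavior)
    logits -= logits.max()  # shift for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()


def guidance_step(candidate, behavior, delta, iters=60):
    """Toy guidance step (illustrative assumption, not the paper's update).

    Returns the distribution closest to `behavior` in KL(q || behavior)
    that stays inside the trust region KL(q || candidate) <= delta.
    """
    candidate = np.asarray(candidate, dtype=float)
    behavior = np.asarray(behavior, dtype=float)
    # If the behavior policy already lies in the trust region, take it.
    if kl(behavior, candidate) <= delta:
        return behavior.copy()
    # KL(q_t || candidate) grows monotonically with t on the geometric
    # path, so bisection finds the largest feasible t.
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if kl(geometric_mix(candidate, behavior, mid), candidate) <= delta:
            lo = mid
        else:
            hi = mid
    return geometric_mix(candidate, behavior, lo)
```

Calling `guidance_step([0.7, 0.2, 0.1], [0.1, 0.2, 0.7], delta=0.05)` moves the candidate toward the behavior policy only as far as the trust region allows. Shrinking `delta` over iterations weakens the pull of the demonstrations, which is one way to picture how a policy can be guided by sub-optimal data early on without being tied to it later.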
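Experimental Results
Research Questions
- RQ1: In sparse-reward settings, does LOGO, by using offline demonstrations, improve performance over plain TRPO?
- RQ2: How does guidance from a sub-optimal behavior policy affect exploration and sample efficiency?
- RQ3: What theoretical guarantee holds for the performance improvement in each learning episode?
- RQ4: Can LOGO be extended to settings with incomplete state observations while remaining effective?
- RQ5: Do the results carry over from MuJoCo benchmarks to TurtleBot tasks in Gazebo and in the real world?
Key Findings
- In sparse-reward environments, LOGO learns faster than the TRPO baseline and imitation learning methods and reaches near-optimal performance.
- The two-step LOGO procedure (policy improvement plus policy guidance) comes with a formal performance guarantee, and guidance from the behavior policy accelerates early learning.
- On standard MuJoCo benchmarks, LOGO, despite seeing only sparse rewards, can match the performance of the best algorithm trained with dense rewards.
- The framework extends to the incomplete observation setting by using a discriminator-based surrogate to estimate the policy-dependent reward, and it retains strong performance there.
- TurtleBot experiments in Gazebo and in the real world show effective waypoint tracking and obstacle avoidance.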
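The discriminator-based extension to incomplete observations mentioned in the findings can be sketched as follows. This is a hedged illustration under several assumptions that do not come from the paper: the discriminator is a plain logistic regression trained by gradient descent, the projection simply keeps a subset of state dimensions, and the guidance reward is taken to be the discriminator logit, which estimates the log ratio between demonstration and policy densities over the observed dimensions. The names `project`, `train_discriminator`, and `guidance_reward` are hypothetical.

```python
import numpy as np


def project(states, observed_dims):
    """Keep only the state dimensions that the demonstrations record."""
    return np.asarray(states, dtype=float)[:, observed_dims]


def _with_bias(obs):
    """Append a constant feature so the model has an intercept."""
    obs = np.asarray(obs, dtype=float)
    return np.hstack([obs, np.ones((len(obs), 1))])


def train_discriminator(demo_obs, policy_obs, lr=0.5, steps=500):
    """Logistic regression: label 1 for demonstration, 0 for policy data."""
    x = _with_bias(np.vstack([demo_obs, policy_obs]))
    y = np.concatenate([np.ones(len(demo_obs)), np.zeros(len(policy_obs))])
    w = np.zeros(x.shape[1])
    for _ in range(steps):
        prob = 1.0 / (1.0 + np.exp(-(x @ w)))
        w -= lr * x.T @ (prob - y) / len(y)  # gradient of the mean log loss
    return w


def guidance_reward(obs, w):
    """Discriminator logit, log(D / (1 - D)), used as a surrogate reward.

    Higher values mark projected states that look more like the
    demonstration data than like the current policy's data.
    """
    return _with_bias(obs) @ w
```

In use, both the demonstration states and the current policy's rollouts would be passed through `project` with the same `observed_dims` before training, so that the discriminator only ever sees the censored view that the demonstration data provides.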
This digest was generated by AI and reviewed by a human editor.