QUICK REVIEW

[Paper Review] Search on the Replay Buffer: Bridging Planning and Reinforcement Learning

Benjamin Eysenbach, Ruslan Salakhutdinov|arXiv (Cornell University)|Jun 12, 2019

Reinforcement Learning in Robotics | Cited by 70

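One-Sentence Summary

SoRB combines planning via graph search with goal-conditioned reinforcement learning: it builds a distance graph over the states in the replay buffer and plans shortest paths to distant goals.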

ABSTRACT

The history of learning for control has been an exciting back and forth between two broad classes of algorithms: planning and reinforcement learning. Planning algorithms effectively reason over long horizons, but assume access to a local policy and distance metric over collision-free paths. Reinforcement learning excels at learning policies and the relative values of states, but fails to plan over long horizons. Despite the successes of each method in various domains, tasks that require reasoning over long horizons with limited feedback and high-dimensional observations remain exceedingly challenging for both planning and reinforcement learning algorithms. Frustratingly, these sorts of tasks are potentially the most useful, as they are simple to design (a human only need to provide an example goal state) and avoid reward shaping, which can bias the agent towards finding a sub-optimal solution. We introduce a general control algorithm that combines the strengths of planning and reinforcement learning to effectively solve these tasks. Our aim is to decompose the task of reaching a distant goal state into a sequence of easier tasks, each of which corresponds to reaching a subgoal. Planning algorithms can automatically find these waypoints, but only if provided with suitable abstractions of the environment -- namely, a graph consisting of nodes and edges. Our main insight is that this graph can be constructed via reinforcement learning, where a goal-conditioned value function provides edge weights, and nodes are taken to be previously seen observations in a replay buffer. Using graph search over our replay buffer, we can automatically generate this sequence of subgoals, even in image-based environments. Our algorithm, search on the replay buffer (SoRB), enables agents to solve sparse reward tasks over one hundred steps, and generalizes substantially better than standard RL algorithms.

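Motivation and Goals

Tackle long-horizon, sparse-reward tasks with high-dimensional observations by decomposing the goal into subgoals that are discovered automatically by planning over past observations.
Use a goal-conditioned policy to reach each subgoal, treating the replay buffer as a nonparametric state graph for planning.
Use distributional reinforcement learning and ensembles to obtain robust distance estimates that guide the graph search.
Demonstrate improved performance over standard reinforcement learning on long-horizon navigation tasks, along with generalization to unseen environments.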

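Proposed Method

Learn a goal-conditioned policy and its Q-function with an off-policy reinforcement learning algorithm, combined with goal relabeling and distributional reinforcement learning.
Define the shortest-path distance d_sp(s, s_g) and relate V(s, s_g) and Q(s, a, s_g) to the negative shortest-path distance: with a reward of -1 per step until the goal is reached, V(s, s_g) = -d_sp(s, s_g).
Build a graph over the replay-buffer observations, with edge weights equal to the predicted distances; edges whose predicted distance exceeds MaxDist are dropped.
Run Dijkstra's algorithm on the replay-buffer graph to find the shortest path from the start state to the goal.
At execution time, take the sequence of waypoints along the planned path and condition the policy on the next waypoint, or directly on the goal when the goal is estimated to be closer.
Strengthen the distance estimates and capture their uncertainty with distributional reinforcement learning (bins representing the number of steps to the goal) and ensembles.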
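
The graph-construction, search, and waypoint steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `dist` stands in for the learned pairwise distance predictions, the names `build_graph`, `dijkstra`, and `choose_subgoal` are hypothetical, and the waypoint-versus-goal rule is one reading of the "condition on the goal when it is closer" step.

```python
import heapq
import math


def build_graph(dist, max_dist):
    """Keep only directed edges whose predicted distance is below max_dist."""
    n = len(dist)
    return {
        i: [(j, dist[i][j]) for j in range(n) if j != i and dist[i][j] < max_dist]
        for i in range(n)
    }


def dijkstra(graph, source, target):
    """Return (total_distance, path); (inf, []) if target is unreachable."""
    best = {source: 0.0}
    parent = {}
    frontier = [(0.0, source)]
    while frontier:
        d, u = heapq.heappop(frontier)
        if u == target:
            break
        if d > best.get(u, math.inf):
            continue  # stale queue entry
        for v, w in graph[u]:
            nd = d + w
            if nd < best.get(v, math.inf):
                best[v] = nd
                parent[v] = u
                heapq.heappush(frontier, (nd, v))
    if target not in best:
        return math.inf, []
    path = [target]
    while path[-1] != source:
        path.append(parent[path[-1]])
    return best[target], path[::-1]


def choose_subgoal(dist, max_dist, start, goal):
    """Pick the node the policy should be conditioned on (assumed rule)."""
    _, path = dijkstra(build_graph(dist, max_dist), start, goal)
    if len(path) < 3:
        return goal  # no path found, or the goal is a direct neighbor
    waypoint = path[1]
    d_goal = dist[start][goal]
    if dist[start][waypoint] < d_goal or d_goal > max_dist:
        return waypoint
    return goal


# Toy example: five states on a line, with |i - j| standing in for the
# learned distance, and MaxDist = 1.5 so only neighbors are connected.
toy = [[abs(i - j) for j in range(5)] for i in range(5)]
total, path = dijkstra(build_graph(toy, 1.5), 0, 4)
```

In the toy chain, the direct edge from state 0 to state 4 is pruned by MaxDist, so the search has to route through the intermediate states, which then serve as waypoints. In a real agent, the current observation and the goal would be added as extra nodes and `dist` would come from the learned critic.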

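Experimental Results

Research Questions

RQ1: Can planning via graph search over the replay buffer find sequences of subgoals that reach distant goals in high-dimensional environments?
RQ2: Do the distance estimates learned with distributional reinforcement learning and ensembles give SoRB reliable guidance for planning?
RQ3: Compared with standard goal-conditioned reinforcement learning, does SoRB perform better on long-horizon, sparse-reward tasks and generalize to unseen environments?
RQ4: On image-based navigation tasks, how does SoRB compare with Semi-Parametric Topological Memory (SPTM) and other baselines?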

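Key Findings

SoRB solves long-horizon, sparse-reward tasks spanning more than 100 steps and generalizes better than standard RL methods.
Graph search over the replay buffer, guided by distances derived from the goal-conditioned value function, yields effective waypoint sequences for navigation in image-based domains.
Distributional reinforcement learning and ensembles markedly improve distance estimation and planning robustness, especially for more distant goals.
In visual navigation, SoRB clearly outperforms baselines including SPTM, C51, VIN, and HER, with the gap widening as the distance to the goal grows.
SoRB generalizes to new houses in SUNCG and maintains a high goal-reaching success rate as distance increases, whereas plain goal-conditioned reinforcement learning struggles in this setting.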
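
The finding on distributional RL and ensembles concerns how a scalar distance is read off the learned critic. A minimal sketch of that step follows, under assumptions not confirmed by this summary: each ensemble member outputs a categorical distribution whose bin index is the number of steps to the goal, and the members are combined pessimistically with `max`. The names `expected_distance` and `ensemble_distance` are hypothetical.

```python
def expected_distance(probs):
    """Expected step count under a categorical distribution over bins 0..N-1."""
    return sum(steps * p for steps, p in enumerate(probs))


def ensemble_distance(member_probs, aggregate=max):
    """Combine per-member expected distances (max = pessimistic, an assumption)."""
    return aggregate([expected_distance(p) for p in member_probs])


# Two hypothetical ensemble members that disagree about the same state pair.
members = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
pessimistic = ensemble_distance(members)
```

Taking the larger estimate makes the planner less likely to trust an edge that only one member believes is short; passing a different `aggregate` (for example an average) trades that caution for smoother estimates.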


This review was generated by AI and reviewed by a human editor.