QUICK REVIEW

[Paper Review] Search on the Replay Buffer: Bridging Planning and Reinforcement Learning

Benjamin Eysenbach, Ruslan Salakhutdinov | arXiv (Cornell University) | Jun 12, 2019

Reinforcement Learning in Robotics | References: 63 | Cited by: 39

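One-Sentence Summary

SoRB combines planning with deep reinforcement learning: it builds a graph over the observations stored in the replay buffer, takes edge weights from the distance estimates learned by a goal-conditioned policy, and runs shortest-path planning to solve long-horizon, sparse-reward tasks. It outperforms standard RL and related planning-RL hybrids on image-based navigation and generalizes to unseen environments.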

ABSTRACT

The history of learning for control has been an exciting back and forth between two broad classes of algorithms: planning and reinforcement learning. Planning algorithms effectively reason over long horizons, but assume access to a local policy and distance metric over collision-free paths. Reinforcement learning excels at learning policies and the relative values of states, but fails to plan over long horizons. Despite the successes of each method in various domains, tasks that require reasoning over long horizons with limited feedback and high-dimensional observations remain exceedingly challenging for both planning and reinforcement learning algorithms. Frustratingly, these sorts of tasks are potentially the most useful, as they are simple to design (a human need only provide an example goal state) and avoid reward shaping, which can bias the agent towards finding a sub-optimal solution. We introduce a general control algorithm that combines the strengths of planning and reinforcement learning to effectively solve these tasks. Our aim is to decompose the task of reaching a distant goal state into a sequence of easier tasks, each of which corresponds to reaching a subgoal. Planning algorithms can automatically find these waypoints, but only if provided with suitable abstractions of the environment -- namely, a graph consisting of nodes and edges. Our main insight is that this graph can be constructed via reinforcement learning, where a goal-conditioned value function provides edge weights, and nodes are taken to be previously seen observations in a replay buffer. Using graph search over our replay buffer, we can automatically generate this sequence of subgoals, even in image-based environments. Our algorithm, search on the replay buffer (SoRB), enables agents to solve sparse reward tasks over one hundred steps, and generalizes substantially better than standard RL algorithms.
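
The abstract's statement that a goal-conditioned value function provides edge weights can be made concrete under one common reward convention. The block below is a sketch, assuming a reward of -1 per step, no discounting, and episodes that end when the goal is reached; the symbol W for the edge weight is a label introduced here for illustration, and in practice the learned estimate of the distance stands in for the true shortest-path distance.

```latex
\begin{align}
  V(s, s_g) &= \max_{a} Q(s, a, s_g), \\
  d_{sp}(s, s_g) &= -V(s, s_g), \\
  W(s_i \to s_j) &=
    \begin{cases}
      d_{sp}(s_i, s_j) & \text{if } d_{sp}(s_i, s_j) < \text{MaxDist}, \\
      \infty & \text{otherwise}.
    \end{cases}
\end{align}
```

Under these assumptions the return from a state equals minus the number of steps needed to reach the goal, so negating the value gives a distance, and pairs predicted to be farther apart than MaxDist are treated as unconnected.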

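Motivation and Objectives

Solve long-horizon control tasks with high-dimensional observations without reward shaping or demonstrations.
Decompose long-horizon planning into subgoals via graph search over previously seen states.
Learn distance estimates from goal-conditioned RL so that planning can run on the replay buffer graph.
Demonstrate SoRB's empirical gains on image-based visual navigation and its generalization to new environments.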

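Proposed Method

Train a goal-conditioned policy and its Q/value function with off-policy RL, using goal relabeling and distributional RL.
Define the distance d_sp(s, s_g) as the number of steps on the shortest path between two states under the learned policy; V(s, s_g) and Q(s, a, s_g) are then tied to the negative of this shortest-path distance.
Build a weighted directed graph over replay buffer observations, with edge weights equal to the predicted distances; edges whose predicted distance exceeds MaxDist are dropped.
Run Dijkstra's algorithm on the buffer graph to find the shortest path, then steer the goal-conditioned policy toward the intermediate waypoints.
Ensemble several Q-networks to obtain robust distance estimates for planning; use distributional RL to represent uncertainty over distances.
Algorithm 1 (SearchPolicy) plans over the replay buffer and, depending on the predicted distances and MaxDist, conditions the policy on either the next waypoint or the final goal.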
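
The graph construction, Dijkstra search, and waypoint selection described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the function names (`build_graph`, `dijkstra_path`, `search_policy_target`) are hypothetical, `dist` stands in for the learned distance estimate, the graph is rebuilt on every call for simplicity, and the waypoint-versus-goal rule is one plausible reading of the SearchPolicy step.

```python
import heapq
import math
from typing import Callable, Dict, Hashable, List, Sequence

State = Hashable
DistFn = Callable[[State, State], float]  # stand-in for the learned distance


def build_graph(nodes: Sequence[State], dist: DistFn,
                max_dist: float) -> Dict[State, Dict[State, float]]:
    """Directed graph: keep edge u -> v only if its predicted distance < max_dist."""
    graph: Dict[State, Dict[State, float]] = {u: {} for u in nodes}
    for u in nodes:
        for v in nodes:
            if u != v:
                d = dist(u, v)
                if d < max_dist:
                    graph[u][v] = d
    return graph


def dijkstra_path(graph: Dict[State, Dict[State, float]],
                  source: State, target: State) -> List[State]:
    """Return the nodes on a shortest path, or [] if the target is unreachable."""
    best = {source: 0.0}
    parent: Dict[State, State] = {}
    frontier = [(0.0, 0, source)]
    counter = 1  # tie-breaker so the heap never compares states directly
    done = set()
    while frontier:
        cost, _, u = heapq.heappop(frontier)
        if u in done:
            continue
        done.add(u)
        if u == target:
            break
        for v, w in graph.get(u, {}).items():
            new_cost = cost + w
            if new_cost < best.get(v, math.inf):
                best[v] = new_cost
                parent[v] = u
                heapq.heappush(frontier, (new_cost, counter, v))
                counter += 1
    if target not in done:
        return []
    path = [target]
    while path[-1] != source:
        path.append(parent[path[-1]])
    return path[::-1]


def search_policy_target(state: State, goal: State, buffer: Sequence[State],
                         dist: DistFn, max_dist: float) -> State:
    """Choose what the goal-conditioned policy is conditioned on:
    the first waypoint on the planned path, or the final goal."""
    if state == goal:
        return goal
    nodes = [state] + [b for b in buffer if b not in (state, goal)] + [goal]
    path = dijkstra_path(build_graph(nodes, dist, max_dist), state, goal)
    if len(path) < 3:  # no path found, or no intermediate waypoint
        return goal
    waypoint = path[1]
    d_goal = dist(state, goal)
    # Assumed rule: follow the waypoint if it is closer than the goal,
    # or if the goal lies beyond the trusted range of the distance estimate.
    if dist(state, waypoint) < d_goal or d_goal > max_dist:
        return waypoint
    return goal
```

On a toy one-dimensional world where states are integers and `dist` is the absolute difference, `search_policy_target(0, 10, [2, 4, 6, 8], lambda a, b: float(abs(a - b)), 3.0)` returns the waypoint `2`, since the goal is too far to trust directly. A practical version would cache the pairwise distances over the buffer rather than rebuilding the graph at every step.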

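Experimental Results

Research Questions

RQ1: Can graph search over the replay buffer, guided by a goal-conditioned value function, plan reliably toward distant goals in high-dimensional observation spaces?
RQ2: Do distance estimates obtained through distributional RL and ensembling provide a robust planning signal for image-based navigation tasks?
RQ3: How does SoRB compare with standard goal-conditioned RL and prior planning-RL hybrids in long-horizon, sparse-reward settings?
RQ4: Can SoRB, which plans over previously seen observations, generalize to unseen environments such as new houses?
RQ5: Which components (distance estimation, ensembling, distributional RL) are critical to performance and robustness?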

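Key Findings

SoRB solves long-horizon, sparse-reward tasks spanning more than 100 steps and plans more effectively than standard RL.
In visual navigation with image observations, SoRB reaches distant goals, whereas standard goal-conditioned RL struggles beyond short horizons.
SoRB significantly outperforms baselines such as SPTM, VIN, HER, and C51 on visual navigation tasks, especially as the goal distance grows.
Ensembling value functions yields a 10-20% improvement for goals at least 10 steps away, and distributional RL is essential for learning meaningful distance estimates.
SoRB generalizes to new, unseen SUNCG houses, reaching roughly 80% success on goals 10 steps away, far above goal-conditioned RL alone.
Compared with SPTM, SoRB produces more accurate distance predictions that agree better with actual navigability (as measured by precision-recall).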
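
The roles of distributional RL and ensembling in the findings above can be illustrated with a small sketch. It rests on assumptions not stated in this review: each ensemble member outputs a categorical distribution whose bin i means "the goal is i steps away" (with the last bin absorbing longer distances), and the ensemble is aggregated pessimistically by taking the largest expected distance. The function names are hypothetical.

```python
from typing import Sequence


def expected_distance(probs: Sequence[float]) -> float:
    """Expected step count from a categorical distribution over distances.

    probs[i] is the predicted probability that the goal is i steps away;
    the last bin is assumed to absorb all longer distances.
    """
    total = sum(probs)
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    return sum(i * p for i, p in enumerate(probs)) / total


def ensemble_distance(member_probs: Sequence[Sequence[float]]) -> float:
    """Pessimistic aggregate: the largest expected distance across members."""
    return max(expected_distance(p) for p in member_probs)
```

Taking the maximum means a pair of states is treated as close only when every ensemble member agrees, which guards the planner against a single over-optimistic edge; other aggregates such as the mean are equally easy to swap in.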

This review was generated by AI and reviewed by a human editor.