QUICK REVIEW

[Paper Review] Planning with Goal-Conditioned Policies

Soroush Nasiriany, Vitchyr H. Pong|arXiv (Cornell University)|Nov 19, 2019

Reinforcement Learning in Robotics | Cited by 53

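One-Sentence Summary

LEAP combines model-free goal-conditioned policies with planning over a learned latent state space to solve long-horizon tasks from high-dimensional observations such as images.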

ABSTRACT

Planning methods can solve temporally extended sequential decision making problems by composing simple behaviors. However, planning requires suitable abstractions for the states and transitions, which typically need to be designed by hand. In contrast, model-free reinforcement learning (RL) can acquire behaviors from low-level inputs directly, but often struggles with temporally extended tasks. Can we utilize reinforcement learning to automatically form the abstractions needed for planning, thus obtaining the best of both approaches? We show that goal-conditioned policies learned with RL can be incorporated into planning, so that a planner can focus on which states to reach, rather than how those states are reached. However, with complex state observations such as images, not all inputs represent valid states. We therefore also propose using a latent variable model to compactly represent the set of valid states for the planner, so that the policies provide an abstraction of actions, and the latent variable model provides an abstraction of states. We compare our method with planning-based and model-free methods and find that our method significantly outperforms prior work when evaluated on image-based robot navigation and manipulation tasks that require non-greedy, multi-staged behavior.

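Motivation and Objectives

Combine model-free reinforcement learning with planning to achieve temporal compositionality, without building a detailed model of the environment.
Propose subgoal planning that uses goal-conditioned value functions as an implicit model.
Learn a latent state representation that keeps subgoals on the manifold of valid states.
Show that planning over latent subgoals, combined with goal-reaching policies, outperforms prior model-free and model-based methods on vision-based tasks.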
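
To make the idea of a value function acting as an implicit model more concrete, one way to write the subgoal objective is sketched below. The notation is an assumption made for illustration and does not come from this summary: z_1, ..., z_K are latent subgoals, z_0 and z_{K+1} are the latent encodings of the current state and the final goal, t_k is the time budget for segment k, p(z) is the VAE prior, and \lambda is a hypothetical penalty weight. The sketch also assumes V is scaled so that values near zero mean "reachable within the time budget".

```latex
\vec{V}(z_{0:K+1}) =
  \bigl[\, V(z_0, z_1, t_1),\; V(z_1, z_2, t_2),\; \dots,\; V(z_K, z_{K+1}, t_{K+1}) \,\bigr]

z_{1:K}^{*} = \arg\min_{z_{1:K}} \;
  \bigl\lVert \vec{V}(z_{0:K+1}) \bigr\rVert
  \;-\; \lambda \sum_{k=1}^{K} \log p(z_k)
```

Under this reading, a small norm indicates a plan in which every hop between consecutive subgoals is judged feasible, and the log-prior term discourages latents that are unlikely to decode to valid states.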

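Proposed Method

Use goal-conditioned policies trained with temporal difference models (TDMs) as short-horizon controllers.
Plan intermediate subgoals in a low-dimensional latent space learned by a variational autoencoder (VAE).
Define a feasibility vector over the subgoals, with each entry given by a reachability value V(s, g, t), and select subgoals by minimizing its norm.
Optimize the subgoals in latent space while penalizing latents with low probability, so that they stay on the manifold of valid states.
Decode latent subgoals into actual state goals with the VAE decoder and execute them with the goal-conditioned policy.
Handle high-dimensional observations by planning in latent space rather than in raw pixels, and reuse the VAE encoder in reinforcement learning.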
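
The planning steps above can be sketched as a small, self-contained toy. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_value` is a hypothetical stand-in for a learned V(s, g, t), the latent prior is assumed to be a standard normal, the optimizer is a plain cross-entropy method chosen here for simplicity, and all names and hyperparameters (`plan_cost`, `cem_plan`, `prior_weight`, `step_size`) are invented for the example.

```python
import numpy as np

def toy_value(start, goal, horizon, step_size=0.5):
    """Hypothetical stand-in for a learned V(s, g, t).

    Returns 0 when `goal` is within `horizon` steps of `start`,
    and the negative distance shortfall otherwise.
    """
    dist = np.linalg.norm(goal - start, axis=-1)
    return -np.maximum(0.0, dist - step_size * horizon)

def plan_cost(z_start, z_goal, subgoals, horizon, prior_weight=0.1):
    """Norm of the feasibility vector plus a low-prior-probability penalty.

    subgoals has shape (K, D); the prior is assumed to be N(0, I).
    """
    waypoints = np.vstack([z_start, subgoals, z_goal])          # (K + 2, D)
    feasibility = toy_value(waypoints[:-1], waypoints[1:], horizon)  # (K + 1,)
    neg_log_prior = 0.5 * np.sum(subgoals ** 2)  # -log N(z; 0, I) up to a constant
    return np.linalg.norm(feasibility) + prior_weight * neg_log_prior

def cem_plan(z_start, z_goal, num_subgoals=3, horizon=2,
             iters=30, pop=200, elite=20, seed=0):
    """Search for latent subgoals with a simple cross-entropy method."""
    rng = np.random.default_rng(seed)
    dim = z_start.shape[0]
    mean = np.zeros((num_subgoals, dim))
    std = np.ones((num_subgoals, dim))
    for _ in range(iters):
        noise = rng.standard_normal((pop, num_subgoals, dim))
        samples = mean + std * noise
        costs = np.array([plan_cost(z_start, z_goal, s, horizon) for s in samples])
        elites = samples[np.argsort(costs)[:elite]]
        # Refit the sampling distribution to the lowest-cost candidates.
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mean

# Usage: plan three subgoals between two points in a 2-D toy latent space.
z_start, z_goal = np.array([-2.0, 0.0]), np.array([2.0, 0.0])
subgoals = cem_plan(z_start, z_goal)
print(subgoals)
print(plan_cost(z_start, z_goal, subgoals, horizon=2))
```

In a full system, each planned latent subgoal would be handed to the goal-conditioned policy in turn; here the policy and the VAE are omitted so that only the subgoal search is shown.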

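Experimental Results

Research Questions

RQ1: Can goal-conditioned policies serve as an abstraction for planning long-horizon tasks?
RQ2: When goals are high-dimensional (e.g., images), does planning over a latent representation improve feasibility and performance?
RQ3: How does LEAP compare with purely model-free and purely model-based methods on image-based navigation and manipulation tasks?
RQ4: How does reusing a pretrained VAE encoder affect learning efficiency and performance?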

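Key Findings

LEAP outperforms prior model-free and model-based methods on vision-based navigation and manipulation tasks.
Planning over three latent subgoals with a TDM-based policy reaches long-horizon goals faster than using short-horizon subgoals alone.
Optimizing over latent subgoals yields meaningful subgoals that correspond to feasible states, unlike optimizing over raw image pixels.
Reusing the VAE encoder speeds up learning compared with training the RL networks from scratch.
Ablations show that planning in latent space is significantly more effective than planning directly in image space.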

This review was generated by AI and reviewed by human editors.