QUICK REVIEW

[Paper Review] Self-Consistent Trajectory Autoencoder: Hierarchical Reinforcement Learning with Trajectory Embeddings

John D. Co-Reyes, YuXuan Liu|arXiv (Cornell University)|Jun 7, 2018

Reinforcement Learning in Robotics | References: 28 | Cited by: 67

One-Sentence Summary

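SeCTAR learns a continuous latent space of trajectories using a trajectory-level variational autoencoder (VAE) with a state decoder and a latent-conditioned policy decoder, which makes model-predictive planning in the latent space possible on long-horizon, sparse-reward tasks.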

ABSTRACT

In this work, we take a representation learning perspective on hierarchical reinforcement learning, where the problem of learning lower layers in a hierarchy is transformed into the problem of learning trajectory-level generative models. We show that we can learn continuous latent representations of trajectories, which are effective in solving temporally extended and multi-stage problems. Our proposed model, SeCTAR, draws inspiration from variational autoencoders, and learns latent representations of trajectories. A key component of this method is to learn both a latent-conditioned policy and a latent-conditioned model which are consistent with each other. Given the same latent, the policy generates a trajectory which should match the trajectory predicted by the model. This model provides a built-in prediction mechanism, by predicting the outcome of closed loop policy behavior. We propose a novel algorithm for performing hierarchical RL with this model, combining model-based planning in the learned latent space with an unsupervised exploration objective. We show that our model is effective at reasoning over long horizons with sparse rewards for several simulated tasks, outperforming standard reinforcement learning methods and prior methods for hierarchical reasoning, model-based planning, and exploration.

Motivation and Goals

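Motivate a representation-learning view of hierarchical reinforcement learning by modeling trajectories rather than raw actions.
Propose a continuous latent space of skills that supports temporally extended, reusable behaviors.
Develop a two-headed decoding framework (a state decoder and a policy decoder) whose mutual consistency makes planning possible.
Integrate model-based planning in the latent space with an unsupervised exploration objective to handle sparse rewards.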

Proposed Method

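Extend the variational autoencoder framework to trajectories with a trajectory encoder q_phi(z|tau).
Use a state decoder p_theta_SD(tau|z) to generate a state trajectory from the latent z.
Introduce a policy decoder p_theta_PD(a|s,z) that is executed in the environment to realize the trajectory encoded by the latent.
Enforce consistency between the two decoders by minimizing KL(p_theta_PD(tau|z) || p_theta_SD(tau|z)) while maximizing the ELBO.
Use recurrent networks for the trajectory encoder and state decoder, and a feedforward network for the policy decoder.
Plan in the latent space with model predictive control (MPC), using the state decoder as a predictive model of closed-loop behavior.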
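
The two training terms named in the list above can be written out as follows. This is a sketch in this summary's notation, not a formula quoted from the paper. The prior p(z) is an assumption introduced here (a standard VAE would use a unit Gaussian), and the summary does not say how the two terms are weighted or which parameters receive gradients from the consistency term.

```latex
\begin{align}
% Evidence lower bound for the trajectory encoder and state decoder (maximized)
\mathcal{L}_{\mathrm{ELBO}}(\phi, \theta_{SD})
  &= \mathbb{E}_{q_\phi(z \mid \tau)}\big[\log p_{\theta_{SD}}(\tau \mid z)\big]
     - D_{\mathrm{KL}}\big(q_\phi(z \mid \tau) \,\|\, p(z)\big) \\
% Consistency between the policy decoder and the state decoder (minimized)
\mathcal{L}_{\mathrm{cons}}(\theta_{PD})
  &= \mathbb{E}_{z}\Big[ D_{\mathrm{KL}}\big(p_{\theta_{PD}}(\tau \mid z) \,\|\, p_{\theta_{SD}}(\tau \mid z)\big) \Big]
\end{align}
```

Here p_theta_PD(tau|z) denotes the distribution over trajectories obtained by running the policy decoder in the environment. The consistency term is zero exactly when that distribution matches the state decoder's prediction for the same z, which is the property the planner relies on.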

Experimental Results

Research Questions

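RQ1: Can a continuous latent space of trajectories be learned without hand-specified subgoals or discrete skills?
RQ2: Does jointly training a trajectory-level VAE and a latent-conditioned policy enable reliable planning over long horizons?
RQ3: Does model-based planning in the latent space, combined with an entropy-based exploration objective, improve performance on sparse-reward tasks?
RQ4: Does the state decoder give meaningful outcome predictions for high-level latent actions?
RQ5: How does SeCTAR compare with existing model-free, model-based, and hierarchical RL methods on long-horizon tasks?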

Key Findings

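SeCTAR can plan over long trajectories and outperforms several baselines on long-horizon, sparse-reward tasks.
The latent-space MPC planner uses the state decoder as a trajectory predictor to select the latent actions that maximize reward.
Joint training yields a state decoder and a policy decoder that agree with each other, which improves closed-loop planning and exploration.
Unsupervised exploration guided by the entropy of the marginal trajectory distribution improves state-space coverage and exploration quality.
Interpolating in the latent space produces coherent trajectories, indicating that the latent representation is meaningful and generalizes.
On the tasks tested, SeCTAR achieves higher performance and sample efficiency than TRPO, A3C, VIME, FeUdal Networks, and option-critic.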
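
The latent-space MPC loop mentioned in the findings above can be illustrated with a minimal random-shooting sketch. Everything in it is a toy assumption made for illustration and is not the paper's implementation: `state_decoder` is a hand-written stand-in for the learned p_theta_SD(tau|z), the reward is the negative distance to a hypothetical goal, and the policy decoder is assumed to reproduce the predicted trajectory exactly.

```python
import numpy as np


def state_decoder(s0, z, horizon=10):
    """Hypothetical stand-in for p_theta_SD(tau | z).

    Toy rule (an assumption, not a learned model): the trajectory moves from
    s0 in a straight line, with total displacement equal to the latent z.
    """
    steps = np.linspace(1.0 / horizon, 1.0, horizon)[:, None]  # shape (H, 1)
    return s0 + steps * z  # shape (H, state_dim)


def reward(trajectory, goal):
    """Toy reward: negative distance from the final predicted state to the goal."""
    return -np.linalg.norm(trajectory[-1] - goal)


def plan_latent(s0, goal, rng, num_candidates=256, latent_dim=2):
    """One MPC step: sample latents from a unit-Gaussian prior, score each
    decoded trajectory with the reward, and return the best latent."""
    candidates = rng.standard_normal((num_candidates, latent_dim))
    scores = [reward(state_decoder(s0, z), goal) for z in candidates]
    return candidates[int(np.argmax(scores))]


def run_mpc(s0, goal, num_segments=8, seed=0):
    """Replan after every segment, starting from the state reached so far."""
    rng = np.random.default_rng(seed)
    state = np.asarray(s0, dtype=float)
    goal = np.asarray(goal, dtype=float)
    for _ in range(num_segments):
        z = plan_latent(state, goal, rng)
        # Assumption: the policy decoder is perfectly consistent with the
        # state decoder, so execution ends where the prediction ends.
        state = state_decoder(state, z)[-1]
    return state


goal = np.array([3.0, 4.0])
final_state = run_mpc(s0=[0.0, 0.0], goal=goal)
start_distance = float(np.linalg.norm(goal))
final_distance = float(np.linalg.norm(final_state - goal))
print(f"distance to goal: {start_distance:.2f} -> {final_distance:.2f}")
```

Each planning step commits to a whole trajectory segment rather than a single action, which is what lets the planner reason over long horizons with few decisions. With a learned model, the gap between the predicted and the executed trajectory is what the consistency term in training is meant to keep small.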


This review was generated by AI and reviewed by a human editor.