QUICK REVIEW

[Paper Review] Dynamics-Aware Unsupervised Discovery of Skills

Archit Sharma, Shixiang Gu|arXiv (Cornell University)|Jul 2, 2019

Reinforcement Learning in Robotics|References: 68|Cited by: 75

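ONE-SENTENCE SUMMARY

DADS discovers a set of continuous, predictable skills without supervision and uses their learned dynamics for zero-shot model-based planning, outperforming strong baselines.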

ABSTRACT

Conventionally, model-based reinforcement learning (MBRL) aims to learn a global model for the dynamics of the environment. A good model can potentially enable planning algorithms to generate a large variety of behaviors and solve diverse tasks. However, learning an accurate model for complex dynamical systems is difficult, and even then, the model might not generalize well outside the distribution of states on which it was trained. In this work, we combine model-based learning with model-free learning of primitives that make model-based planning easy. To that end, we aim to answer the question: how can we discover skills whose outcomes are easy to predict? We propose an unsupervised learning algorithm, Dynamics-Aware Discovery of Skills (DADS), which simultaneously discovers predictable behaviors and learns their dynamics. Our method can leverage continuous skill spaces, theoretically, allowing us to learn infinitely many behaviors even for high-dimensional state-spaces. We demonstrate that zero-shot planning in the learned latent space significantly outperforms standard MBRL and model-free goal-conditioned RL, can handle sparse-reward tasks, and substantially improves over prior hierarchical RL methods for unsupervised skill discovery.

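RESEARCH MOTIVATION AND GOALS

Motivate learning diverse, predictable skills without any extrinsic reward, so that planning becomes easier.
Develop a skill-conditioned policy and a skill-specific transition model that make planning in the latent space possible.
Show that a continuous skill space enables richer, more controllable behaviors than a discrete set of skills.
Demonstrate zero-shot task solving by planning in the learned latent space with a model-based method.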

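PROPOSED METHOD

Maximize the conditional mutual information objective I(s′; z | s) to encourage skills that are both diverse and predictable.
Learn a skill-conditioned policy π(a|s, z) together with a skill-conditioned transition model qφ(s′|s, z).
Optimize the mutual information objective through a variational lower bound, which is tightened by minimizing the KL divergence between the true skill-conditioned transition distribution and the learned model qφ.
Compute a tractable intrinsic reward r_z(s, a, s′) that promotes predictability under qφ and diversity across z.
Use MPC in the latent space Z for model-based planning, composing the learned skills for downstream tasks without additional training.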
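
The intrinsic reward described above can be sketched as follows. This is a minimal illustration, assuming the reward takes the form r_z(s, a, s′) = log qφ(s′|s, z) − log[(1/L) Σᵢ qφ(s′|s, zᵢ)] with zᵢ sampled from the skill prior p(z); the summary itself does not spell out the formula. The Gaussian model `gaussian_log_q`, the uniform prior, and all parameter values are hypothetical stand-ins for a learned network, not the authors' implementation.

```python
import numpy as np


def gaussian_log_q(s, z, s_next, sigma=0.1):
    """Toy stand-in for log q_phi(s' | s, z) (assumption, not a learned model).

    Predicts s' = s + z with isotropic Gaussian noise of scale sigma.
    """
    diff = s_next - (s + z)
    d = diff.shape[-1]
    return (
        -0.5 * np.sum(diff ** 2, axis=-1) / sigma ** 2
        - d * np.log(sigma)
        - 0.5 * d * np.log(2.0 * np.pi)
    )


def intrinsic_reward(log_q, s, z, s_next, prior_samples):
    """log q(s'|s,z) - log((1/L) * sum_i q(s'|s,z_i)), with z_i from the skill prior.

    The first term rewards transitions the model predicts well for skill z;
    the second penalizes transitions that other skills would explain equally well.
    """
    log_num = log_q(s, z, s_next)
    log_alt = np.array([log_q(s, z_i, s_next) for z_i in prior_samples])
    # Numerically stable log-mean-exp over the L prior samples.
    m = log_alt.max()
    log_denom = m + np.log(np.mean(np.exp(log_alt - m)))
    return log_num - log_denom


rng = np.random.default_rng(0)
s = np.zeros(2)
z = np.array([1.0, 0.0])
prior_samples = rng.uniform(-1.0, 1.0, size=(64, 2))  # assumed uniform prior p(z)

# The observed transition matches what skill z predicts ...
r_match = intrinsic_reward(gaussian_log_q, s, z, s + z, prior_samples)
# ... but not what the opposite skill -z predicts.
r_mismatch = intrinsic_reward(gaussian_log_q, s, -z, s + z, prior_samples)
print(r_match, r_mismatch)
```

In this toy setting the reward is positive when the transition is distinctive to the executed skill and negative when other skills explain it better, which is the behavior the predictability and diversity terms are meant to produce.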

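EXPERIMENTAL RESULTS

RESEARCH QUESTIONS

RQ1: Does unsupervised skill learning yield a continuous, scalable latent space that is easy to predict and plan in?
RQ2: Can planning in the skill latent space solve downstream tasks with high-dimensional dynamics zero-shot?
RQ3: Are continuous skills better suited than discrete skills to hierarchical composition and long-horizon planning?
RQ4: How does skill predictability affect behavioral variance and downstream planning performance?
RQ5: How does DADS compare with standard model-based and goal-conditioned RL baselines on navigation and locomotion tasks?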

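KEY FINDINGS

DADS learns, without any reward, a diverse set of low-variance, predictable skills on MuJoCo locomotion tasks.
The continuous latent skill space yields smoother, interpolable behaviors than a discrete set of skills.
MPC planning over the learned skill dynamics solves tasks zero-shot and outperforms state-of-the-art model-based RL baselines.
Hierarchical control that combines MPPI with DADS skills outperforms DIAYN-based hierarchies and goal-conditioned RL on downstream navigation tasks.
The continuous-primitive variant outperforms the discrete variant in hierarchical composition and downstream task performance.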
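
Planning over learned skill dynamics, as referred to in the findings above, can be illustrated with a small MPPI-style loop. This is a sketch under stated assumptions rather than the paper's planner: `predict_next_state` is a hypothetical toy stand-in for the mean of qφ(s′|s, z), each skill is applied for a single model step, the task reward is the negative distance to a goal, and the horizon, sample count, and temperature are arbitrary illustrative values.

```python
import numpy as np


def predict_next_state(s, z):
    """Hypothetical stand-in for the mean of q_phi(s' | s, z): s' = s + 0.1 * z."""
    return s + 0.1 * z


def plan_skills(s0, goal, horizon=5, n_samples=256, n_iters=5,
                temperature=10.0, skill_dim=2, rng=None):
    """Return a (horizon, skill_dim) skill sequence via MPPI-style refinement."""
    rng = np.random.default_rng() if rng is None else rng
    goal = np.asarray(goal, dtype=float)
    mean = np.zeros((horizon, skill_dim))
    for _ in range(n_iters):
        # Sample candidate skill sequences around the current plan.
        noise = rng.standard_normal((n_samples, horizon, skill_dim))
        cand = np.clip(mean + noise, -1.0, 1.0)
        # Roll every candidate out through the skill-dynamics model.
        s = np.tile(np.asarray(s0, dtype=float), (n_samples, 1))
        returns = np.zeros(n_samples)
        for t in range(horizon):
            s = predict_next_state(s, cand[:, t])
            returns += -np.linalg.norm(s - goal, axis=-1)
        # Exponentiated-return weights (shifted by the max for stability).
        w = np.exp(temperature * (returns - returns.max()))
        w /= w.sum()
        mean = np.einsum("n,nhd->hd", w, cand)
    return mean


def run_mpc(s0, goal, n_steps=30, seed=0):
    """Replan at every step and execute only the first planned skill."""
    rng = np.random.default_rng(seed)
    s = np.asarray(s0, dtype=float)
    for _ in range(n_steps):
        plan = plan_skills(s, goal, rng=rng)
        # Toy assumption: the environment behaves exactly like the model.
        s = predict_next_state(s, plan[0])
    return s


start = np.zeros(2)
goal = np.array([1.0, 0.5])
final_state = run_mpc(start, goal)
print(np.linalg.norm(start - goal), np.linalg.norm(final_state - goal))
```

No policy or model parameters are updated inside the loop; only the skill sequence is optimized, which is what makes this kind of planning zero-shot with respect to the downstream task.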

This review was generated by AI and reviewed by a human editor.