QUICK REVIEW

[Paper Review] Learning Latent Plans from Play

Corey Lynch, Mohi Khansari | arXiv (Cornell University) | Mar 5, 2019

Reinforcement Learning in Robotics | References: 61 | Cited by: 25

One-Sentence Summary

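The paper proposes Play-LMP, a self-supervised method that learns a latent plan space from unlabeled human teleoperated play data, with plan discovery decoupled from policy learning, so that a single policy generalizes across 18 diverse visual manipulation tasks. Although no task labels are used during training, Play-LMP reaches an average success rate of 85.5%, outperforming 18 individually trained expert policies, and shows robustness and retrying behavior not seen in the supervised baselines.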

ABSTRACT

Acquiring a diverse repertoire of general-purpose skills remains an open challenge for robotics. In this work, we propose self-supervising control on top of human teleoperated play data as a way to scale up skill learning. Play has two properties that make it attractive compared to conventional task demonstrations. Play is cheap, as it can be collected in large quantities quickly without task segmenting, labeling, or resetting to an initial state. Play is naturally rich, covering ~4x more interaction space than task demonstrations for the same amount of collection time. To learn control from play, we introduce Play-LMP, a self-supervised method that learns to organize play behaviors in a latent space, then reuse them at test time to achieve specific goals. Combining self-supervised control with a diverse play dataset shifts the focus of skill learning from a narrow and discrete set of tasks to the full continuum of behaviors available in an environment. We find that this combination generalizes well empirically---after self-supervising on unlabeled play, our method substantially outperforms individual expert-trained policies on 18 difficult user-specified visual manipulation tasks in a simulated robotic tabletop environment. We additionally find that play-supervised models, unlike their expert-trained counterparts, are more robust to perturbations and exhibit retrying-till-success behaviors. Finally, we find that our agent organizes its latent plan space around functional tasks, despite never being trained with task labels. Videos, code and data are available at learning-from-play.github.io

Research Motivation and Objectives

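Address the challenge of acquiring a diverse, general-purpose skill repertoire in robotics without relying on expensive task-specific expert demonstrations.
Investigate whether self-supervised learning from unlabeled human play data can yield task-agnostic control over a continuous space of interactions.
Investigate whether a latent plan space learned from play data implicitly organizes itself around functional behaviors without task labels.
Evaluate how policies trained on play data compare with policies trained on expert demonstrations in terms of robustness and generalization.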

Proposed Method

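Train a goal-conditioned policy on random time windows cut from unlabeled play data, reconstructing the actions from the current state, the goal state, and a sampled latent plan.
Use two stochastic encoders: a plan recognition encoder that infers the exact behavior from the full sequence, and a plan proposal encoder that predicts the possible behaviors from the initial and final states alone.
Minimize the KL divergence between the two encoders so that proposed plans align with the behaviors actually observed in play.
Learn perception and policy from raw pixels with a single unified model, which generalizes to diverse test-time goals.
Decouple plan discovery from policy learning, so that the model can discover functional behaviors in the latent space without task supervision.
At inference time, condition the policy on the current state, the goal state, and a single latent plan sampled from the inferred distribution.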
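The training steps listed above can be sketched roughly as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: both encoders are assumed to output diagonal Gaussians, the action reconstruction term is assumed to be a mean squared error, and the names `sample_window`, `gaussian_kl`, `lmp_loss`, and the weight `beta` are hypothetical. The encoders and the policy network themselves are left out; only the window sampling and the loss arithmetic are shown.

```python
import numpy as np

def sample_window(play_states, play_actions, min_len, max_len, rng):
    """Cut one random time window out of a long, unlabeled play log."""
    length = int(rng.integers(min_len, max_len + 1))              # window length, inclusive bounds
    start = int(rng.integers(0, len(play_states) - length + 1))   # window start index
    states = play_states[start:start + length]
    actions = play_actions[start:start + length]
    goal = states[-1]  # the last state of the window is treated as the goal
    return states, actions, goal

def gaussian_kl(mu_q, log_var_q, mu_p, log_var_p):
    """KL(q || p) between two diagonal Gaussians, summed over latent dimensions.

    q stands in for the plan recognition output (sees the full window),
    p for the plan proposal output (sees only initial and final state).
    """
    var_q, var_p = np.exp(log_var_q), np.exp(log_var_p)
    return 0.5 * np.sum(
        log_var_p - log_var_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )

def lmp_loss(actions, predicted_actions, mu_q, log_var_q, mu_p, log_var_p, beta=0.01):
    """Action reconstruction error plus a weighted KL term.

    Assumptions: MSE reconstruction and the value of beta are illustrative only.
    """
    reconstruction = np.mean((actions - predicted_actions) ** 2)
    kl = gaussian_kl(mu_q, log_var_q, mu_p, log_var_p)
    return reconstruction + beta * kl
```

During training, the latent plan fed to the policy would be sampled from the recognition distribution `q`, while the KL term pulls the proposal distribution `p` toward it. That is what allows a plan to be sampled from only the current and goal states at test time.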

Experimental Results

Research Questions

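RQ1: Can self-supervised learning from unlabeled human play data let a single policy generalize across a wide range of visual manipulation tasks without task-specific supervision?
RQ2: Compared with expert-supervised learning, does learning from play data produce more robust policies that retry and recover from failure?
RQ3: Even without task labels, does the latent plan space discovered from play data organize itself around functional task categories (for example, drawer manipulation or button pressing)?
RQ4: How does a single play-supervised policy compare with multiple expert-trained policies in success rate and data efficiency?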

Key Findings

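A single Play-LMP policy achieves an average success rate of 85.5% on 18 user-specified visual manipulation tasks, outperforming 18 expert-trained behavioral cloning policies (average success rate 70.3%).
Even with only 30 minutes of play data, Play-LMP achieves a 71.8% success rate, surpassing expert policies trained on three times as much data (90 minutes) of curated demonstrations.
Play-LMP models significantly outperform expert-supervised models under perturbations of the initial state, indicating better generalization under distribution shift.
Although no task labels are used during training, the latent plan space learned by Play-LMP is organized around functional behaviors (for example, drawer manipulation or button pressing), suggesting that a form of task discovery emerges.
Play-supervised models exhibit retrying-till-success behavior after failures, which is not observed in expert-supervised models, indicating greater adaptability.
Decoupling plan discovery from policy learning in Play-LMP yields consistently better performance than the baseline (Play-GCBC), with absolute gains of up to 50 percentage points on individual tasks.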


This review was generated by AI and checked by a human editor.