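[Paper Walkthrough] One-Shot Imitation Learning
This paper introduces a meta-learning approach to one-shot imitation learning. A neural policy is conditioned on a single demonstration and uses soft attention to generalize across unseen tasks, which lets it imitate a new task from that one demonstration.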
Imitation learning has been commonly applied to solve different tasks in isolation. This usually requires either careful feature engineering, or a significant number of samples. This is far from what we desire: ideally, robots should be able to learn from very few demonstrations of any given task, and instantly generalize to new situations of the same task, without requiring task-specific engineering. In this paper, we propose a meta-learning framework for achieving such capability, which we call one-shot imitation learning. Specifically, we consider the setting where there is a very large set of tasks, and each task has many instantiations. For example, a task could be to stack all blocks on a table into a single tower, another task could be to place all blocks on a table into two-block towers, etc. In each case, different instances of the task would consist of different sets of blocks with different initial states. At training time, our algorithm is presented with pairs of demonstrations for a subset of all tasks. A neural net is trained that takes as input one demonstration and the current state (which initially is the initial state of the other demonstration of the pair), and outputs an action with the goal that the resulting sequence of states and actions matches as closely as possible with the second demonstration. At test time, a demonstration of a single instance of a new task is presented, and the neural net is expected to perform well on new instances of this new task. The use of soft attention allows the model to generalize to conditions and tasks unseen in the training data. We anticipate that by training this model on a much greater variety of tasks and settings, we will obtain a general system that can turn any demonstrations into robust policies that can accomplish an overwhelming variety of tasks. Videos available at https://bit.ly/nips2017-oneshot .
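The training setup in the abstract (pairs of demonstrations of the same task, one used for conditioning and the other as the imitation target) can be sketched as a data-sampling routine. This is a minimal illustration, not the authors' code: the function names, the dictionary layout, and the toy task data are all assumptions made for this example.

```python
import random


def sample_training_pair(demos_by_task, rng):
    """Pick one task, then two distinct demonstrations of that task.

    demos_by_task: dict mapping a task id to a list of demonstrations,
    where each demonstration is a list of (observation, action) tuples.
    Returns (task_id, conditioning_demo, target_demo).
    """
    # Only tasks with at least two demonstrations can form a pair.
    eligible = [t for t, demos in demos_by_task.items() if len(demos) >= 2]
    task_id = rng.choice(eligible)
    conditioning_demo, target_demo = rng.sample(demos_by_task[task_id], 2)
    return task_id, conditioning_demo, target_demo


def supervised_examples(conditioning_demo, target_demo):
    """Turn a pair into ((demo, observation), action) training examples.

    The policy input is the conditioning demo plus an observation from the
    target demo; the label is the action the target demo took there.
    """
    return [((conditioning_demo, obs), action) for obs, action in target_demo]


# Hypothetical toy data: two tasks, each with a few short demonstrations.
toy = {
    "stack_one_tower": [
        [("s0", "pick A"), ("s1", "place A on B")],
        [("u0", "pick C"), ("u1", "place C on D")],
    ],
    "two_block_towers": [
        [("v0", "pick A"), ("v1", "place A on B")],
        [("w0", "pick B"), ("w1", "place B on A")],
        [("x0", "pick C"), ("x1", "place C on D"), ("x2", "pick E")],
    ],
}

rng = random.Random(0)
task_id, cond, target = sample_training_pair(toy, rng)
examples = supervised_examples(cond, target)
```

Both demonstrations come from the same task but from different instances, so the network cannot succeed by copying actions step for step; it has to infer what the conditioning demonstration is trying to achieve.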
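Research Motivation and Goals
- Enable a policy to learn a new task from a single demonstration, across a potentially unbounded distribution of tasks.
- Build a training framework in which the policy maps a (demonstration, current observation) pair to an action, including on tasks not seen during training.
- Show that attention mechanisms enable generalization across different task configurations and numbers of objects.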
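Proposed Method
- Formulate the policy pi(a|o, d) as conditioned on an input demonstration d and the current observation o.
- Train on demonstrations drawn from a distribution of tasks, so that one demonstration can guide actions on a new instance of the same task.
- Use temporal dropout to downsample long demonstrations and improve generalization.
- Apply neighborhood attention over block positions to relate blocks to one another and extract relevant context.
- Adopt a three-module architecture: a demonstration network, a context network, and a manipulation network.
- Use soft attention (including multi-head attention) to handle variable-length demonstrations and a variable number of objects.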
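Two of the ingredients above, temporal dropout and soft attention over a variable-length demonstration, can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the paper's implementation: scaled dot-product scoring is used here as a stand-in because this summary does not specify the paper's attention scoring function, and always keeping the final timestep is a choice made only so the subsampled demonstration is never empty. All function and variable names are hypothetical.

```python
import math
import random


def temporal_dropout(demo, keep_prob, rng):
    """Keep each timestep independently with probability keep_prob.

    Assumption for this sketch: the final timestep is always kept, so the
    subsampled demonstration is never empty.
    """
    if not demo:
        return []
    kept = [step for step in demo[:-1] if rng.random() < keep_prob]
    kept.append(demo[-1])
    return kept


def soft_attention(query, keys, values):
    """Soft attention for one query vector (scaled dot-product stand-in).

    keys and values are equal-length lists of vectors. Their length may vary
    between calls, but the output always has the dimension of one value.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    # Subtract the max score before exponentiating for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return context, weights


rng = random.Random(0)
# Hypothetical demonstration: 10 timesteps, each a 2-d embedding.
demo = [[float(t), 1.0] for t in range(10)]
short_demo = temporal_dropout(demo, keep_prob=0.3, rng=rng)
context, weights = soft_attention([0.5, -0.5], short_demo, short_demo)
full_context, full_weights = soft_attention([0.5, -0.5], demo, demo)
```

The point of the sketch is the shape behavior: `context` and `full_context` have the same dimension even though the two demonstrations differ in length, which is what lets a fixed-size downstream network consume demonstrations of any length.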
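Experimental Results
Research Questions
- RQ1: Can a single demonstration of a new task yield robust policy execution on unseen instances?
- RQ2: Does conditioning on the full demonstration outperform conditioning on only the final state or on a limited snapshot of the trajectory?
- RQ3: In this one-shot imitation setting, does training with behavior cloning match or compete with DAGGER?
- RQ4: How well does the model generalize to tasks unseen during training in the block-stacking domain?
Main Findings
- The one-shot imitation approach lets the policy perform well on new task instances after seeing only one demonstration.
- As task difficulty (the number of stages) increases, conditioning on the entire demonstration begins to outperform conditioning on the final state.
- Temporal dropout, which downsamples demonstrations, improves generalization and acts as a regularizer.
- In this setting, behavior cloning performs comparably to DAGGER, which suggests that interactive supervision may not be needed.
- Attention visualizations show the model focusing on a small number of blocks and on key frames that correspond to task stages.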
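For reference, the behavior-cloning setup compared against DAGGER can be written as a maximum-likelihood objective over demonstration pairs. The form below is a sketch consistent with the policy pi(a|o, d) described earlier, not a formula quoted from the paper; the symbols \mathcal{T} (task distribution), \mathcal{D}(t) (demonstrations of task t), and \theta (policy parameters) are notation assumed here.

```latex
\max_{\theta}\;
\mathbb{E}_{t \sim \mathcal{T},\; d,\, d' \sim \mathcal{D}(t)}
\left[ \sum_{(o,\, a) \in d'} \log \pi_{\theta}(a \mid o, d) \right]
```

Here d is the conditioning demonstration and d' is the second demonstration of the same task, whose observation-action pairs serve as supervised targets. DAGGER differs in that the observations are collected by running the learned policy and then labeled by an expert, which is the interactive supervision the finding refers to.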
This walkthrough was generated by AI and reviewed by a human editor.