[Paper Digest] One-Shot Hierarchical Imitation Learning of Compound Visuomotor Tasks
The paper presents a method that learns and composes primitive visuomotor policies from a single human video: primitives are meta-learned, and predicted primitive phases are used both to segment the video and to execute the multi-stage task end-to-end from raw pixels.
We consider the problem of learning multi-stage vision-based tasks on a real robot from a single video of a human performing the task, while leveraging demonstration data of subtasks with other objects. This problem presents a number of major challenges. Video demonstrations without teleoperation are easy for humans to provide, but do not provide any direct supervision. Learning policies from raw pixels enables full generality but calls for large function approximators with many parameters to be learned. Finally, compound tasks can require impractical amounts of demonstration data, when treated as a monolithic skill. To address these challenges, we propose a method that learns both how to learn primitive behaviors from video demonstrations and how to dynamically compose these behaviors to perform multi-stage tasks by "watching" a human demonstrator. Our results on a simulated Sawyer robot and real PR2 robot illustrate our method for learning a variety of order fulfillment and kitchen serving tasks with novel objects and raw pixel inputs.
Research Motivation and Goals
- Learn multi-stage vision-based tasks from a single human video, without task labels or segmentation.
- Leverage demonstrations of primitive skills with other objects to enable fast adaptation to new compound tasks.
- Develop a phase-predictor mechanism to segment demonstrations and terminate primitives during execution.
- Integrate a one-shot imitator with meta-learning to translate human demonstrations into robot policies.
- Demonstrate the approach on simulated Sawyer and real PR2 robots with novel objects and raw pixel inputs.
Proposed Method
- Use domain-adaptive meta-imitation learning (DAML) to infer a primitive policy from a single human demonstration; teleoperated robot demonstrations supply the action supervision during meta-training.
- Train human and robot primitive-phase predictors to estimate completion progress of a primitive from partial demonstrations.
- Decompose a new compound human demonstration into primitives via the human phase predictor, then translate each primitive into a policy with the one-shot learner.
- Compute policies for each primitive by adapting the meta-learned parameters with a learned adaptation objective L_ψ, enabling end-to-end visuomotor policies.
- Sequentially execute primitives, using the robot phase predictor to determine when to transition to the next primitive.
- During meta-training, use primitive demonstrations across objects to learn how to imitate primitives from videos and how to compose them into new tasks.
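The segment-then-execute loop described in the steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `segment_by_phase` and `run_compound_task`, the 0.9 phase threshold, the `max_steps` cap, and the scalar "frames" are all hypothetical stand-ins. In the paper the phase predictors and the one-shot learner are neural networks operating on raw images.

```python
from typing import Callable, List, Sequence

# Hypothetical type: a phase predictor maps a partial clip to a value in [0, 1]
# estimating how much of the current primitive has been completed.
PhasePredictor = Callable[[Sequence[float]], float]


def segment_by_phase(frames: Sequence[float],
                     human_phase: PhasePredictor,
                     threshold: float = 0.9) -> List[List[float]]:
    """Split an unsegmented demonstration into primitive clips.

    The window grows from the current start frame; once the predicted phase
    reaches the (assumed) threshold, the clip is closed and a new one starts.
    """
    segments, start = [], 0
    for end in range(1, len(frames) + 1):
        if human_phase(frames[start:end]) >= threshold:
            segments.append(list(frames[start:end]))
            start = end
    if start < len(frames):
        # Assumption: keep a trailing clip that never reached the threshold.
        segments.append(list(frames[start:]))
    return segments


def run_compound_task(segments, adapt, robot_phase, step_env,
                      threshold: float = 0.9, max_steps: int = 50) -> List[int]:
    """Execute one adapted policy per clip, switching on the robot phase."""
    steps_taken = []
    for clip in segments:
        policy = adapt(clip)  # one-shot learner: human clip -> robot policy
        observations: List[float] = []
        for _ in range(max_steps):
            observations.append(step_env(policy, observations))
            if robot_phase(observations) >= threshold:
                break  # primitive judged complete; move to the next one
        steps_taken.append(len(observations))
    return steps_taken


# Toy stand-ins: each "frame" is already a progress value, so both phase
# predictors simply read the most recent entry.
last_value = lambda clip: clip[-1]
toy_adapt = lambda clip: 1.0 / len(clip)  # toy "policy": progress per step
toy_env = lambda policy, obs: (obs[-1] if obs else 0.0) + policy

demo = [0.2, 0.5, 1.0, 0.3, 1.0]
segments = segment_by_phase(demo, last_value)
steps_per_primitive = run_compound_task(segments, toy_adapt, last_value, toy_env)
print(segments)
print(steps_per_primitive)
```

The point of the sketch is the control flow: one predictor decides where the human video is cut, and a second predictor decides when the robot stops one primitive and begins the next, so no fixed primitive length has to be assumed.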
Experimental Results
Research Questions
- RQ1: Can a robot learn to perform temporally extended tasks from a single unsegmented human video by composing learned primitives?
- RQ2: Does leveraging primitive demonstrations with different objects improve one-shot imitation and composition for new compound tasks?
- RQ3: Can phase prediction effectively segment demonstrations and regulate transitions between learned primitives during execution?
- RQ4: How does DAML-based one-shot imitation compare to alternatives in end-to-end visuomotor settings?
- RQ5: Does the approach scale to novel objects and raw pixel inputs in both simulation and on real robots?
Key Findings
- One-shot skill composition (ours) achieved 73.3% success with 1 object and 46.7% with 2 objects in simulated order fulfillment, outperforming sliding-window baselines and LSTM-based learners.
- Sliding window (no phase prediction) achieved 50.0% (1 object) and 16.7% (2 objects) success; LSTM one-shot learner (no DAML) achieved 0.0% for both settings.
- On the PR2 kitchen-serving task, the one-shot skill composition method achieved 10/20 successes for same-target and 7/20 for different-target scenarios, whereas the sliding window baseline achieved 0/20 in both cases.
- The results indicate that both phase prediction and DAML-based meta-learning are essential for effective one-shot composition of primitives from raw pixels.
- Most failures were due to one-shot imitation difficulties in grasping, suggesting future improvements in one-shot visual imitation will enhance overall performance.
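Because the simulated results are given as percentages and the PR2 results as success counts, a small conversion makes the two easier to compare. The snippet below only restates the PR2 numbers listed above; the dictionary layout and the helper name `success_rate` are illustrative choices, not anything from the paper.

```python
# PR2 kitchen-serving results as listed above: (successes, trials).
pr2_counts = {
    ("ours", "same target"): (10, 20),
    ("ours", "different target"): (7, 20),
    ("sliding window", "same target"): (0, 20),
    ("sliding window", "different target"): (0, 20),
}


def success_rate(successes: int, trials: int) -> float:
    """Return the success rate as a percentage."""
    return 100.0 * successes / trials


pr2_rates = {key: success_rate(*counts) for key, counts in pr2_counts.items()}

for (method, setting), rate in pr2_rates.items():
    print(f"{method:>14} | {setting:<16} | {rate:5.1f}%")
```

With only 20 trials per setting, each trial moves the rate by 5 percentage points, so small gaps between settings should be read with that granularity in mind.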
This digest was generated by AI and reviewed by a human editor.