QUICK REVIEW

[Paper Review] Transferring End-to-End Visuomotor Control from Simulation to Real World for a Multi-Stage Task

Stephen James, Andrew J. Davison|arXiv (Cornell University)|Jul 7, 2017

Robot Manipulation and Learning | References: 34 | Cited by: 132

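One-sentence summary

The paper demonstrates that end-to-end visuomotor control learned in simulation, using domain randomisation and a CNN that maps images and joint angles to motor velocities, can be transferred to the real world to complete a multi-stage task.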

ABSTRACT

End-to-end control for robot manipulation and grasping is emerging as an attractive alternative to traditional pipelined approaches. However, end-to-end methods tend to either be slow to train, exhibit little or no generalisability, or lack the ability to accomplish long-horizon or multi-stage tasks. In this paper, we show how two simple techniques can lead to end-to-end (image to velocity) execution of a multi-stage task, which is analogous to a simple tidying routine, without having seen a single real image. This involves locating, reaching for, and grasping a cube, then locating a basket and dropping the cube inside. To achieve this, robot trajectories are computed in a simulator, to collect a series of control velocities which accomplish the task. Then, a CNN is trained to map observed images to velocities, using domain randomisation to enable generalisation to real world images. Results show that we are able to successfully accomplish the task in the real world with the ability to generalise to novel environments, including those with dynamic lighting conditions, distractor objects, and moving objects, including the basket itself. We believe our approach to be simple, highly scalable, and capable of learning long-horizon tasks that have until now not been shown with the state-of-the-art in end-to-end robot control.

Motivation and objectives

Demonstrate that end-to-end visuomotor control trained purely in simulation can operate in the real world without having seen any real images.
Learn a long-horizon, multi-stage task (locate, reach, grasp, locate basket, drop cube) via simulator-generated trajectories.
Improve generalisation to real-world variations (lighting, distractors, moving objects) through domain randomisation.
Assess how auxiliary outputs and network inputs affect transfer performance.
Evaluate robustness to environmental changes, and run ablations to identify the factors that matter most for transfer.

Proposed method

Generate a large dataset of simulated trajectories using inverse kinematics to perform a five-stage task.
Train a reactive CNN-based controller to map sequences of images and joint angles to motor velocity commands, which are executed through a PID loop.
Augment training with auxiliary outputs (cube and gripper positions) to aid learning.
Apply domain randomisation to appearance, textures, lighting, object colours, positions, distractors, and camera height to bridge the sim-to-real gap.
Use a recurrent network (LSTM) to capture state across the multi-stage task and include joint angles as part of the input.
Evaluate with a grid-based real-world test and compare performance across varied training dataset sizes and environmental conditions.
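
The domain randomisation step listed above can be pictured as drawing a fresh scene configuration before each simulated trajectory is rendered. The following is a minimal sketch under stated assumptions, not the authors' implementation: the names `SceneConfig` and `sample_scene`, the uniform distributions, and every numeric range are hypothetical placeholders chosen for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class SceneConfig:
    """One randomised simulated scene (all fields are illustrative)."""
    cube_rgb: tuple
    basket_rgb: tuple
    table_texture_id: int
    light_intensity: float
    cube_xy: tuple
    basket_xy: tuple
    num_distractors: int
    camera_height_m: float


def random_rgb(rng):
    """Return an RGB triple with each channel drawn uniformly from [0, 1)."""
    return tuple(rng.random() for _ in range(3))


def sample_scene(rng, num_textures=100):
    """Draw a scene configuration; every range here is an assumed placeholder."""
    return SceneConfig(
        cube_rgb=random_rgb(rng),
        basket_rgb=random_rgb(rng),
        table_texture_id=rng.randrange(num_textures),
        light_intensity=rng.uniform(0.5, 1.5),
        cube_xy=(rng.uniform(-0.3, 0.3), rng.uniform(-0.3, 0.3)),
        basket_xy=(rng.uniform(-0.3, 0.3), rng.uniform(-0.3, 0.3)),
        num_distractors=rng.randint(0, 3),
        camera_height_m=rng.uniform(0.9, 1.1),
    )


# A seeded generator keeps the sampled scenes reproducible.
rng = random.Random(0)
scenes = [sample_scene(rng) for _ in range(5)]
```

Each sampled configuration would be handed to the simulator before a trajectory is recorded, so that the network never sees the same appearance twice and has to rely on features that survive the change of domain.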

Experimental results

Research questions

RQ1: How does controller performance vary with training dataset size in simulation and in the real world?
RQ2: How robust is the transferred controller to novel real-world environments (distractors, moving objects, lighting changes, camera motion)?
RQ3: Which domain randomisation components most affect transfer success (textures, lighting, distractors, geometry, camera height)?
RQ4: Do auxiliary outputs and the inclusion of joint angles improve transfer performance?
RQ5: Is the LSTM component essential for success in this multi-stage task?

Key findings

A controller trained in simulation with domain randomisation transfers to real-world execution of the multi-stage task (locate, reach for, and grasp the cube, then locate the basket and drop the cube inside) without any real images.
Increasing the dataset size improves real-world performance: with roughly 1 million simulated images, the controller reaches 100% success in both simulation and real-world tests in the baseline setting without distractors.
Auxiliary outputs and joint-angle inputs provide performance gains, and removing the LSTM degrades multi-stage task success.
The controller remains robust to several real-world perturbations (distractors, moving objects, lighting changes, small camera motion) but performance degrades with strong distractors or large object appearance changes.
Ablation studies show that the LSTM plays a key role in maintaining stage context, and that joint-angle inputs help stabilise orientation during grasping.
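
One way to picture how auxiliary outputs enter training is a combined loss in which the velocity error is the main term and the cube and gripper position errors are added as extra terms. The sketch below assumes a mean-squared-error loss and a single weight `aux_weight`; both are illustrative choices, and the dictionary keys and function names are hypothetical rather than taken from the paper.

```python
def mse(pred, target):
    """Mean squared error between two equal-length sequences of floats."""
    if len(pred) != len(target):
        raise ValueError("length mismatch")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def combined_loss(pred, target, aux_weight=1.0):
    """Velocity loss plus weighted auxiliary position losses.

    pred and target are dicts with keys 'velocities', 'cube_pos' and
    'gripper_pos'. aux_weight is an assumed weighting, not a value
    reported in the paper.
    """
    control = mse(pred["velocities"], target["velocities"])
    aux = (mse(pred["cube_pos"], target["cube_pos"])
           + mse(pred["gripper_pos"], target["gripper_pos"]))
    return control + aux_weight * aux


# Toy example: the velocity and cube-position predictions are off,
# the gripper-position prediction is exact.
pred = {"velocities": [1.0, 0.0], "cube_pos": [1.0, 1.0, 1.0], "gripper_pos": [0.0, 0.0, 0.0]}
target = {"velocities": [0.0, 0.0], "cube_pos": [0.0, 0.0, 0.0], "gripper_pos": [0.0, 0.0, 0.0]}
loss_with_aux = combined_loss(pred, target)
loss_without_aux = combined_loss(pred, target, aux_weight=0.0)
```

Setting `aux_weight` to zero recovers a velocity-only objective, which corresponds to the kind of ablation that asks whether the auxiliary outputs help.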

This review was generated by AI and checked by human editors.