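[Paper Review] Learning a visuomotor controller for real world robotic grasping using simulated depth images
This paper trains a closed-loop visuomotor controller for robotic grasping using simulated depth images and a wrist-mounted depth sensor. The controller improves dynamic correction and robustness to noise relative to one-shot grasp pose detection. It transfers from simulation to a real robot and outperforms the baseline under kinematic and perceptual disturbances.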
We want to build robots that are useful in unstructured real world applications, such as doing work in the household. Grasping in particular is an important skill in this domain, yet it remains a challenge. One of the key hurdles is handling unexpected changes or motion in the objects being grasped and kinematic noise or other errors in the robot. This paper proposes an approach to learning a closed-loop controller for robotic grasping that dynamically guides the gripper to the object. We use a wrist-mounted sensor to acquire depth images in front of the gripper and train a convolutional neural network to learn a distance function to true grasps for grasp configurations over an image. The training sensor data is generated in simulation, a major advantage over previous work that uses real robot experience, which is costly to obtain. Despite being trained in simulation, our approach works well on real noisy sensor images. We compare our controller in simulated and real robot experiments to a strong baseline for grasp pose detection, and find that our approach significantly outperforms the baseline in the presence of kinematic noise, perceptual errors and disturbances of the object during grasping.
Research motivation and objectives
- Motivate robust grasping in unstructured real-world settings by addressing perceptual noise and object motion.
- Develop a closed-loop visuomotor controller that can correct misalignments during grasping.
- Eliminate dependence on specific viewing directions by mounting a depth sensor near the wrist.
- Train the controller entirely in simulation using depth images to reduce real-robot data requirements.
- Demonstrate transfer from simulated depth images to real robot performance and compare to a strong baseline.
Proposed method
- A CNN regressor predicts the distance to the nearest grasp given a depth image and a candidate hand offset.
- Training data is generated in OpenRAVE with ray-traced depth images from 12.5k scenes containing 381 graspable objects across 10 categories.
- Distance is measured in meters in pose space, with the angular component of the action weighted at 0.001 m/degree.
- The network is LeNet-like with two convolutional layers, followed by two fully connected layers, and an output predicting the distance-to-go.
- The loss is L1 regression rather than classification, so that grasp quality can be compared across poses.
- The controller iteratively selects the action minimizing predicted distance and moves a fraction of the step, then advances in z to approach the object.
- Sampling of actions is constrained to a region around the current pose to capture local gradient information and ensure stability.
- Training used stochastic gradient descent with 900k iterations, learning rate 0.001, momentum 0.9, batch size 1000.
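The distance label and the control loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation. The L2 combination of translation and weighted rotation, the `make_oracle` stand-in for the CNN (it reads known grasp poses instead of a depth image), the zero-action candidate, and every numeric default other than the 0.001 m/degree weight are hypothetical.

```python
import math
import random

ANGLE_WEIGHT = 0.001  # metres per degree, as stated in the review


def pose_distance(dx, dy, dtheta_deg):
    """Weighted pose-space distance in metres.

    Assumption: translation and weighted rotation are combined with an L2 norm.
    """
    return math.sqrt(dx * dx + dy * dy + (ANGLE_WEIGHT * dtheta_deg) ** 2)


def make_oracle(grasps):
    """Hypothetical stand-in for the trained CNN.

    The real controller predicts distance from a depth image and a candidate
    offset; here it is computed exactly from known grasps (x, y, theta_deg).
    """
    def predict(pose, action):
        x, y, _, theta = pose
        dx, dy, dth = action
        return min(
            pose_distance(x + dx - gx, y + dy - gy, theta + dth - gth)
            for gx, gy, gth in grasps
        )
    return predict


def control_step(pose, predict, rng, n_samples=100, max_xy=0.05,
                 max_rot_deg=30.0, gain=0.5, z_step=0.01):
    """One closed-loop update: sample local actions, pick the best, move partway."""
    # Candidates come from a box around the current pose. The zero action is
    # included so the hand can hold position if no sample looks better.
    candidates = [(0.0, 0.0, 0.0)]
    candidates += [
        (rng.uniform(-max_xy, max_xy),
         rng.uniform(-max_xy, max_xy),
         rng.uniform(-max_rot_deg, max_rot_deg))
        for _ in range(n_samples)
    ]
    dx, dy, dth = min(candidates, key=lambda a: predict(pose, a))
    x, y, z, theta = pose
    # Move only a fraction of the chosen offset, then advance along z.
    return (x + gain * dx, y + gain * dy, z - z_step, theta + gain * dth)


def simulate(start_pose, grasps, steps=15, seed=0):
    """Run the loop and return (initial_distance, final_distance, final_pose)."""
    rng = random.Random(seed)
    predict = make_oracle(grasps)
    zero = (0.0, 0.0, 0.0)
    pose = start_pose
    initial = predict(pose, zero)
    for _ in range(steps):
        pose = control_step(pose, predict, rng)
    return initial, predict(pose, zero), pose
```

For example, `simulate((0.04, -0.03, 0.20, 20.0), [(0.0, 0.0, 0.0)])` starts the hand 4 cm and 3 cm off in x and y and 20 degrees rotated from a single grasp. Because each step re-evaluates the prediction from the current pose, the remaining distance should shrink over the run even though any individual sampled action is imperfect.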
Experimental results
Research questions
- RQ1: Does a closed-loop visuomotor controller trained in simulation generalize to real-world depth images for grasping?
- RQ2: How does the proposed distance-to-nearest-grasp CNN compare to one-shot grasp pose detection under perceptual and kinematic disturbances?
- RQ3: Can wrist-mounted depth sensing enable view-invariant grasping policies across different grasp directions?
- RQ4: What is the impact of kinematic noise and perceptual errors on grasp success for the proposed controller versus a strong baseline?
Key findings
| Scenario | CTR success rate | GPD success rate |
|---|---|---|
| Objects in isolation | 97.5% | 97.5% |
| Clutter | 88.9% | 94.8% |
| Clutter with rotations | 77.3% | 22.5% |
- CTR (the proposed controller) matches GPD (the grasp pose detection baseline) in noise-free simulations and outperforms it under kinematic noise in simulation.
- CTR compensates for perceptual errors in single depth images by re-grasping with new depth feedback.
- On UR5 hardware, CTR achieves 97.5% success in isolation and 88.9% in clutter, comparable to GPD (97.5% and 94.8%), and clearly outperforms GPD when objects rotate or move during the grasp (77.3% versus 22.5% in clutter with rotations).
- CTR shows robustness to object shifts during grasping, where GPD performance degrades significantly.
- Simulation-trained CNN transfers well to real depth images after processing invalid readings.
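The review does not say how invalid readings were processed. As one simple possibility, assumed here purely for illustration, invalid pixels (NaN, infinite, or non-positive depth) can be replaced with a constant before the image is passed to the network. The function name `fill_invalid_depth` and the default fill value are hypothetical.

```python
import math


def fill_invalid_depth(depth, fill_value=0.0):
    """Return a copy of a 2D depth image (list of rows, in metres) in which
    NaN, infinite, and non-positive readings are replaced by fill_value."""
    return [
        [d if (math.isfinite(d) and d > 0.0) else fill_value for d in row]
        for row in depth
    ]
```

Whatever rule is used, applying the same treatment to the simulated training images keeps the two input distributions closer.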
This review was generated by AI and checked by a human editor.