QUICK REVIEW

[Paper Review] Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware

Tony Z. Zhao, Vikash Kumar | arXiv (Cornell University) | Apr 23, 2023

Robot Manipulation and Learning | Citations: 11

One-Sentence Summary

The paper presents ALOHA, a low-cost bimanual teleoperation system, and ACT, an imitation learning algorithm that uses transformers to predict chunks of actions; together they let a robot learn 6 real-world fine manipulation tasks from roughly 10 minutes of demonstrations.

ABSTRACT

Fine manipulation tasks, such as threading cable ties or slotting a battery, are notoriously difficult for robots because they require precision, careful coordination of contact forces, and closed-loop visual feedback. Performing these tasks typically requires high-end robots, accurate sensors, or careful calibration, which can be expensive and difficult to set up. Can learning enable low-cost and imprecise hardware to perform these fine manipulation tasks? We present a low-cost system that performs end-to-end imitation learning directly from real demonstrations, collected with a custom teleoperation interface. Imitation learning, however, presents its own challenges, particularly in high-precision domains: errors in the policy can compound over time, and human demonstrations can be non-stationary. To address these challenges, we develop a simple yet novel algorithm, Action Chunking with Transformers (ACT), which learns a generative model over action sequences. ACT allows the robot to learn 6 difficult tasks in the real world, such as opening a translucent condiment cup and slotting a battery with 80-90% success, with only 10 minutes worth of demonstrations. Project website: https://tonyzhaozh.github.io/aloha/

Motivation and Objectives

Demonstrate that fine manipulation can be learned on low-cost hardware using end-to-end imitation learning from real demonstrations.
Develop a compact, affordable teleoperation setup (ALOHA) to collect high-quality data for fine manipulation tasks.
Create a novel learning algorithm (ACT) that reduces the effective horizon and mitigates compounding errors in high-precision tasks.
Show that ACT outperforms prior imitation learning methods on a suite of real-world bimanual manipulation tasks.

Proposed Method

Introduce Action Chunking with Transformers (ACT), which predicts a sequence of actions for the next k timesteps rather than a single action.
Train ACT as a conditional variational autoencoder (CVAE) to capture human demonstration variability and use a transformer-based encoder/decoder for sequence modeling.
Apply temporal ensembling by overlapping action chunks and averaging predictions to produce smooth, high-precision trajectories.
Implement ACT with a CVAE where the encoder outputs a style variable z and the decoder (policy) outputs k-step action sequences conditioned on z and current observations (images + joint positions).
Use end-to-end pixel-to-action mapping (RGB images to joint actions) and train on real-world demonstrations collected with ALOHA.
Keep the hardware low-cost (two ViperX 6-DoF arms plus custom 3D-printed components) and teleoperate it by mapping the joint positions of a leader robot directly onto a follower.
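
The chunking and temporal ensembling steps listed above can be illustrated with a small, self-contained sketch. It is not the authors' implementation: `toy_policy` is a hypothetical stand-in for the learned policy, actions are one-dimensional, and the exponential weighting parameter `m` is an assumption added for illustration (with `m = 0.0` the blend reduces to the plain average described above).

```python
import math
from typing import Callable, Dict, List, Sequence

Action = List[float]


def ensemble_weights(n: int, m: float) -> List[float]:
    """Normalized weights exp(-m * i), where i = 0 is the oldest prediction."""
    raw = [math.exp(-m * i) for i in range(n)]
    total = sum(raw)
    return [w / total for w in raw]


def run_with_temporal_ensembling(
    policy: Callable[[int], Sequence[Action]],
    horizon: int,
    m: float = 0.0,
) -> List[Action]:
    """Query the policy at every step and blend all predictions made for that step.

    policy(t) returns a chunk of k actions intended for timesteps t .. t+k-1.
    m = 0.0 gives a plain average; m > 0 puts more weight on older predictions
    (an illustrative assumption, not taken from the review text).
    """
    pending: Dict[int, List[Action]] = {}  # timestep -> predictions, oldest first
    executed: List[Action] = []
    for t in range(horizon):
        # Overlapping chunks: each new chunk adds one more vote per future step.
        for offset, action in enumerate(policy(t)):
            pending.setdefault(t + offset, []).append(list(action))
        candidates = pending.pop(t)
        weights = ensemble_weights(len(candidates), m)
        dim = len(candidates[0])
        blended = [
            sum(w * a[d] for w, a in zip(weights, candidates)) for d in range(dim)
        ]
        executed.append(blended)
    return executed


def toy_policy(t: int, k: int = 3) -> List[Action]:
    """Hypothetical policy: the ideal action equals the step index, plus a bias
    whose sign alternates between queries to mimic inconsistent predictions."""
    bias = 0.3 if t % 2 == 0 else -0.3
    return [[float(t + i) + bias] for i in range(k)]


trajectory = run_with_temporal_ensembling(toy_policy, horizon=6, m=0.0)
```

In this toy setup, each executed action after the first blends predictions from up to three overlapping chunks, so the alternating bias partly cancels and the executed trajectory stays closer to the ideal one than any single chunk would.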

Experimental Results

Research Questions

RQ1: Can a low-cost, imprecise hardware setup perform fine-grained bimanual manipulation using learning from real demonstrations?
RQ2: Does an action-chunking imitation learning approach improve stability and precision over one-step policies in high-precision tasks?
RQ3: How do temporal ensembling and a CVAE-based objective affect learning from noisy human demonstrations?
RQ4: What is the practical performance of the proposed system on real-world tasks like opening a condiment cup or slotting a battery?
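
To make the "CVAE-based objective" in RQ3 concrete, the following is a minimal sketch of a generic CVAE training loss for a k-step action chunk: a reconstruction term plus a weighted KL term that pulls the encoder's style variable z toward a standard normal prior. The L1 reconstruction term, the weight `beta`, and all function names are assumptions chosen for illustration; the review does not state the exact loss used.

```python
import math
from typing import List

Chunk = List[List[float]]  # k actions, each a list of joint targets


def l1_reconstruction(pred: Chunk, target: Chunk) -> float:
    """Mean absolute error between predicted and demonstrated action chunks."""
    diffs = [abs(p - q) for pa, ta in zip(pred, target) for p, q in zip(pa, ta)]
    return sum(diffs) / len(diffs)


def kl_to_standard_normal(mu: List[float], log_var: List[float]) -> float:
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var)
    )


def cvae_loss(
    pred: Chunk,
    target: Chunk,
    mu: List[float],
    log_var: List[float],
    beta: float = 1.0,  # illustrative KL weight, not a value from the review
) -> float:
    """Reconstruction loss plus beta-weighted KL regularizer on the style variable z."""
    return l1_reconstruction(pred, target) + beta * kl_to_standard_normal(mu, log_var)


# Tiny usage example: one 2-step chunk with 2 joints and a 1-D style variable.
example_loss = cvae_loss(
    pred=[[1.0, 2.0], [3.0, 4.0]],
    target=[[1.5, 2.0], [3.0, 4.5]],
    mu=[1.0],
    log_var=[0.0],
    beta=2.0,
)
```

A larger `beta` pushes the encoder to carry less information in z, which is one knob for studying how much of the demonstration variability the style variable absorbs.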

Key Findings

ACT significantly outperforms prior imitation learning methods on both simulated and real-world tasks.
On the real-world Slide Ziploc and Slot Battery tasks, ACT achieves 88% and 96% final success, respectively, whereas other methods stall after the early subtasks.
Across two simulated and two real tasks, ACT improves the best previous method by 20-59 percentage points depending on the task and data source.
The complete ALOHA teleoperation system fits within a budget of about $20k and supports precise, contact-rich, and dynamic tasks with a real-time data collection workflow.
Training ACT takes about 5 hours on a single RTX 2080 Ti GPU, and inference takes around 0.01 seconds, fast enough for real-time control.
Demonstrations used for training amount to about 10-20 minutes per real task, illustrating efficient data collection.

This review was generated by AI and checked by a human editor.