[Paper Review] Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware
The paper presents ALOHA, a low-cost bimanual teleoperation system, and ACT, an imitation learning algorithm that uses transformers to predict chunks of actions. Together they learn 6 real-world fine manipulation tasks from about 10 minutes of demonstrations per task.
Fine manipulation tasks, such as threading cable ties or slotting a battery, are notoriously difficult for robots because they require precision, careful coordination of contact forces, and closed-loop visual feedback. Performing these tasks typically requires high-end robots, accurate sensors, or careful calibration, which can be expensive and difficult to set up. Can learning enable low-cost and imprecise hardware to perform these fine manipulation tasks? We present a low-cost system that performs end-to-end imitation learning directly from real demonstrations, collected with a custom teleoperation interface. Imitation learning, however, presents its own challenges, particularly in high-precision domains: errors in the policy can compound over time, and human demonstrations can be non-stationary. To address these challenges, we develop a simple yet novel algorithm, Action Chunking with Transformers (ACT), which learns a generative model over action sequences. ACT allows the robot to learn 6 difficult tasks in the real world, such as opening a translucent condiment cup and slotting a battery with 80-90% success, with only 10 minutes worth of demonstrations. Project website: https://tonyzhaozh.github.io/aloha/
Research Motivation and Objectives
- Demonstrate that fine manipulation can be learned on low-cost hardware using end-to-end imitation learning from real demonstrations.
- Develop a compact, affordable teleoperation setup (ALOHA) to collect high-quality data for fine manipulation tasks.
- Create a novel learning algorithm (ACT) that reduces the effective horizon and mitigates compounding errors in high-precision tasks.
- Show that ACT outperforms prior imitation learning methods on a suite of real-world bimanual manipulation tasks.
Proposed Method
- Introduce Action Chunking with Transformers (ACT), which predicts a sequence of actions for the next k timesteps rather than a single action.
- Train ACT as a conditional variational autoencoder (CVAE) to capture human demonstration variability and use a transformer-based encoder/decoder for sequence modeling.
- Apply temporal ensembling by overlapping action chunks and averaging predictions to produce smooth, high-precision trajectories.
- Implement ACT with a CVAE where the encoder outputs a style variable z and the decoder (policy) outputs k-step action sequences conditioned on z and current observations (images + joint positions).
- Use end-to-end pixel-to-action mapping (RGB images to joint actions) and train on real-world demonstrations collected with ALOHA.
- Maintain a low-cost hardware approach (two ViperX 6-DoF arms plus custom 3D-printed components) and teleoperation via joint-space mapping from a leader robot to a follower.
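The chunk prediction and temporal ensembling steps listed above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: `toy_policy` is a hypothetical stand-in for the trained policy and takes a timestep index instead of images and joint positions, and the exponential weights `exp(-m * i)` with parameter `m` are an assumed weighting scheme (the list above only says the overlapping predictions are averaged). Setting `m = 0` gives a plain average.

```python
import math
from typing import Callable, List, Sequence

Action = List[float]


def temporal_ensemble_rollout(
    policy: Callable[[int], Sequence[Action]],
    num_steps: int,
    chunk_size: int,
    m: float = 0.0,
) -> List[Action]:
    """Query `policy` at every step and blend overlapping chunk predictions.

    `policy(t)` is assumed to return `chunk_size` actions covering
    timesteps t, t+1, ..., t+chunk_size-1.
    """
    # predictions[t] holds every action proposed for timestep t, oldest first.
    predictions: List[List[Action]] = [[] for _ in range(num_steps)]
    executed: List[Action] = []

    for t in range(num_steps):
        chunk = policy(t)
        for offset, action in enumerate(chunk[:chunk_size]):
            if t + offset < num_steps:
                predictions[t + offset].append(list(action))

        candidates = predictions[t]
        # Assumed weighting: index 0 is the oldest prediction for this step.
        weights = [math.exp(-m * i) for i in range(len(candidates))]
        total = sum(weights)
        dim = len(candidates[0])
        blended = [
            sum(w * a[d] for w, a in zip(weights, candidates)) / total
            for d in range(dim)
        ]
        executed.append(blended)

    return executed


def toy_policy(t: int) -> List[Action]:
    """Hypothetical 1-D policy: the chunk predicted at step t proposes value t."""
    return [[float(t)] for _ in range(3)]


actions = temporal_ensemble_rollout(toy_policy, num_steps=4, chunk_size=3)
```

With `chunk_size=3`, each executed action from the third step onward blends three predictions made at different times, which is what smooths the trajectory. A larger `m` in this sketch shifts weight toward the oldest prediction, so new observations are incorporated more slowly.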
Experimental Results
Research Questions
- RQ1: Can a low-cost, imprecise hardware setup perform fine-grained bimanual manipulation by learning from real demonstrations?
- RQ2: Does an action-chunking imitation learning approach improve stability and precision over one-step policies in high-precision tasks?
- RQ3: How do temporal ensembling and a CVAE-based objective affect learning from noisy human demonstrations?
- RQ4: What is the practical performance of the proposed system on real-world tasks such as opening a condiment cup or slotting a battery?
Key Findings
- ACT significantly outperforms prior imitation learning methods on both simulated and real-world tasks.
- On the real-world Slide Ziploc and Slot Battery tasks, ACT achieves final success rates of 88% and 96%, respectively, whereas the other methods stall after the early subtasks.
- On the two simulated tasks, ACT improves on the best previous method by 20 to 59 percentage points in success rate, depending on the task and the data source (scripted or human demonstrations).
- The complete ALOHA teleoperation system fits within a budget of about $20k and supports precise, contact-rich, and dynamic tasks with a real-time data collection workflow.
- Training ACT requires about 5 hours on a single RTX 2080 Ti GPU, with inference around 0.01 seconds, suitable for real-time control.
- Demonstrations used for training amount to about 10-20 minutes per real task, illustrating efficient data collection.
This review was generated by AI and checked by a human editor.