QUICK REVIEW

[Paper Digest] Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware

Tony Z. Zhao, Vikash Kumar|arXiv (Cornell University)|Apr 23, 2023

Robot Manipulation and Learning | Cited by 11

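One-Sentence Summary

This paper presents ALOHA, a low-cost bimanual teleoperation system, and ACT, an imitation learning algorithm that uses a transformer to predict chunks of actions. Together they complete 6 real-world fine manipulation tasks from roughly 10 minutes of demonstrations.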

ABSTRACT

Fine manipulation tasks, such as threading cable ties or slotting a battery, are notoriously difficult for robots because they require precision, careful coordination of contact forces, and closed-loop visual feedback. Performing these tasks typically requires high-end robots, accurate sensors, or careful calibration, which can be expensive and difficult to set up. Can learning enable low-cost and imprecise hardware to perform these fine manipulation tasks? We present a low-cost system that performs end-to-end imitation learning directly from real demonstrations, collected with a custom teleoperation interface. Imitation learning, however, presents its own challenges, particularly in high-precision domains: errors in the policy can compound over time, and human demonstrations can be non-stationary. To address these challenges, we develop a simple yet novel algorithm, Action Chunking with Transformers (ACT), which learns a generative model over action sequences. ACT allows the robot to learn 6 difficult tasks in the real world, such as opening a translucent condiment cup and slotting a battery with 80-90% success, with only 10 minutes worth of demonstrations. Project website: https://tonyzhaozh.github.io/aloha/

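Research Motivation and Goals

Show that fine manipulation can be learned on low-cost hardware through end-to-end imitation learning from real demonstrations.
Build a compact, affordable teleoperation system (ALOHA) for collecting high-quality data on fine manipulation tasks.
Create a novel learning algorithm (ACT) that reduces the effective horizon and mitigates compounding errors in high-precision tasks.
Demonstrate that ACT outperforms existing imitation learning methods on a range of real-world bimanual manipulation tasks.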
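The effective-horizon goal above can be made concrete with a little arithmetic. The sketch below assumes the simplest reading of action chunking: each policy query returns k actions that are executed before the next query, so an episode of T steps needs about T / k decisions. The function name and the numbers are illustrative and do not come from the paper.

```python
import math


def num_policy_decisions(episode_steps: int, chunk_size: int) -> int:
    """Count policy queries when each query yields `chunk_size` actions
    that are all executed before the policy is queried again."""
    if episode_steps < 0 or chunk_size < 1:
        raise ValueError("episode_steps must be >= 0 and chunk_size >= 1")
    return math.ceil(episode_steps / chunk_size)


# Illustrative values only (assumptions, not figures from the paper).
episode_steps = 400
for k in (1, 10, 100):
    print(f"k={k:>3}: {num_policy_decisions(episode_steps, k)} decisions")
```

With fewer decisions per episode there are fewer points at which a small prediction error can push the policy into states it never saw in the demonstrations, which is the intuition behind the compounding-error claim.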

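Proposed Method

Introduces Action Chunking with Transformers (ACT), which predicts the action sequence for the next k timesteps rather than a single action.
Trains ACT as a conditional variational autoencoder (CVAE) to capture the variability in human demonstrations, using a transformer-based encoder and decoder for sequence modeling.
Applies temporal ensembling: action chunks are predicted so that they overlap, and the predictions for each timestep are averaged, yielding smooth, high-precision trajectories.
In the CVAE, the encoder outputs a style variable z, and the decoder (the policy) outputs a k-step action sequence conditioned on z and the current observation (images + joint positions).
Uses an end-to-end pixels-to-actions mapping (RGB images to joint actions), trained on real-world demonstrations collected with ALOHA.
Keeps the hardware low-cost (two ViperX 6-DoF arms plus custom 3D-printed components) and teleoperates through a joint-space mapping from the leader robots to the followers.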
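The temporal ensembling step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the policy is queried at every timestep, that `chunks[s][j]` holds the action predicted at query step `s` for step `s + j`, and that predictions are combined with exponential weights `exp(-m * i)` where `i = 0` is the oldest prediction. The weighting scheme, the function name, and the parameter `m` are assumptions for illustration; setting `m = 0` gives a plain average.

```python
import math


def temporal_ensemble(chunks, m=0.1):
    """Average overlapping chunk predictions for each timestep.

    chunks[s][j] is the action (a list of floats) that a policy queried
    at step s predicted for step s + j; every chunk has the same length k.
    Returns one ensembled action per timestep.
    """
    k = len(chunks[0])
    ensembled = []
    for t in range(len(chunks)):
        # Queries from step t-k+1 through t all made a prediction for step t.
        first_query = max(0, t - k + 1)
        preds = [chunks[s][t - s] for s in range(first_query, t + 1)]
        # Assumed weighting: oldest prediction first, decaying with index.
        weights = [math.exp(-m * i) for i in range(len(preds))]
        total = sum(weights)
        action = [
            sum(w * p[d] for w, p in zip(weights, preds)) / total
            for d in range(len(preds[0]))
        ]
        ensembled.append(action)
    return ensembled


# Toy data: two queries, chunk size 2, one action dimension.
toy_chunks = [[[0.0], [2.0]], [[4.0], [6.0]]]
print(temporal_ensemble(toy_chunks, m=0.0))  # [[0.0], [3.0]]
```

In the toy run, step 0 has only one prediction, while step 1 averages the second entry of the first chunk with the first entry of the second chunk. Averaging across queries smooths the executed trajectory at the cost of one policy query per timestep.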

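Experimental Results

Research Questions

RQ1: On a low-cost hardware setup with limited precision, can fine bimanual manipulation be achieved by learning from real demonstrations?
RQ2: In high-precision tasks, does an imitation learning method with action chunking outperform single-step policies in stability and precision?
RQ3: How do temporal ensembling and the CVAE-based objective affect learning from noisy human demonstrations?
RQ4: What is the actual performance on real-world tasks such as opening a condiment cup or slotting a battery?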

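Key Findings

ACT significantly outperforms existing imitation learning methods on both simulated and real-world tasks.
On the real tasks Slide Ziploc and Slot Battery, ACT reaches final success rates of 88% and 96% respectively, while the other methods stall after the early subtasks.
Across two simulated tasks and two real tasks, ACT improves on the best prior method by 20-59 percentage points, depending on the task and data source.
The complete ALOHA teleoperation system is built within a budget of roughly $20k and supports precise, contact-rich, and dynamic tasks, with a real-time data collection workflow.
Training ACT takes about 5 hours on a single RTX 2080 Ti GPU, and inference takes about 0.01 seconds, which is fast enough for real-time control.
Training uses roughly 10-20 minutes of demonstrations per real task, which indicates that data collection is efficient.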
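To see why an inference time of about 0.01 seconds is compatible with real-time control, the latency can be compared against the period of the control loop. The sketch below is a back-of-the-envelope check: the 0.01 s figure comes from the finding above, while the 50 Hz control rate and both function names are assumptions chosen for illustration.

```python
def max_query_rate_hz(inference_seconds: float) -> float:
    """Upper bound on how often the policy can be queried back to back."""
    return 1.0 / inference_seconds


def budget_fraction(inference_seconds: float, control_hz: float) -> float:
    """Share of one control period consumed by a single policy query."""
    return inference_seconds * control_hz


inference_seconds = 0.01  # from the finding above
control_hz = 50.0         # ASSUMPTION: illustrative control rate

print(f"max query rate: {max_query_rate_hz(inference_seconds):.0f} Hz")
print(f"fraction of control period: {budget_fraction(inference_seconds, control_hz):.0%}")
```

Under these assumptions a query fits inside one control period with room to spare, so the policy could be queried at every step if needed.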


This digest was generated by AI and reviewed by human editors.