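[Paper Walkthrough] Manipulate-Anything: Automating Real-World Robots using Vision-Language Models
Manipulate-Anything is a scalable method for automatically generating demonstrations for real-world robotic manipulation. It requires no privileged state information or hand-designed skills and handles diverse objects. It completes tasks zero-shot and produces data for training robust behavior cloning policies.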
Large-scale endeavors and widespread community efforts such as Open-X-Embodiment have contributed to growing the scale of robot demonstration data. However, there is still an opportunity to improve the quality, quantity, and diversity of robot demonstration data. Although vision-language models have been shown to automatically generate demonstration data, their utility has been limited to environments with privileged state information, they require hand-designed skills, and they are limited to interactions with few object instances. We propose Manipulate-Anything, a scalable automated generation method for real-world robotic manipulation. Unlike prior work, our method can operate in real-world environments without any privileged state information or hand-designed skills, and it can manipulate any static object. We evaluate our method using two setups. First, Manipulate-Anything successfully generates trajectories for all 7 real-world and 14 simulation tasks, significantly outperforming existing methods like VoxPoser. Second, Manipulate-Anything's demonstrations can train more robust behavior cloning policies than training with human demonstrations, or from data generated by VoxPoser, Scaling-up, and Code-As-Policies. We believe Manipulate-Anything can be a scalable method for both generating data for robotics and solving novel tasks in a zero-shot setting. Project page: https://robot-ma.github.io/.
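Motivation and Goals
- Enable scalable, diverse robot demonstration data without privileged state information or hand-designed skills.
- Use vision-language models (VLMs) to plan, generate actions, and verify sub-goals in real-world environments.
- Support error recovery and multi-view reasoning to raise success rates and data quality.
- Demonstrate zero-shot task completion both in the real world and in RLBench simulation.
- Show that data from Manipulate-Anything can train robust policies whose performance matches or exceeds that of policies trained on human demonstrations.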
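Proposed Method
- Feed the scene and the language instruction to a vision-language model to identify objects and sub-goals.
- Use the VLM to decompose the task into sub-goals, each with a verification condition.
- Use in-context learning to generate the action for each sub-goal, either as a 6-DoF end-effector pose or as code for a new skill.
- Render the scene from multiple viewpoints to ground action generation and improve reasoning.
- Use a VLM-based verifier to confirm that each sub-goal has been completed, and re-plan when necessary.
- Train a PerAct behavior cloning model on the generated demonstrations and evaluate it against human data.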
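The decompose, act, verify, and re-plan steps listed above can be sketched as a simple control loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `SubGoal`, `run_task`, and the four callables (`decompose`, `propose_action`, `execute`, `verify`) are hypothetical names standing in for the VLM planner, the action generator, the robot controller, and the VLM verifier. Retrying the same sub-goal a fixed number of times is also an assumption about how re-planning might work.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Hypothetical 6-DoF end-effector pose: (x, y, z, roll, pitch, yaw).
Pose = Sequence[float]


@dataclass
class SubGoal:
    description: str   # what to do, e.g. "grasp the lid"
    verification: str  # condition the verifier checks, e.g. "lid is in the gripper"


def run_task(
    instruction: str,
    decompose: Callable[[str], List[SubGoal]],        # stand-in for the VLM task planner
    propose_action: Callable[[SubGoal, int], Pose],   # stand-in for action generation
    execute: Callable[[Pose], None],                  # stand-in for the robot controller
    verify: Callable[[SubGoal], bool],                # stand-in for the VLM verifier
    max_retries: int = 3,
) -> Tuple[List[Pose], bool]:
    """Run each sub-goal in order, retrying until it verifies or retries run out.

    Returns every executed pose plus a flag saying whether all sub-goals verified.
    """
    trajectory: List[Pose] = []
    for sub_goal in decompose(instruction):
        for attempt in range(max_retries):
            pose = propose_action(sub_goal, attempt)
            execute(pose)
            trajectory.append(pose)
            if verify(sub_goal):
                break  # sub-goal confirmed, move on to the next one
        else:
            # No attempt verified: report failure with the poses executed so far.
            return trajectory, False
    return trajectory, True
```

In a data-generation setting, only runs that return `True` would be kept as demonstrations; filtering on the verifier's verdict is one plausible way to use the flag.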
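Experimental Results
Research Questions
- RQ1: Can Manipulate-Anything solve diverse real-world tasks zero-shot, without privileged information?
- RQ2: Can demonstrations generated by Manipulate-Anything train robust behavior cloning policies that match or exceed those trained on human demonstrations?
- RQ3: Does multi-view reasoning improve manipulation success and generalization?
- RQ4: How does Manipulate-Anything compare with VoxPoser and CAP on zero-shot and real-world tasks?
Key Findings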
| Method | Put block | Play Jenga | Open jar | Close box | Open box | Pick up cup | Take umbrella | Sort mustard | Open wine bottle | Lamp on | Put knife | Pick and lift |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| VoxPoser | 70.7 ± 2.31 | 0.00 ± 0.00 | 0.00 ± 0.00 | 0.00 ± 0.00 | 0.00 ± 0.00 | 26.7 ± 14.00 | 33.33 ± 8.33 | 96.0 ± 6.93 | 8.00 ± 4.00 | 57.3 ± 12.22 | 92.00 ± 4.00 | 96.00 ± 0.00 |
| CAP | 84.00 ± 16.00 | 0.00 ± 0.00 | 0.00 ± 0.00 | 0.00 ± 0.00 | 0.00 ± 0.00 | 14.67 ± 4.62 | 4.00 ± 4.00 | 0.00 ± 0.00 | 0.00 ± 0.00 | 64.00 ± 6.93 | 14.67 ± 8.33 | 100.00 ± 0.00 |
| MA (Ours) | 96.00 ± 4.00 | 77.33 ± 6.11 | 80.00 ± 4.00 | 33.33 ± 12.86 | 29.00 ± 10.07 | 82.67 ± 14.04 | 61.33 ± 20.13 | 64.00 ± 6.93 | 42.00 ± 4.00 | 69.33 ± 6.11 | 52.00 ± 10.58 | 84.00 ± 6.93 |
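The per-task comparison in the table can be tallied with a few lines of code. The sketch below copies the mean values from the table and drops the ± spreads. It assumes that the second group of three rows in the table belongs to the last six task columns, and the English task names are translations used here only for illustration.

```python
# Mean values copied from the table above (the ± spreads are omitted).
# Assumption: the second VoxPoser/CAP/MA row group maps to the last six tasks.
results = {
    "Put block":        {"VoxPoser": 70.7,  "CAP": 84.00,  "MA": 96.00},
    "Play Jenga":       {"VoxPoser": 0.00,  "CAP": 0.00,   "MA": 77.33},
    "Open jar":         {"VoxPoser": 0.00,  "CAP": 0.00,   "MA": 80.00},
    "Close box":        {"VoxPoser": 0.00,  "CAP": 0.00,   "MA": 33.33},
    "Open box":         {"VoxPoser": 0.00,  "CAP": 0.00,   "MA": 29.00},
    "Pick up cup":      {"VoxPoser": 26.7,  "CAP": 14.67,  "MA": 82.67},
    "Take umbrella":    {"VoxPoser": 33.33, "CAP": 4.00,   "MA": 61.33},
    "Sort mustard":     {"VoxPoser": 96.0,  "CAP": 0.00,   "MA": 64.00},
    "Open wine bottle": {"VoxPoser": 8.00,  "CAP": 0.00,   "MA": 42.00},
    "Lamp on":          {"VoxPoser": 57.3,  "CAP": 64.00,  "MA": 69.33},
    "Put knife":        {"VoxPoser": 92.00, "CAP": 14.67,  "MA": 52.00},
    "Pick and lift":    {"VoxPoser": 96.00, "CAP": 100.00, "MA": 84.00},
}


def tasks_where_better(results, method, baseline):
    """Return the tasks on which `method` has a strictly higher mean than `baseline`."""
    return [task for task, row in results.items() if row[method] > row[baseline]]


ma_over_voxposer = tasks_where_better(results, "MA", "VoxPoser")
print(len(ma_over_voxposer), "of", len(results))  # prints "9 of 12"
```

Comparing means alone ignores the reported spreads, so close calls such as "Lamp on" should be read with that caveat.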
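- Achieves zero-shot task success on 5 real-world tasks and 12 RLBench simulation tasks, outperforming VoxPoser on 9 of the 12 simulation tasks.
- Demonstrations generated by Manipulate-Anything let behavior cloning policies match or exceed the performance obtained with human demonstrations on several tasks.
- Policies trained on MA data perform similarly to policies trained on human data, and the action distribution of MA data is often closer to that of human demonstrations (lower Chamfer Distance).
- In real-world experiments, policies trained on MA-generated data perform competitively with or better than the zero-shot and human-data baselines on most tasks.
- The method supports scalable data generation and is more robust than VoxPoser to variations in the language instruction.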
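To make the Chamfer Distance comparison mentioned above concrete, here is a minimal sketch of one common symmetric form: the mean nearest-neighbor distance from each set to the other, summed. Whether the paper uses this exact variant (for example squared distances or a different normalization) is not stated in this summary, so treat the formula as an assumption. The function name `chamfer_distance` and the toy points are illustrative only.

```python
import math
from typing import Sequence

Point = Sequence[float]


def chamfer_distance(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Symmetric Chamfer distance between two non-empty point sets.

    For every point in `a`, find the Euclidean distance to its nearest
    neighbor in `b` and average; do the same from `b` to `a`; add the two.
    """
    if not a or not b:
        raise ValueError("both point sets must be non-empty")
    a_to_b = sum(min(math.dist(p, q) for q in b) for p in a) / len(a)
    b_to_a = sum(min(math.dist(q, p) for p in a) for q in b) / len(b)
    return a_to_b + b_to_a


# Toy example: hypothetical end-effector positions from two demonstration sets.
human = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
generated = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
print(chamfer_distance(human, generated))  # prints 0.0 for identical sets
```

A lower value means the generated actions sit closer to the human ones. This brute-force version is quadratic in the number of points, which is fine for small sets, but a spatial index would be preferable for large ones.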
This walkthrough was generated by AI and reviewed by human editors.