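[Paper Walkthrough] VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models
VoxPoser grounds open-set language instructions in 3D perception. By composing 3D value maps with LLMs and VLMs, it synthesizes dense 6-DoF trajectories zero-shot through model-based planning and MPC.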
Large language models (LLMs) are shown to possess a wealth of actionable knowledge that can be extracted for robot manipulation in the form of reasoning and planning. Despite the progress, most still rely on pre-defined motion primitives to carry out the physical interactions with the environment, which remains a major bottleneck. In this work, we aim to synthesize robot trajectories, i.e., a dense sequence of 6-DoF end-effector waypoints, for a large variety of manipulation tasks given an open-set of instructions and an open-set of objects. We achieve this by first observing that LLMs excel at inferring affordances and constraints given a free-form language instruction. More importantly, by leveraging their code-writing capabilities, they can interact with a vision-language model (VLM) to compose 3D value maps to ground the knowledge into the observation space of the agent. The composed value maps are then used in a model-based planning framework to zero-shot synthesize closed-loop robot trajectories with robustness to dynamic perturbations. We further demonstrate how the proposed framework can benefit from online experiences by efficiently learning a dynamics model for scenes that involve contact-rich interactions. We present a large-scale study of the proposed method in both simulated and real-robot environments, showcasing the ability to perform a large variety of everyday manipulation tasks specified in free-form natural language. Videos and code at https://voxposer.github.io
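Research Motivation and Goals
- Use large language models to infer language-conditioned affordances and constraints for manipulation.
- Ground these affordances and constraints in the 3D observation space through maps grounded by a vision-language model.
- Synthesize dense robot trajectories in a zero-shot setting, without task-specific training.
- Demonstrate robustness to disturbances and study online dynamics learning to improve performance on contact-rich tasks.
Proposed Method
- LLMs generate Python code that queries perception APIs and manipulates 3D voxel maps.
- VLMs ground objects and object parts in RGB-D observations, producing spatial maps anchored in the observation.
- The 3D value maps (affordance, avoidance, and additional maps) are composed into a task cost through a voxel-based objective function.
- The optimization for each sub-task is solved in a model predictive control loop with zero-order trajectory sampling.
- Optionally, trajectories generated by VoxPoser serve as an exploration prior for refining a dynamics model online.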
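The composition and planning steps listed above can be illustrated with a small, self-contained sketch. It is not the authors' implementation: the grid resolution, the distance-based affordance cost, the spherical avoidance penalty, the weights, and every function name are assumptions chosen for illustration, and a simple random-shooting planner stands in for the zero-order trajectory sampling described in the paper. The sketch uses costs (lower is better) on a coarse voxel grid and plans positions only, not full 6-DoF poses.

```python
import math
import random

GRID = 10  # voxels per axis; hypothetical resolution chosen for illustration
MOVES = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def voxels():
    """All voxel indices of the cubic grid."""
    return [(x, y, z) for x in range(GRID) for y in range(GRID) for z in range(GRID)]


def dist(a, b):
    """Euclidean distance between two voxel indices."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def affordance_cost(target):
    """Assumed affordance map: cost grows with distance to the target voxel."""
    return {v: dist(v, target) for v in voxels()}


def avoidance_cost(obstacle, radius=2.0, penalty=10.0):
    """Assumed avoidance map: fixed penalty inside a sphere around the obstacle."""
    return {v: (penalty if dist(v, obstacle) <= radius else 0.0) for v in voxels()}


def compose(weighted_maps):
    """Weighted sum of voxel maps, giving a single task cost map."""
    return {v: sum(w * m[v] for w, m in weighted_maps) for v in voxels()}


def step(pos, move):
    """Apply a unit move, clamping the result to the grid bounds."""
    return tuple(min(GRID - 1, max(0, p + d)) for p, d in zip(pos, move))


def plan_step(pos, cost, rng, n_samples=64, horizon=3):
    """Random shooting: sample short move sequences, score each by its summed
    voxel cost, and return the first waypoint of the cheapest sequence."""
    best_total, best_next = float("inf"), pos
    for _sample in range(n_samples):
        cur, total, first = pos, 0.0, None
        for _t in range(horizon):
            cur = step(cur, rng.choice(MOVES))
            total += cost[cur]
            if first is None:
                first = cur
        if total < best_total:
            best_total, best_next = total, first
    return best_next


def rollout(start, target, cost, seed=0, max_steps=60):
    """Receding-horizon loop: replan at every step, execute one waypoint."""
    rng = random.Random(seed)
    path = [start]
    while path[-1] != target and len(path) <= max_steps:
        path.append(plan_step(path[-1], cost, rng))
    return path
```

A possible usage: build `cost = compose([(1.0, affordance_cost((8, 8, 5))), (1.0, avoidance_cost((5, 5, 5)))])` and call `rollout((1, 1, 5), (8, 8, 5), cost)`. Because the planner is sampling-based, the returned path depends on the seed and is not guaranteed to reach the target within `max_steps`.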
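Experimental Results
Research Questions
- RQ1: Can open-set instructions and objects be handled through zero-shot synthesis of perception-grounded 3D value maps?
- RQ2: How can LLMs efficiently reason about affordances and constraints and turn them into maps grounded in the 3D observation space?
- RQ3: How does grounding language in the 3D observation space affect planning robustness and generalization?
Key Findings
- VoxPoser achieves high success rates on real-world everyday manipulation tasks and is robust to disturbances.
- It generalizes to unseen instructions and attributes better than baselines that rely on primitives or learned cost maps.
- Zero-shot trajectories can serve as a prior that speeds up online dynamics learning for contact-rich tasks.
- Grounding costs directly in the 3D observation space, combined with real-time visual feedback, lets closed-loop MPC handle dynamic environments.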
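The last finding, that re-grounding the cost in fresh observations lets a closed loop cope with a changing scene, can be shown with a toy example. Everything here is an assumption made for illustration: `observe_target` is a hypothetical perception stub in which the object is pushed to a new location at step 4, the cost is a squared distance, and an exhaustive one-step greedy move replaces the sampling-based MPC so that the behavior is deterministic.

```python
MOVES = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def cost(voxel, target):
    """Cost grounded in the latest observation: squared distance to the target."""
    return sum((a - b) ** 2 for a, b in zip(voxel, target))


def greedy_step(pos, target):
    """One-step lookahead: stay or move to the neighbor with the lowest cost."""
    candidates = [pos] + [tuple(p + d for p, d in zip(pos, m)) for m in MOVES]
    return min(candidates, key=lambda v: cost(v, target))


def observe_target(t):
    """Hypothetical perception stub: the object is pushed away at step 4."""
    return (8, 2, 5) if t < 4 else (2, 8, 5)


def run(start, closed_loop, steps=30):
    """Execute `steps` control steps; closed-loop mode re-observes every step,
    open-loop mode keeps the target seen at step 0."""
    pos = start
    planned_target = observe_target(0)
    for t in range(steps):
        if closed_loop:
            planned_target = observe_target(t)  # re-ground the cost each step
        pos = greedy_step(pos, planned_target)
    return pos
```

With these assumptions, `run((2, 2, 5), closed_loop=True)` ends at the object's new location `(2, 8, 5)`, while `run((2, 2, 5), closed_loop=False)` ends at the stale location `(8, 2, 5)`.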
This walkthrough was generated by AI and reviewed by a human editor.