QUICK REVIEW

[Paper Review] VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models

Wenlong Huang, Chen Wang|arXiv (Cornell University)|Jul 12, 2023

Multimodal Machine Learning Applications|Cited by 87

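One-Sentence Summary

VoxPoser grounds open-set language instructions in 3D perception: it uses LLMs and VLMs to compose 3D value maps, then synthesizes dense 6-DoF trajectories zero-shot through model-based planning with MPC.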

ABSTRACT

Large language models (LLMs) are shown to possess a wealth of actionable knowledge that can be extracted for robot manipulation in the form of reasoning and planning. Despite the progress, most still rely on pre-defined motion primitives to carry out the physical interactions with the environment, which remains a major bottleneck. In this work, we aim to synthesize robot trajectories, i.e., a dense sequence of 6-DoF end-effector waypoints, for a large variety of manipulation tasks given an open-set of instructions and an open-set of objects. We achieve this by first observing that LLMs excel at inferring affordances and constraints given a free-form language instruction. More importantly, by leveraging their code-writing capabilities, they can interact with a vision-language model (VLM) to compose 3D value maps to ground the knowledge into the observation space of the agent. The composed value maps are then used in a model-based planning framework to zero-shot synthesize closed-loop robot trajectories with robustness to dynamic perturbations. We further demonstrate how the proposed framework can benefit from online experiences by efficiently learning a dynamics model for scenes that involve contact-rich interactions. We present a large-scale study of the proposed method in both simulated and real-robot environments, showcasing the ability to perform a large variety of everyday manipulation tasks specified in free-form natural language. Videos and code at https://voxposer.github.io

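Motivation and Objectives

Use large language models to infer language-conditioned affordances and constraints for manipulation.
Ground these affordances and constraints in the 3D observation space through maps produced with a vision-language model.
Synthesize dense robot trajectories in a zero-shot setting, with no task-specific training.
Demonstrate robustness to perturbations and study online dynamics learning to improve performance on contact-rich tasks.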

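Proposed Method

LLMs generate Python code that queries perception APIs and manipulates 3D voxel maps.
VLMs ground objects and object parts in RGB-D space to produce observation-grounded spatial maps.
3D value maps (affordance, avoidance, and additional maps) are composed into a task cost through a voxel-based objective function.
The optimization for each sub-task is solved in a model predictive control loop with zeroth-order trajectory sampling.
Optionally, trajectories generated by VoxPoser serve as an exploration prior for refining a dynamics model online.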
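
The composition and planning steps above can be illustrated with a small NumPy sketch. It is not VoxPoser's code: the grid size, the linear avoidance falloff, the weight `w_avoid`, and every function name here are assumptions chosen for illustration, and the maps are written by hand rather than by an LLM querying a perception API. It builds an affordance-style map (distance to a target voxel) and an avoidance-style map (penalty near an obstacle voxel), sums them into one cost volume, and then runs a random-shooting MPC loop that samples short voxel-space trajectories, keeps the cheapest one, executes only its first waypoint, and replans.

```python
import numpy as np

SHAPE = (20, 20, 20)  # assumed voxel grid resolution


def distance_map(target):
    """Affordance-style cost: Euclidean distance (in voxels) to `target`."""
    idx = np.indices(SHAPE).astype(float)  # shape (3, X, Y, Z)
    return np.linalg.norm(idx - np.reshape(target, (3, 1, 1, 1)), axis=0)


def avoidance_map(obstacle, radius):
    """Avoidance-style cost: 1 at `obstacle`, falling linearly to 0 at `radius`."""
    return np.clip(1.0 - distance_map(obstacle) / radius, 0.0, None)


def compose(affordance, avoidance, w_avoid=50.0):
    """Weighted sum of the value maps into a single task cost volume."""
    return affordance + w_avoid * avoidance


def plan(cost, start, rng, n_samples=512, horizon=5):
    """Zeroth-order planning: sample random voxel walks, return the cheapest."""
    steps = rng.integers(-1, 2, size=(n_samples, horizon, 3))  # each in {-1, 0, 1}
    trajs = np.asarray(start) + np.cumsum(steps, axis=1)
    trajs = np.clip(trajs, 0, np.array(cost.shape) - 1)  # keep waypoints in the grid
    totals = cost[trajs[..., 0], trajs[..., 1], trajs[..., 2]].sum(axis=1)
    best = int(np.argmin(totals))
    return trajs[best], float(totals[best])


def run_mpc(cost, start, n_steps=40, seed=0):
    """Receding horizon: execute only the first waypoint, then replan."""
    rng = np.random.default_rng(seed)
    pos = np.asarray(start)
    path = [tuple(int(v) for v in pos)]
    for _ in range(n_steps):
        best_traj, _ = plan(cost, pos, rng)
        pos = best_traj[0]
        path.append(tuple(int(v) for v in pos))
    return path


target, obstacle = (15, 15, 15), (8, 8, 8)
task_cost = compose(distance_map(target), avoidance_map(obstacle, radius=3.0))
path = run_mpc(task_cost, start=(2, 2, 2))
```

Because the cost lives in the same voxel grid as the observation, adding another constraint only means adding another map to the sum; the planner itself does not change.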

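Experimental Results

Research Questions

RQ1: Can open-set instructions and objects be handled through zero-shot synthesis of perception-grounded 3D value maps?
RQ2: How can LLMs efficiently infer affordances and constraints and translate them into grounding in the 3D observation space?
RQ3: How does grounding language in the 3D observation space affect planning robustness and generalization?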

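Key Findings

VoxPoser achieves high success rates on real-world everyday manipulation tasks and is strongly robust to perturbations.
It generalizes to unseen instructions and attributes better than baselines that rely on primitives or learned cost maps.
Zero-shot trajectories can serve as priors that accelerate online dynamics learning for contact-rich tasks.
Grounding costs directly in the 3D observation space, combined with real-time visual feedback, lets closed-loop MPC handle dynamic environments.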
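
The last finding can be made concrete with a toy closed-loop example: when the cost map is rebuilt from each new observation, a change in the scene shows up in the very next control step. The sketch below is a simplification under stated assumptions, not the paper's controller. It replaces sampled trajectories with an exhaustive one-step lookahead over the 27 neighboring voxels, uses only a distance-to-target cost, and `moving_target` is a hypothetical stand-in for a perception module.

```python
import itertools
import numpy as np

SHAPE = (10, 10, 10)  # assumed voxel grid size
MOVES = np.array(list(itertools.product((-1, 0, 1), repeat=3)))  # 27 one-voxel moves


def cost_map(target):
    """Distance-to-target cost volume, rebuilt from the latest observation."""
    idx = np.indices(SHAPE).astype(float)
    return np.linalg.norm(idx - np.reshape(target, (3, 1, 1, 1)), axis=0)


def closed_loop(observe_target, start, max_steps=50):
    """Re-ground the cost map every step, then move to the cheapest neighbor."""
    pos = np.array(start)
    path = [tuple(start)]
    for t in range(max_steps):
        target = observe_target(t)  # fresh observation each control step
        if tuple(pos) == tuple(target):
            break
        cost = cost_map(target)
        cand = np.clip(pos + MOVES, 0, np.array(SHAPE) - 1)
        pos = cand[np.argmin(cost[cand[:, 0], cand[:, 1], cand[:, 2]])]
        path.append(tuple(int(v) for v in pos))
    return path


def moving_target(t):
    """Hypothetical perception stand-in: the target jumps at step 3."""
    return (6, 4, 2) if t < 3 else (1, 8, 8)


path = closed_loop(moving_target, start=(0, 0, 0))
```

In this run the agent heads toward (6, 4, 2) for three steps, then turns toward the relocated target and stops at (1, 8, 8), without any special handling for the disturbance.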

This review was generated by AI and reviewed by a human editor.