QUICK REVIEW

[Paper Review] Open-World Object Manipulation using Pre-trained Vision-Language Models

Austin V. Stone, Ted Xiao|arXiv (Cornell University)|Mar 2, 2023

Multimodal Machine Learning Applications | Cited by 24

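One-Sentence Summary

MOO uses an open-vocabulary vision-language model to ground language instructions in visual observations, allowing a robot to manipulate objects it has never seen before and to generalize across objects, backgrounds, and input modalities.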

ABSTRACT

For robots to follow instructions from people, they must be able to connect the rich semantic information in human vocabulary, e.g. "can you get me the pink stuffed whale?" to their sensory observations and actions. This brings up a notably difficult challenge for robots: while robot learning approaches allow robots to learn many different behaviors from first-hand experience, it is impractical for robots to have first-hand experiences that span all of this semantic information. We would like a robot's policy to be able to perceive and pick up the pink stuffed whale, even if it has never seen any data interacting with a stuffed whale before. Fortunately, static data on the internet has vast semantic information, and this information is captured in pre-trained vision-language models. In this paper, we study whether we can interface robot policies with these pre-trained models, with the aim of allowing robots to complete instructions involving object categories that the robot has never seen first-hand. We develop a simple approach, which we call Manipulation of Open-World Objects (MOO), which leverages a pre-trained vision-language model to extract object-identifying information from the language command and image, and conditions the robot policy on the current image, the instruction, and the extracted object information. In a variety of experiments on a real mobile manipulator, we find that MOO generalizes zero-shot to a wide range of novel object categories and environments. In addition, we show how MOO generalizes to other, non-language-based input modalities to specify the object of interest such as finger pointing, and how it can be further extended to enable open-world navigation and manipulation. The project's website and evaluation videos can be found at https://robot-moo.github.io/

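Research Motivation and Goals

Enable manipulation of novel object categories described in natural language by grounding the language in visual observations.
Use a frozen, pre-trained vision-language model to localize objects, and condition a learned policy on the object location and the instruction.
Demonstrate zero-shot generalization to unseen objects and environments using an open-vocabulary detector.
Demonstrate robustness to non-language input modalities (such as pointing and images) and integration with open-vocabulary navigation.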

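Proposed Method

Use a frozen vision-language model (OWL-ViT) to localize the object described in the instruction within the current image.
Represent the object information as a single-pixel location (the center of the predicted bounding box), or as a multi-channel object mask appended to the image input.
Condition the language-conditioned policy (RT-1 backbone) on the current image, the instruction, and the object localization mask.
Train end-to-end via imitation learning on demonstrations covering a set of 106 objects, while keeping the VLM frozen.
Evaluate on a real mobile manipulator over 1,472 trials spanning five skills (pick, move near, knock, place upright, place into).
Explore alternative input modalities for specifying the object (e.g., finger pointing, images, GUI masks), and integrate with open-vocabulary navigation (CoW) to form CoW-MOO.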
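
The single-pixel object representation described above can be illustrated with a minimal sketch. It assumes the detector has already returned one bounding box in pixel coordinates as (x_min, y_min, x_max, y_max); that box format, the function names, and the nested-list image layout are hypothetical choices made for illustration and are not taken from the paper or from any OWL-ViT API.

```python
from typing import List, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


def box_center_pixel(box: Box, height: int, width: int) -> Tuple[int, int]:
    """Return the (row, col) of the box center, clamped to the image bounds."""
    x_min, y_min, x_max, y_max = box
    col = int((x_min + x_max) / 2)
    row = int((y_min + y_max) / 2)
    row = min(max(row, 0), height - 1)
    col = min(max(col, 0), width - 1)
    return row, col


def single_pixel_mask(box: Box, height: int, width: int) -> List[List[float]]:
    """Build an H x W mask that is 1.0 at the box center and 0.0 elsewhere."""
    mask = [[0.0] * width for _ in range(height)]
    row, col = box_center_pixel(box, height, width)
    mask[row][col] = 1.0
    return mask


def append_mask_channel(image, mask):
    """Append the mask as an extra channel to an H x W x C nested-list image."""
    return [
        [pixel + [m] for pixel, m in zip(image_row, mask_row)]
        for image_row, mask_row in zip(image, mask)
    ]


# Usage: a blank 4 x 6 RGB image and one hypothetical detection.
image = [[[0.0, 0.0, 0.0] for _ in range(6)] for _ in range(4)]
box = (1.0, 1.0, 4.0, 3.0)
mask = single_pixel_mask(box, height=4, width=6)
conditioned = append_mask_channel(image, mask)
```

In this sketch the policy input would be `conditioned`, an image with one extra channel marking the target object. A multi-channel variant, for instructions that mention more than one object, could append one such mask per object.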

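Experimental Results

Research Questions

RQ1: Can MOO generalize manipulation policies to novel object categories described in language, without additional examples?
RQ2: How does localization through a pre-trained vision-language model affect robustness to backgrounds, textures, and new environments?
RQ3: Can non-language input modalities effectively specify the object to be localized?
RQ4: How do training data scale, object diversity, and model size affect generalization to unseen objects?
RQ5: Can open-world navigation be combined with open-world manipulation to complete end-to-end tasks?

Key Findings

Compared with RT-1 and VIMA-like baselines, MOO substantially improves generalization to unseen objects, especially on the "pick" skill.
MOO remains robust in new environments, on challenging textures, and with additional open-world objects, outperforming the baselines in these settings.
The object representation obtained from VLM localization can be driven by several input modalities (text prompts, captions, target images, or human-provided masks), each yielding successful localization.
Ablation studies show that generalization to unseen objects is sensitive to object diversity in the training data, and that larger models perform better, with the gains diminishing as scale is reduced.
When integrated with CoW's open-vocabulary navigation, MOO supports open-world navigation and manipulation in a single unified system.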


This review was generated by AI and reviewed by a human editor.