QUICK REVIEW

[Paper Review] RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

Anthony Brohan, Noah Brown|arXiv (Cornell University)|Jul 28, 2023

Multimodal Machine Learning Applications | Cited by 265

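One-Sentence Summary

RT-2 fine-tunes large vision-language models to output robot actions directly, giving end-to-end control that inherits web-scale vision-language pretraining and thereby improves generalization and enables emergent semantic reasoning.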

ABSTRACT

We study how vision-language models trained on Internet-scale data can be incorporated directly into end-to-end robotic control to boost generalization and enable emergent semantic reasoning. Our goal is to enable a single end-to-end trained model to both learn to map robot observations to actions and enjoy the benefits of large-scale pretraining on language and vision-language data from the web. To this end, we propose to co-fine-tune state-of-the-art vision-language models on both robotic trajectory data and Internet-scale vision-language tasks, such as visual question answering. In contrast to other approaches, we propose a simple, general recipe to achieve this goal: in order to fit both natural language responses and robotic actions into the same format, we express the actions as text tokens and incorporate them directly into the training set of the model in the same way as natural language tokens. We refer to such category of models as vision-language-action models (VLA) and instantiate an example of such a model, which we call RT-2. Our extensive evaluation (6k evaluation trials) shows that our approach leads to performant robotic policies and enables RT-2 to obtain a range of emergent capabilities from Internet-scale training. This includes significantly improved generalization to novel objects, the ability to interpret commands not present in the robot training data (such as placing an object onto a particular number or icon), and the ability to perform rudimentary reasoning in response to user commands (such as picking up the smallest or largest object, or the one closest to another object). We further show that incorporating chain of thought reasoning allows RT-2 to perform multi-stage semantic reasoning, for example figuring out which object to pick up for use as an improvised hammer (a rock), or which type of drink is best suited for someone who is tired (an energy drink).

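Research Motivation and Goals

Use web-scale vision-language pretraining to improve how well robot control generalizes.
Build a single end-to-end model that maps observations to actions while drawing on language-grounded semantics.
Investigate which emergent capabilities from web-scale training appear in robotic tasks.
Evaluate how co-fine-tuning on robot trajectories and web data affects performance and generalization.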

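Proposed Method

Represent robot actions as text tokens, and train the vision-language model to output action tokens alongside natural language.
Fine-tune pretrained PaLI-X and PaLM-E on a mixture of robot trajectories and web-scale vision-language tasks (such as VQA and captioning).
Co-fine-tune on robot data and web data together so the model adapts to robot control while retaining concepts learned from the web.
Discretize the 6-DoF action space into 256 bins per dimension and map each bin to a token in the model's vocabulary.
When prompted with a robot task, restrict decoding to valid action tokens so the output is always executable.
Serve the large models from a multi-TPU cloud service, which lets the 55B model run real-time inference at 1–3 Hz.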
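
The discretization and constrained-decoding steps listed above can be illustrated with a minimal sketch. It is not the authors' implementation: the normalized value range of [-1, 1], the space-separated text format, the helper names, and the toy logits are all assumptions made for illustration. Only the figure of 256 bins per dimension comes from the summary.

```python
from typing import List, Sequence

NUM_BINS = 256  # bins per action dimension, as stated in the method summary


def discretize(action: Sequence[float], low: float = -1.0, high: float = 1.0) -> List[int]:
    """Map each continuous value to a bin index in [0, NUM_BINS - 1].

    The [low, high] range is an assumed normalization, not a value from the paper.
    """
    bins = []
    for value in action:
        clipped = min(max(value, low), high)
        fraction = (clipped - low) / (high - low)
        # The upper edge (fraction == 1.0) would land in bin 256, so clamp it.
        bins.append(min(int(fraction * NUM_BINS), NUM_BINS - 1))
    return bins


def undiscretize(bins: Sequence[int], low: float = -1.0, high: float = 1.0) -> List[float]:
    """Map bin indices back to the centre of each bin."""
    width = (high - low) / NUM_BINS
    return [low + (index + 0.5) * width for index in bins]


def action_to_text(bins: Sequence[int]) -> str:
    """Render bin indices as a space-separated string (hypothetical format)."""
    return " ".join(str(index) for index in bins)


def constrained_argmax(logits: Sequence[float], valid_token_ids: Sequence[int]) -> int:
    """Greedy decoding step that only considers the valid action tokens."""
    return max(valid_token_ids, key=lambda token_id: logits[token_id])


if __name__ == "__main__":
    bins = discretize([-1.0, 0.0, 1.0])
    print(action_to_text(bins))
    print(undiscretize(bins))

    # Toy vocabulary of four tokens in which only ids 2 and 3 are action tokens.
    # Token 1 has the highest score overall but is excluded from the choice.
    print(constrained_argmax([0.1, 5.0, 0.3, 2.0], valid_token_ids=[2, 3]))
```

Mapping bins back to their centres bounds the round-trip error at half a bin width. Restricting the choice to valid token ids, rather than filtering the output afterwards, means every decoding step yields a token that can be converted back into a command.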

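Experimental Results

Research Questions

RQ1: Compared with baselines, how well do RT-2 models generalize to unseen objects, backgrounds, and environments?
RQ2: Which emergent capabilities from web-scale vision-language pretraining transfer to robot control?
RQ3: How do model size and training strategy (co-fine-tuning versus fine-tuning on robot data alone or training from scratch) affect generalization?
RQ4: Can chain-of-thought prompting improve RT-2's reasoning and task success rate in robot manipulation?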

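Key Findings

Both RT-2 variants (PaLI-X and PaLM-E) generalize better than RT-1 and MOO across objects, scenes, and instructions, improving by roughly 2x to 6x on multiple tests.
RT-2 shows emergent semantic reasoning, such as placing an object at a semantically indicated location and selecting objects based on their relations.
Chain-of-thought prompting enables multi-stage semantic reasoning and improves planning and execution.
Larger RT-2 models generally generalize better, and co-fine-tuning with web data generalizes better than fine-tuning on robot data alone.
In the Language-Table simulation, RT-2-PaLI-3B outperforms the baselines, indicating that the benefits of web-scale pretraining carry over to other, similar robot tasks.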


This review was generated by AI and reviewed by human editors.