QUICK REVIEW

[Paper Digest] RT-1: Robotics Transformer for Real-World Control at Scale

Anthony Brohan, Noah Brown | arXiv (Cornell University) | Dec 13, 2022

Advanced Neural Network Applications | Cited by 38

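ONE-SENTENCE SUMMARY

RT-1 is a large-scale, language-conditioned Robotics Transformer trained on about 130k real-world demonstrations; it targets zero-shot and few-shot generalization across 700+ tasks and, in evaluations in real kitchen environments, shows strong robustness and long-horizon capability.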

ABSTRACT

By transferring knowledge from large, diverse, task-agnostic datasets, modern machine learning models can solve specific downstream tasks either zero-shot or with small task-specific datasets to a high level of performance. While this capability has been demonstrated in other fields such as computer vision, natural language processing or speech recognition, it remains to be shown in robotics, where the generalization capabilities of the models are particularly critical due to the difficulty of collecting real-world robotic data. We argue that one of the keys to the success of such general robotic models lies with open-ended task-agnostic training, combined with high-capacity architectures that can absorb all of the diverse, robotic data. In this paper, we present a model class, dubbed Robotics Transformer, that exhibits promising scalable model properties. We verify our conclusions in a study of different model classes and their ability to generalize as a function of the data size, model size, and data diversity based on a large-scale data collection on real robots performing real-world tasks. The project's website and videos can be found at robotics-transformer1.github.io

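MOTIVATION AND GOALS

Show that a large-scale, multi-task, language-conditioned robot model can generalize to new tasks, new objects, and new environments in the real world.
Demonstrate how data scale and diversity affect robot generalization.
Evaluate RT-1 against baselines and ablate its design choices to identify which components matter.
Explore the integration of heterogeneous data sources (simulation, different robots) and the execution of long-horizon tasks.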

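PROPOSED METHOD

Encode high-dimensional sensor inputs (images) and language instructions into compact tokens with a FiLM-conditioned EfficientNet, using Universal Sentence Encoder to embed the instruction.
Reduce the number of tokens with TokenLearner so that the Transformer-based policy can run in real time.
Map the image-language tokens to discretized action tokens with a decoder-only Transformer; the actions cover the arm, the base, and a mode variable (arm / base / terminate).
Discretize each dimension of the continuous action space into 256 bins and train with a cross-entropy loss under causal masking.
Train on a large multi-task dataset (about 130k demonstrations, about 700 instructions) collected from 13 robots over 17 months.
Evaluate performance on seen and unseen instructions, robustness to distractors and backgrounds, and long-horizon task sequences (up to about 50 steps in SayCan).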
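
The action discretization and cross-entropy objective listed above can be illustrated with a minimal sketch. It is not the authors' implementation: the uniform bin spacing, the bounds `low` and `high`, and the function names `discretize`, `undiscretize`, and `cross_entropy` are assumptions made for illustration. Only the figure of 256 bins per dimension comes from the summary.

```python
import math

NUM_BINS = 256  # from the summary: 256 bins per action dimension


def discretize(value, low, high, num_bins=NUM_BINS):
    """Map a continuous value to an integer bin index in [0, num_bins - 1].

    Assumes uniformly spaced bins over [low, high] (an illustrative choice).
    """
    clipped = min(max(value, low), high)
    fraction = (clipped - low) / (high - low)
    # Scale to [0, num_bins) and clamp so that value == high lands in the last bin.
    return min(int(fraction * num_bins), num_bins - 1)


def undiscretize(index, low, high, num_bins=NUM_BINS):
    """Map a bin index back to the centre of its bin."""
    width = (high - low) / num_bins
    return low + (index + 0.5) * width


def cross_entropy(logits, target_index):
    """Categorical cross-entropy for one action token, from unnormalised logits."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[target_index]


# Hypothetical bounds for a single action dimension, chosen only for illustration.
low, high = -1.0, 1.0
token = discretize(0.25, low, high)          # target token for this dimension
recovered = undiscretize(token, low, high)   # what the robot would execute
loss = cross_entropy([0.0] * NUM_BINS, token)  # loss of an uninformed (uniform) prediction
```

With uniform bins, the round trip through `discretize` and `undiscretize` loses at most half a bin width, which is the precision cost of treating control as token classification. A uniform prediction over 256 bins gives a loss of `log(256)`, a useful reference point for an untrained model.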

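EXPERIMENTAL RESULTS

Research Questions

RQ1: Can RT-1 learn to perform a large number of instructions and generalize to unseen tasks, objects, and environments?
RQ2: How do data scale, model size, and data diversity affect real-world robot generalization?
RQ3: Can heterogeneous data sources (simulation or different types of robots) improve performance and generalization?
RQ4: How well does RT-1 handle long-horizon task sequences in realistic settings?
RQ5: Which design choices in a large-scale robotics Transformer have the greatest impact on performance and generalization?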

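Key Findings

RT-1 achieves a 97% success rate on seen instructions (across about 200 tasks), 25 to 32 percentage points higher than BC-Z and Gato.
RT-1 reaches a 76% success rate on unseen instructions, 24 percentage points above the next-best baseline.
RT-1 is robust to distractors (83%) and backgrounds (59%), 36 and 18 percentage points above the baseline, respectively.
RT-1 supports long-horizon tasks of up to 50 stages in SayCan and shows strong generalization across tasks, objects, and environments in real kitchens.
Incorporating heterogeneous data (for example simulation or different robots) preserves performance on the original tasks and improves generalization to new scenarios.
In large-scale real-world evaluation (3,000+ trials), RT-1 outperforms the baselines on seen and unseen tasks, distractors, and backgrounds.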

This digest was generated by AI and reviewed by a human editor.