QUICK REVIEW

[Paper Review] Scaling Proprioceptive-Visual Learning with Heterogeneous Pre-trained Transformers

Lirui Wang, Xinlei Chen|arXiv (Cornell University)|Sep 30, 2024

Image Retrieval and Classification Techniques | Cited by 5

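One-Sentence Summary

Heterogeneous Pre-trained Transformers (HPT) pre-train a shared policy trunk across different robot embodiments, using embodiment-specific stems and task-specific heads, so that the policy transfers to new embodiments and new tasks; performance and scalability improve at a scale of 52 datasets and over 1B parameters.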

ABSTRACT

One of the roadblocks for training generalist robotic models today is heterogeneity. Previous robot learning methods often collect data to train with one specific embodiment for one task, which is expensive and prone to overfitting. This work studies the problem of learning policy representations through heterogeneous pre-training on robot data across different embodiments and tasks at scale. We propose Heterogeneous Pre-trained Transformers (HPT), which pre-train a large, shareable trunk of a policy neural network to learn a task and embodiment agnostic shared representation. This general architecture aligns the specific proprioception and vision inputs from distinct embodiments to a short sequence of tokens and then processes such tokens to map to control robots for different tasks. Leveraging the recent large-scale multi-embodiment real-world robotic datasets as well as simulation, deployed robots, and human video datasets, we investigate pre-training policies across heterogeneity. We conduct experiments to investigate the scaling behaviors of training objectives, to the extent of 52 datasets. HPTs outperform several baselines and enhance the fine-tuned policy performance by over 20% on unseen tasks in multiple simulator benchmarks and real-world settings. See the project website (https://liruiw.github.io/hpt/) for code and videos.

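Motivation and Goals

Advance scalable, generalizable robot policy learning across diverse embodiments and tasks.
Propose a modular architecture (stem, trunk, head) that aligns proprioception and vision from different robots into a shared representation.
Show how data scale, model scale, and compute affect scaling behavior across heterogeneous real-robot, simulation, and human video datasets.
Demonstrate transfer to unseen embodiments, tasks, and real-world scenarios through supervised pre-training followed by fine-tuning.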

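Proposed Method

Introduce stems (a proprioception tokenizer and a vision tokenizer) that map heterogeneous inputs to a fixed number of tokens per modality (e.g., 16).
Use a shared transformer trunk to process the concatenated tokens into a joint latent representation.
Use task-specific heads to map the trunk output to actions for each embodiment-task pair.
Train with a behavior cloning objective: a Huber loss on normalized actions over K heterogeneous datasets, updating the stems/heads and the trunk jointly.
Pre-train on up to 52 datasets with over 1B parameters; transferring to a new embodiment then only requires re-initializing the stems/heads while keeping the trunk frozen.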
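
The stem, trunk, and head steps above can be sketched as a single forward pass in NumPy. This is a minimal illustration, not the authors' implementation. The cross-attention stem with learnable queries, the single-layer trunk, the mean-pooled linear head, the token width D = 32, and the two example embodiments (`arm`, `quadruped`) with their input sizes are all assumptions made for the example. Only the 16 tokens per modality, the shared trunk, and the Huber loss on normalized actions come from the summary. Weights are random and no gradient step is shown.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 32          # shared token width (assumed for illustration)
N_TOKENS = 16   # fixed number of tokens per modality, as in the summary


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class Stem:
    """Embodiment-specific tokenizer (hypothetical design): learnable queries
    cross-attend to projected inputs, so any (seq_len, in_dim) input
    becomes exactly N_TOKENS tokens of width D."""

    def __init__(self, in_dim):
        self.proj = rng.normal(0, 0.1, (in_dim, D))
        self.queries = rng.normal(0, 0.1, (N_TOKENS, D))

    def __call__(self, x):
        kv = x @ self.proj                                 # (seq_len, D)
        attn = softmax(self.queries @ kv.T / np.sqrt(D))   # (N_TOKENS, seq_len)
        return attn @ kv                                   # (N_TOKENS, D)


class Trunk:
    """Shared module: one residual self-attention layer standing in
    for the full transformer trunk."""

    def __init__(self):
        self.wq, self.wk, self.wv = (rng.normal(0, 0.1, (D, D)) for _ in range(3))

    def __call__(self, tokens):
        q, k, v = tokens @ self.wq, tokens @ self.wk, tokens @ self.wv
        return tokens + softmax(q @ k.T / np.sqrt(D)) @ v  # (T, D)


class Head:
    """Task-specific head (hypothetical): mean-pool tokens, then a linear map."""

    def __init__(self, action_dim):
        self.w = rng.normal(0, 0.1, (D, action_dim))

    def __call__(self, latent):
        return latent.mean(axis=0) @ self.w                # (action_dim,)


def huber(pred, target, delta=1.0):
    """Huber loss: quadratic for |error| <= delta, linear beyond it."""
    err = np.abs(pred - target)
    quad = np.minimum(err, delta)
    return float(np.mean(0.5 * quad**2 + delta * (err - quad)))


def policy(embodiment, trunk, proprio, vision):
    # Concatenate proprioception and vision tokens: (2 * N_TOKENS, D).
    tokens = np.concatenate(
        [embodiment["prop"](proprio), embodiment["vis"](vision)], axis=0
    )
    return embodiment["head"](trunk(tokens))


trunk = Trunk()  # one trunk shared by every embodiment

# Two made-up embodiments with different input and action sizes.
arm = {"prop": Stem(7), "vis": Stem(64), "head": Head(7)}
quadruped = {"prop": Stem(12), "vis": Stem(128), "head": Head(12)}

action_arm = policy(arm, trunk, rng.normal(size=(1, 7)), rng.normal(size=(49, 64)))
action_quad = policy(quadruped, trunk, rng.normal(size=(1, 12)), rng.normal(size=(25, 128)))

# Behavior-cloning loss against placeholder normalized expert actions.
loss = huber(action_arm, np.zeros(7)) + huber(action_quad, np.zeros(12))
```

Both embodiments pass through the same `trunk` object even though their inputs and action spaces differ, which is why, in this sketch, a new robot would only need a fresh set of stems and a fresh head.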

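Experimental Results

Research Questions

RQ1 How does heterogeneous pre-training scale with the amount and diversity of real-robot, simulation, and human video data?
RQ2 Can a single trunk learned from diverse embodiments transfer effectively to unseen embodiments and tasks with minimal adaptation?
RQ3 How do model size and batch size affect pre-training convergence and downstream transfer performance?
RQ4 How well do pre-trained HPT representations transfer to simulation benchmarks and real-world robot tasks?
RQ5 Does introducing heterogeneous data improve robustness and generalization across embodiments and environments?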

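Key Findings

HPT benefits from scale: larger models, more data, and more compute all help, and validation loss decreases as data and modalities are added.
Including more embodiments in pre-training improves the trunk's generalization and cross-task transfer.
Pre-training at up to 1B parameters (HPT-Huge) with large-batch training keeps improving until it plateaus; scaling depth versus width yields only small additional gains.
Pre-training on synthetic simulation data and internet human videos is feasible; these sources provide complementary embodiment data while preserving the transfer gains.
When transferring to simulation benchmarks, HPT improves task success rates over training from scratch or training without a trunk; fine-tuned HPT variants (e.g., HPT-XL) outperform the baselines.
In real-world tests, pre-trained policies are more robust and generalize better than the baselines across viewpoint configurations and object diversity.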

This review was generated by AI and reviewed by human editors.