QUICK REVIEW

[Paper Digest] RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation

Songming Liu, Lingxuan Wu|arXiv (Cornell University)|Oct 10, 2024

Mechanics and Biomechanics Studies | Cited by 5

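One-Sentence Summary

RDT-1B introduces the Robotics Diffusion Transformer (RDT), a 1.2B-parameter, diffusion-based, language-conditioned foundation model for bimanual manipulation. It is pre-trained on large-scale multi-robot data and fine-tuned on a multi-task bimanual dataset, achieving strong zero-shot and few-shot generalization on real robots.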

ABSTRACT

Bimanual manipulation is essential in robotics, yet developing foundation models is extremely challenging due to the inherent complexity of coordinating two robot arms (leading to multi-modal action distributions) and the scarcity of training data. In this paper, we present the Robotics Diffusion Transformer (RDT), a pioneering diffusion foundation model for bimanual manipulation. RDT builds on diffusion models to effectively represent multi-modality, with innovative designs of a scalable Transformer to deal with the heterogeneity of multi-modal inputs and to capture the nonlinearity and high frequency of robotic data. To address data scarcity, we further introduce a Physically Interpretable Unified Action Space, which can unify the action representations of various robots while preserving the physical meanings of original actions, facilitating learning transferrable physical knowledge. With these designs, we managed to pre-train RDT on the largest collection of multi-robot datasets to date and scaled it up to 1.2B parameters, which is the largest diffusion-based foundation model for robotic manipulation. We finally fine-tuned RDT on a self-created multi-task bimanual dataset with over 6K+ episodes to refine its manipulation capabilities. Experiments on real robots demonstrate that RDT significantly outperforms existing methods. It exhibits zero-shot generalization to unseen objects and scenes, understands and follows language instructions, learns new skills with just 1~5 demonstrations, and effectively handles complex, dexterous tasks. We refer to https://rdt-robotics.github.io/rdt-robotics/ for the code and videos.

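Motivation and Objectives

Address data scarcity in bimanual manipulation by pre-training on large-scale multi-robot data and fine-tuning on data from the target robot.
Develop a scalable diffusion architecture that models the multi-modality of bimanual actions and handles heterogeneous inputs from text, images, and proprioception.
Introduce a Physically Interpretable Unified Action Space that unifies action representations across robots while preserving their physical meaning.
Demonstrate strong generalization, including zero-shot and few-shot capabilities, language instruction following, and dexterous manipulation on a real dual-arm robot.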

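Proposed Method

Model actions as a conditional continuous distribution p(a_t | l, o_t), where l is the language instruction and o_t is the observation, and capture multi-modality through a denoising diffusion process.
Use a Diffusion Transformer (DiT) backbone with architectural modifications (an MLP decoder, QKNorm, RMSNorm, and alternating condition injection) to suit the characteristics of robot data.
Encode heterogeneous inputs: low-dimensional proprioception via an MLP with Fourier features, images via a vision encoder (SigLIP), and language via a pre-trained Transformer (T5-XXL). Apply input masking to prevent over-reliance on any single modality.
Train with a diffusion-based denoising objective on action chunks a_{t:t+T_a} to promote temporal consistency and reduce error accumulation.
Map heterogeneous robot action spaces into a single Physically Interpretable Unified Action Space, enabling multi-robot pre-training on 46 datasets (≈1M trajectories, 21 TB).
Pre-train RDT, scaled to 1.2B parameters, on large-scale multi-robot data, then fine-tune it on a self-collected multi-task dataset (>6K trajectories) for the target dual-arm robot.
Accelerate sampling with DPM-Solver++, reaching action-chunk inference at 6 Hz and high throughput on hardware.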
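
As a rough illustration of two of the ideas above (a unified action space and a denoising objective on action chunks), here is a minimal pure-Python sketch. It is not the authors' implementation. The slot names, the 16-dimensional layout, the `alpha_bar` value, the choice to compare against the clean chunk, and the masking of unfilled slots in the loss are all assumptions made for illustration.

```python
import math
import random

# Hypothetical slot layout for a small unified action vector. The real RDT
# layout and dimensionality are not given here; these names are assumptions.
SLOT_NAMES = (
    [f"right_joint_{i}" for i in range(7)] + ["right_gripper"]
    + [f"left_joint_{i}" for i in range(7)] + ["left_gripper"]
)
SLOT_INDEX = {name: i for i, name in enumerate(SLOT_NAMES)}
UNIFIED_DIM = len(SLOT_NAMES)


def to_unified(action):
    """Place named action values into fixed slots; unfilled slots stay 0."""
    vec = [0.0] * UNIFIED_DIM
    mask = [0] * UNIFIED_DIM
    for name, value in action.items():
        idx = SLOT_INDEX[name]  # raises KeyError for a name with no slot
        vec[idx] = float(value)
        mask[idx] = 1
    return vec, mask


def add_noise(chunk, alpha_bar, rng):
    """Forward diffusion: sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps."""
    scale_x = math.sqrt(alpha_bar)
    scale_eps = math.sqrt(1.0 - alpha_bar)
    return [[scale_x * x + scale_eps * rng.gauss(0.0, 1.0) for x in step]
            for step in chunk]


def masked_mse(pred, target, masks):
    """Mean squared error over the slots that the robot actually fills."""
    total, count = 0.0, 0
    for p_step, t_step, m_step in zip(pred, target, masks):
        for p, t, m in zip(p_step, t_step, m_step):
            if m:
                total += (p - t) ** 2
                count += 1
    return total / count if count else 0.0


rng = random.Random(0)
# A single-arm robot fills only the slots it physically has.
vec, mask = to_unified({"right_joint_0": 0.1, "right_gripper": 1.0})
chunk, masks = [vec, vec], [mask, mask]  # action chunk of length T_a = 2
noisy = add_noise(chunk, alpha_bar=0.5, rng=rng)
# A trained denoiser would map (noisy chunk, conditions) back to the clean
# chunk; the noisy chunk stands in here as a deliberately poor prediction.
loss = masked_mse(noisy, chunk, masks)
```

Because every robot writes into the same named slots, a value such as `right_gripper` keeps one physical meaning across datasets, which is the property the unified action space is meant to preserve.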

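Experimental Results

Research Questions

RQ1: Can RDT generalize zero-shot to unseen objects and scenes?
RQ2: How effective is RDT's zero-shot instruction following for unseen modalities?
RQ3: Can RDT learn previously unseen skills from only a few demonstrations?
RQ4: Can RDT complete tasks that require delicate, dexterous manipulation?
RQ5: Do model size, data size, and diffusion modeling all contribute to performance?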

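Key Findings

RDT achieves state-of-the-art performance on a suite of bimanual tasks and outperforms baselines by a large margin (e.g., a 56% improvement in success rate).
RDT exhibits zero-shot and few-shot (1 to 5 demonstrations) generalization to unseen objects, scenes, instructions, and skills.
Large model size, a large volume of pre-training data, and diffusion modeling all contribute to the superior performance.
RDT handles tasks that require delicate, dexterous manipulation on real robots and follows language instructions effectively.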


This digest was generated by AI and reviewed by human editors.