QUICK REVIEW

[Paper Review] RDT2: Exploring the Scaling Limit of UMI Data Towards Zero-Shot Cross-Embodiment Generalization

Songming Liu, Binghui Li|arXiv (Cornell University)|Feb 3, 2026

Multimodal Machine Learning Applications | Cited by 0

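One-Sentence Summary

RDT2 is a 7B vision-language-action model trained on more than 10,000 hours of UMI data to generalize zero-shot to unseen objects, scenes, instructions, and embodiments. It uses a three-stage training recipe that combines RVQ discretization, diffusion-based (flow-matching) action learning, and diffusion distillation for real-time inference.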

ABSTRACT

Vision-Language-Action (VLA) models hold promise for generalist robotics but currently struggle with data scarcity, architectural inefficiencies, and the inability to generalize across different hardware platforms. We introduce RDT2, a robotic foundation model built upon a 7B parameter VLM designed to enable zero-shot deployment on novel embodiments for open-vocabulary tasks. To achieve this, we collected one of the largest open-source robotic datasets--over 10,000 hours of demonstrations in diverse families--using an enhanced, embodiment-agnostic Universal Manipulation Interface (UMI). Our approach employs a novel three-stage training recipe that aligns discrete linguistic knowledge with continuous control via Residual Vector Quantization (RVQ), flow-matching, and distillation for real-time inference. Consequently, RDT2 becomes one of the first models that simultaneously zero-shot generalizes to unseen objects, scenes, instructions, and even robotic platforms. Besides, it outperforms state-of-the-art baselines in dexterous, long-horizon, and dynamic downstream tasks like playing table tennis. See https://rdt-robotics.github.io/rdt2/ for more information.

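Motivation and Objectives

Address cross-embodiment generalization and deployment on challenging tasks for robotic vision-language-action models under data scarcity.
Enable zero-shot deployment on novel robots and open-vocabulary tasks.
Use large-scale, embodiment-agnostic data to improve generalization across objects, scenes, instructions, and embodiments.
Demonstrate fast, real-time inference suitable for dynamic robotic tasks.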

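Proposed Method

A three-stage training recipe that combines discrete action tokens with continuous action learning.
Stage 1: Discretize continuous actions with Residual Vector Quantization (RVQ) and pretrain the VLM with a cross-entropy loss.
Stage 2: Freeze the VLM and train a diffusion-based action expert with a flow-matching loss to generate continuous actions.
Stage 3: Distill the diffusion policy into a single-step generator for ultra-fast inference.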
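
To make Stage 1 concrete, here is a minimal NumPy sketch of residual vector quantization. It is not the RDT2 implementation. Each level picks the codeword nearest to whatever the previous levels left unexplained, so a continuous action vector becomes a short sequence of integer tokens. The function names `rvq_encode` and `rvq_decode`, the two tiny hand-written codebooks, and the 2-D action are illustrative assumptions; the source does not give codebook sizes, the number of levels, or how the codebooks are learned.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Quantize x level by level; return one codeword index per codebook."""
    residual = np.asarray(x, dtype=np.float64).copy()
    indices = []
    for cb in codebooks:  # each cb has shape (codebook_size, dim)
        dists = np.sum((cb - residual) ** 2, axis=1)
        idx = int(np.argmin(dists))      # nearest codeword to the current residual
        indices.append(idx)
        residual = residual - cb[idx]    # pass what is left to the next level
    return indices

def rvq_decode(indices, codebooks):
    """Reconstruct the vector by summing the selected codewords."""
    return sum(cb[i] for cb, i in zip(codebooks, indices))

# Hypothetical codebooks: a coarse grid followed by a finer grid.
codebooks = [
    np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]),
    np.array([[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [0.25, 0.25]]),
]
action = np.array([1.2, 0.3])            # illustrative continuous action
tokens = rvq_encode(action, codebooks)   # [1, 3]
recon = rvq_decode(tokens, codebooks)    # [1.25, 0.25]
```

The token list is what a cross-entropy objective can treat as classification targets. Adding levels shrinks the remaining residual at the cost of more tokens per action.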

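Experimental Results

Research Questions

RQ1: Can RDT2 generalize zero-shot to unseen embodiments, objects, scenes, and instructions without fine-tuning?
RQ2: How do data scale and model size affect RDT2's generalization (scaling laws)?
RQ3: How does RDT2 compare with state-of-the-art VLAs when fine-tuned on challenging dexterous, long-horizon, and dynamic tasks?
RQ4: How much does each training component (RVQ, diffusion, distillation) contribute to performance?
RQ5: How much does large-scale, embodiment-agnostic data affect cross-embodiment transfer?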
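
RQ2 is the kind of question usually answered by fitting a power law in log-log space. The sketch below shows that fitting step on synthetic numbers generated from a known power law. The data, the variable names, and the assumed form error = a * hours^(-b) are illustrative and are not taken from the paper.

```python
import numpy as np

# Synthetic data for illustration only, NOT results from the paper.
hours = np.array([100.0, 1_000.0, 10_000.0])
true_a, true_b = 2.0, 0.3
error = true_a * hours ** (-true_b)

# A power law is a straight line in log-log space:
# log(error) = log(a) - b * log(hours)
slope, intercept = np.polyfit(np.log(hours), np.log(error), deg=1)
fitted_b = -slope                     # recovered exponent
fitted_a = float(np.exp(intercept))   # recovered coefficient
```

On real measurements the points will not lie exactly on a line, and the quality of the linear fit in log-log space is what indicates whether a scaling law is identifiable.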

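Key Findings

RDT2 generalizes zero-shot to unseen objects, scenes, instructions, and embodiments on open-vocabulary tasks.
Scaling model size and data size together yields consistent performance gains that follow an identifiable scaling law.
RDT2 outperforms baselines such as π0-FAST and π0.5 on deformable-object manipulation, long-horizon tasks, and dynamic tasks such as table tennis.
Stage 2's diffusion-based action learning improves inference efficiency while preserving performance.
Stage 3's diffusion distillation enables ultra-fast one-step action generation suitable for real-time tasks.
Ablations confirm the effectiveness of AR+Diffusion training, RVQ discretization, and the distillation component.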
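
The Stage 2 and Stage 3 findings can be illustrated with a small flow-matching sketch. It assumes the common linear interpolation path between a noise sample and the target action, which the source does not confirm for RDT2, and it replaces the learned action expert with a hypothetical closed-form `oracle_velocity` for a single target action. The point is only to show the regression target and why a sampler that integrates the velocity field needs one network evaluation per step, which is the cost a distilled one-step generator removes.

```python
import numpy as np

def flow_matching_pair(action, noise, t):
    """Return the interpolated point x_t and the velocity regression target."""
    x_t = (1.0 - t) * noise + t * action  # assumed linear path: noise at t=0, action at t=1
    target_velocity = action - noise      # time derivative of that path
    return x_t, target_velocity

def flow_matching_loss(velocity_fn, action, noise, t):
    """Mean squared error between predicted and target velocity."""
    x_t, target = flow_matching_pair(action, noise, t)
    return float(np.mean((velocity_fn(x_t, t) - target) ** 2))

def euler_sample(velocity_fn, noise, num_steps):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-size Euler steps."""
    x = noise.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        x = x + dt * velocity_fn(x, k * dt)
    return x

# Illustrative values; a real action expert would be a trained network.
target_action = np.array([0.5, -0.2, 0.1])
noise = np.array([1.0, 2.0, -1.0])
calls = {"n": 0}

def oracle_velocity(x, t):
    """Hypothetical stand-in for the learned expert: exact field for one target."""
    calls["n"] += 1
    return (target_action - x) / (1.0 - t)

loss = flow_matching_loss(oracle_velocity, target_action, noise, t=0.3)
calls["n"] = 0
sample = euler_sample(oracle_velocity, noise, num_steps=10)
num_calls = calls["n"]  # one evaluation per Euler step
```

Here the ten-step sampler evaluates the velocity function ten times to reach the target action. A distilled single-step generator would produce an action from one evaluation, though the source summary does not describe the distillation objective itself.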

This review was generated by AI and reviewed by a human editor.