QUICK REVIEW

[Paper Review] RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation

Songming Liu, Lingxuan Wu|arXiv (Cornell University)|Oct 10, 2024

Mechanics and Biomechanics Studies | Citations: 5

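One-Sentence Summary

RDT-1B presents the Robotics Diffusion Transformer (RDT), a 1.2B-parameter diffusion-based foundation model for language-conditioned bimanual manipulation. It is pre-trained on a large multi-robot data collection and fine-tuned on a multi-task bimanual dataset, and it shows strong zero-shot and few-shot generalization on real robots.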

ABSTRACT

Bimanual manipulation is essential in robotics, yet developing foundation models is extremely challenging due to the inherent complexity of coordinating two robot arms (leading to multi-modal action distributions) and the scarcity of training data. In this paper, we present the Robotics Diffusion Transformer (RDT), a pioneering diffusion foundation model for bimanual manipulation. RDT builds on diffusion models to effectively represent multi-modality, with innovative designs of a scalable Transformer to deal with the heterogeneity of multi-modal inputs and to capture the nonlinearity and high frequency of robotic data. To address data scarcity, we further introduce a Physically Interpretable Unified Action Space, which can unify the action representations of various robots while preserving the physical meanings of original actions, facilitating learning transferrable physical knowledge. With these designs, we managed to pre-train RDT on the largest collection of multi-robot datasets to date and scaled it up to 1.2B parameters, which is the largest diffusion-based foundation model for robotic manipulation. We finally fine-tuned RDT on a self-created multi-task bimanual dataset with over 6K+ episodes to refine its manipulation capabilities. Experiments on real robots demonstrate that RDT significantly outperforms existing methods. It exhibits zero-shot generalization to unseen objects and scenes, understands and follows language instructions, learns new skills with just 1~5 demonstrations, and effectively handles complex, dexterous tasks. We refer to https://rdt-robotics.github.io/rdt-robotics/ for the code and videos.

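Motivation and Objectives

Address the scarcity of bimanual manipulation data by pre-training on large-scale multi-robot data and fine-tuning on data from the target robot.
Develop a scalable diffusion-based architecture that models the multi-modality of bimanual actions and handles heterogeneous inputs from text, images, and proprioception.
Introduce a Physically Interpretable Unified Action Space that unifies action representations across robots while preserving their physical meaning.
Demonstrate zero-shot and few-shot capabilities, language instruction following, and strong generalization in dual-arm manipulation on real robots.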

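Proposed Method

Models the continuous conditional distribution p(a_t | l, o_t) with a diffusion denoising process, which captures the multi-modality of bimanual actions.
Uses a Diffusion Transformer (DiT) backbone with architectural adaptations suited to robot data: an MLP decoder, QKNorm, RMSNorm, and alternating injection of conditions.
Encodes the heterogeneous inputs by modality: low-dimensional inputs such as proprioception through MLPs with Fourier features, images through a vision encoder (SigLIP), and language through a pre-trained Transformer (T5-XXL). Input masking is applied to prevent over-reliance on any single modality.
Trains with a diffusion-based denoising objective over action chunks (a_{t:t+T_a}), which promotes temporal consistency and reduces error accumulation.
Unifies heterogeneous robot action spaces into the Physically Interpretable Unified Action Space, enabling multi-robot pre-training on 46 datasets (about 1M trajectories, 21 TB).
Pre-trains RDT, scaled up to 1.2B parameters, on large-scale multi-robot data, then fine-tunes it on a self-collected multi-task bimanual dataset of more than 6K episodes from the target dual-arm robot.
Accelerates sampling with DPM-Solver++, achieving action-chunk inference at 6 Hz and a high per-second action throughput on hardware.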
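
The unified action space and the chunk-level denoising objective listed above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the slot names, the six-dimensional layout, the helpers `to_unified`, `add_noise`, and `masked_mse`, and the choice to mask padded slots in the loss are all hypothetical, and the network is replaced by a stand-in. Predicting the clean chunk (rather than the noise) is assumed here as one common parameterization.

```python
import math
import random

# Hypothetical slot layout. The real layout and dimension are defined by the
# paper and its code release; these names and indices are illustrative only.
UNIFIED_SLOTS = {
    "right_arm_joint_0": 0,
    "right_arm_joint_1": 1,
    "right_gripper_open": 2,
    "left_arm_joint_0": 3,
    "left_arm_joint_1": 4,
    "left_gripper_open": 5,
}
UNIFIED_DIM = len(UNIFIED_SLOTS)


def to_unified(action):
    """Place a robot-specific action (name -> value) into the shared vector.

    Returns (vector, mask); mask[i] is 1.0 where the robot provides slot i.
    """
    vector = [0.0] * UNIFIED_DIM
    mask = [0.0] * UNIFIED_DIM
    for name, value in action.items():
        idx = UNIFIED_SLOTS[name]  # KeyError flags a quantity with no slot
        vector[idx] = float(value)
        mask[idx] = 1.0
    return vector, mask


def add_noise(chunk, alpha_bar, rng):
    """Forward diffusion: sqrt(alpha_bar) * a + sqrt(1 - alpha_bar) * eps."""
    scale_signal = math.sqrt(alpha_bar)
    scale_noise = math.sqrt(1.0 - alpha_bar)
    return [
        [scale_signal * a + scale_noise * rng.gauss(0.0, 1.0) for a in step]
        for step in chunk
    ]


def masked_mse(predicted, target, masks):
    """Mean squared error over the slots that are actually populated."""
    total, count = 0.0, 0.0
    for p_step, t_step, m_step in zip(predicted, target, masks):
        for p, t, m in zip(p_step, t_step, m_step):
            total += m * (p - t) ** 2
            count += m
    return total / max(count, 1.0)


# A single-arm robot fills only the right-arm slots; the rest stay padded.
single_arm_chunk = [
    {"right_arm_joint_0": 0.10, "right_arm_joint_1": -0.20, "right_gripper_open": 1.0},
    {"right_arm_joint_0": 0.15, "right_arm_joint_1": -0.25, "right_gripper_open": 0.0},
]
pairs = [to_unified(a) for a in single_arm_chunk]
clean = [v for v, _ in pairs]
masks = [m for _, m in pairs]

rng = random.Random(0)
noisy = add_noise(clean, alpha_bar=0.5, rng=rng)

# Stand-in for the network: a real model would predict the clean chunk from
# (noisy chunk, diffusion step, language, images, proprioception).
predicted = noisy
loss = masked_mse(predicted, clean, masks)
```

In this sketch, a robot that lacks an arm leaves those slots padded, so data from different embodiments share one vector format while each index keeps a fixed physical meaning.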

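Experimental Results

Research Questions

RQ1: Can RDT generalize zero-shot to unseen objects and scenes?
RQ2: How effective is RDT's zero-shot instruction-following capability for unseen modalities?
RQ3: Can RDT learn previously unseen skills in a few-shot setting?
RQ4: Can RDT complete tasks that require delicate, dexterous manipulation?
RQ5: Do model size, data scale, and diffusion modeling contribute to performance gains?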

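Key Findings

RDT achieves state-of-the-art performance and substantially outperforms the baselines across the bimanual tasks (for example, a 56% improvement in success rate).
RDT shows zero-shot and few-shot (1 to 5 demonstrations) generalization to unseen objects, scenes, instructions, and skills.
The combination of a large model size, extensive pre-training data, and diffusion modeling contributes to the superior performance.
RDT handles delicate, dexterous tasks on real robots and follows language instructions effectively.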

This review was generated by AI and checked by a human editor.