QUICK REVIEW

[Paper Review] Scaling Cross-Embodied Learning: One Policy for Manipulation, Navigation, Locomotion and Aviation

Ria Doshi, Homer Walke|arXiv (Cornell University)|Aug 21, 2024

Geography and Education Methods | Cited by 5

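One-Line Summary

CrossFormer is a transformer-based policy trained on 900K trajectories spanning 20 robot embodiments. It controls diverse robots without manual alignment of observation or action spaces, matches specialist policies, and outperforms prior cross-embodiment methods.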

ABSTRACT

Modern machine learning systems rely on large datasets to attain broad generalization, and this often poses a challenge in robot learning, where each robotic platform and task might have only a small dataset. By training a single policy across many different kinds of robots, a robot learning method can leverage much broader and more diverse datasets, which in turn can lead to better generalization and robustness. However, training a single policy on multi-robot data is challenging because robots can have widely varying sensors, actuators, and control frequencies. We propose CrossFormer, a scalable and flexible transformer-based policy that can consume data from any embodiment. We train CrossFormer on the largest and most diverse dataset to date, 900K trajectories across 20 different robot embodiments. We demonstrate that the same network weights can control vastly different robots, including single and dual arm manipulation systems, wheeled robots, quadcopters, and quadrupeds. Unlike prior work, our model does not require manual alignment of the observation or action spaces. Extensive experiments in the real world show that our method matches the performance of specialist policies tailored for each embodiment, while also significantly outperforming the prior state of the art in cross-embodiment learning.

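Motivation and Objectives

Advance general-purpose robot policies by leveraging diverse, multi-embodiment data.
Develop a single-policy architecture that handles heterogeneous observations and actions from many robots.
Show that cross-embodiment training is on par with specialist policies for manipulation, navigation, locomotion, and aviation tasks.
Show that the method outperforms prior cross-embodiment approaches that require alignment of observation and action spaces.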

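Proposed Method

Introduce CrossFormer, a flexible transformer-based policy that handles varying observation and action spaces by treating inputs and outputs as sequences.
Tokenize observations from multiple modalities (images and proprioception) and assemble them into a single input sequence.
Accept task specifications as language instructions or goal images, using FiLM where applicable to fuse language with images.
Insert action readout tokens into the sequence and attach action-space-specific heads that produce actions of the correct dimension.
Use action chunking to improve temporal consistency and reduce compounding errors in high-frequency control tasks.
Train on a dataset of 900K trajectories spanning 20 robot embodiments; the model is a transformer with 12 layers, 8 attention heads, an embedding size of 512, and 130M parameters.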
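
The sequence layout described in the list above can be illustrated with a minimal, library-free sketch. It is not the authors' implementation: the transformer itself is omitted, and the head names, action dimensions, chunk lengths, and helper functions (`HeadSpec`, `build_sequence`, `readout_positions`, `reshape_chunk`) are hypothetical values chosen only to show how one flat token sequence with readout tokens could serve robots with different action spaces.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HeadSpec:
    action_dim: int  # size of one action vector
    chunk_len: int   # number of future actions predicted at once


# Hypothetical head specs: illustrative numbers, not taken from the review.
HEADS = {
    "single_arm": HeadSpec(action_dim=7, chunk_len=4),
    "bimanual": HeadSpec(action_dim=14, chunk_len=100),
    "navigation": HeadSpec(action_dim=2, chunk_len=4),
    "quadruped": HeadSpec(action_dim=12, chunk_len=1),
}


def build_sequence(task_tokens, timesteps, head_name):
    """Flatten task tokens and per-timestep observation tokens into one
    sequence, appending a readout token after each timestep."""
    seq = [("task", tok) for tok in task_tokens]
    for step, obs in enumerate(timesteps):
        for modality, tokens in obs.items():
            seq.extend((f"{modality}@t{step}", tok) for tok in tokens)
        seq.append((f"readout:{head_name}@t{step}", None))
    return seq


def readout_positions(seq):
    """Indices whose output embeddings would be passed to an action head."""
    return [i for i, (kind, _) in enumerate(seq) if kind.startswith("readout:")]


def reshape_chunk(flat_output, head_name):
    """Split a flat head output into chunk_len actions of action_dim each."""
    spec = HEADS[head_name]
    if len(flat_output) != spec.chunk_len * spec.action_dim:
        raise ValueError("head output size does not match the head spec")
    d = spec.action_dim
    return [flat_output[i * d:(i + 1) * d] for i in range(spec.chunk_len)]


# A manipulator with images, proprioception, and a language task (2 timesteps).
arm_seq = build_sequence(
    ["lang0", "lang1"],
    [{"image": ["p0", "p1", "p2"], "proprio": ["q"]}] * 2,
    "single_arm",
)
# A navigation robot with images only and a goal-image task (1 timestep).
nav_seq = build_sequence(["goal0"], [{"image": ["p0", "p1"]}], "navigation")

print(len(arm_seq), readout_positions(arm_seq))
print(len(nav_seq), readout_positions(nav_seq))
print(reshape_chunk(list(range(8)), "navigation"))
```

The two example sequences have different lengths and different modalities, yet both are plain token lists that the same network could consume. Only the head applied at the readout positions, and the shape of the chunk it emits, depends on the robot.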

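Experimental Results

Research Questions

RQ1: Can a single cross-embodiment policy trained on diverse robot data match policies trained only on target-embodiment data?
RQ2: In each setting, does the cross-embodiment approach match or exceed the best prior imitation learning methods?
RQ3: Is manual alignment of observation and action spaces necessary for strong cross-embodiment performance?
RQ4: How well does the policy generalize to robots and tasks not explicitly included in the training data (zero-shot or limited-shift settings)?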

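Key Findings

CrossFormer performs on par with specialist policies trained on target-embodiment data.
CrossFormer averages a 73% success rate across the evaluated embodiments, above the 67% average of the target-only baselines.
CrossFormer outperforms the best prior imitation learning methods in every setting where a comparison exists (average: 73% vs. 51%).
CrossFormer substantially outperforms prior methods that align observation and action spaces (Yang et al. 2023/2024) on both navigation and manipulation tasks.
The approach does not require manual alignment of observation and action spaces to achieve strong cross-embodiment performance.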
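
To put the averages quoted above side by side, the short calculation below computes the gaps in percentage points and as relative gains. It uses only the three averages stated in this review; the dictionary keys and function names are illustrative labels.

```python
# Average success rates quoted in the findings (percent).
rates = {
    "CrossFormer": 73,
    "target-only baseline": 67,
    "best prior imitation learning": 51,
}


def gap_points(a, b):
    """Absolute difference in percentage points."""
    return rates[a] - rates[b]


def relative_gain(a, b):
    """Relative improvement of a over b, as a fraction of b."""
    return rates[a] / rates[b] - 1


for other in ("target-only baseline", "best prior imitation learning"):
    points = gap_points("CrossFormer", other)
    gain = relative_gain("CrossFormer", other)
    print(f"CrossFormer vs {other}: +{points} points, {gain:.1%} relative")
```

These are averages over different sets of comparisons, so the gaps are a rough summary and not a per-task guarantee.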

This review was generated by AI and checked by a human editor.