QUICK REVIEW

[Paper Review] CLIPort: What and Where Pathways for Robotic Manipulation

Mohit Shridhar, Lucas Manuelli | arXiv (Cornell University) | Sep 24, 2021

Multimodal Machine Learning Applications | References: 65 | Citations: 99

One-line summary

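CLIPort introduces a two-stream, language-conditioned manipulation framework that fuses a semantic stream from CLIP with a spatial, Transporter-based stream to ground language in fine-grained actions, achieving strong few-shot and multi-task generalization in simulation and on a real robot.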

ABSTRACT

How can we imbue robots with the ability to manipulate objects precisely but also to reason about them in terms of abstract concepts? Recent works in manipulation have shown that end-to-end networks can learn dexterous skills that require precise spatial reasoning, but these methods often fail to generalize to new goals or quickly learn transferable concepts across tasks. In parallel, there has been great progress in learning generalizable semantic representations for vision and language by training on large-scale internet data, however these representations lack the spatial understanding necessary for fine-grained manipulation. To this end, we propose a framework that combines the best of both worlds: a two-stream architecture with semantic and spatial pathways for vision-based manipulation. Specifically, we present CLIPort, a language-conditioned imitation-learning agent that combines the broad semantic understanding (what) of CLIP [1] with the spatial precision (where) of Transporter [2]. Our end-to-end framework is capable of solving a variety of language-specified tabletop tasks from packing unseen objects to folding cloths, all without any explicit representations of object poses, instance segmentations, memory, symbolic states, or syntactic structures. Experiments in simulated and real-world settings show that our approach is data efficient in few-shot settings and generalizes effectively to seen and unseen semantic concepts. We even learn one multi-task policy for 10 simulated and 9 real-world tasks that is better or comparable to single-task policies.

Motivation and objectives

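Ground abstract semantic concepts (what) in the precise spatial actions (where) needed for manipulation.
Enable language-conditioned control that transfers concepts across tasks.
Achieve data-efficient learning from a small number of demonstrations and support multi-task learning.
Demonstrate transfer from simulation to real-world robotics with minimal data.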

Proposed method

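Adopts a two-stream architecture: a semantic stream conditioned on pretrained CLIP features and a spatial stream that processes RGB-D input.
Formulates pick and place affordance prediction as Q-functions, implemented with Transporter-style fully convolutional networks (FCNs).
Conditions the semantic stream on CLIP language encodings, tiling the language features and fusing them into the decoder layers.
Trains by imitation learning from demonstrations, using a cross-entropy loss over pixel-wise action maps.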
Uses a two-step action primitive parameterized by end-effector poses (pick, then place), predicted with translation-equivariant networks.
Randomizes tasks and attributes across demonstrations to extend the approach to multi-task learning and to generalization over unseen attributes.
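
The steps listed above can be made concrete with a small NumPy sketch. It is not the authors' implementation: the function names, the element-wise multiplication used for language conditioning, the additive merge of the two streams, and the channel-sum "head" are all assumptions chosen to keep the example short. What it does show is the general pattern: tile a language vector over the spatial grid, fuse it with feature maps, treat the resulting dense map as per-pixel action scores, train against the demonstrated pixel with a cross-entropy loss, and act by taking the highest-scoring pixel.

```python
import numpy as np

def tile_language(lang_vec, height, width):
    """Broadcast a (C,) language vector to a (C, H, W) map."""
    return np.broadcast_to(lang_vec[:, None, None],
                           (lang_vec.shape[0], height, width))

def fuse_streams(semantic, spatial, lang_vec):
    """Condition semantic features on language, then merge with spatial ones.

    Both the multiplicative conditioning and the additive merge are
    illustrative assumptions, not details taken from the review.
    """
    _, h, w = semantic.shape
    conditioned = semantic * tile_language(lang_vec, h, w)
    return conditioned + spatial

def pixel_log_softmax(logits):
    """Numerically stable log-softmax over all pixels of an (H, W) map."""
    flat = logits.reshape(-1)
    flat = flat - flat.max()
    return (flat - np.log(np.exp(flat).sum())).reshape(logits.shape)

def cross_entropy_at_pixel(logits, target_uv):
    """Cross-entropy against a one-hot label at the demonstrated pixel."""
    u, v = target_uv
    return -pixel_log_softmax(logits)[u, v]

def best_pixel(logits):
    """Return the (row, col) of the highest-scoring pixel."""
    return np.unravel_index(np.argmax(logits), logits.shape)

# Toy data standing in for real encoder outputs (hypothetical shapes).
rng = np.random.default_rng(0)
semantic = rng.normal(size=(4, 8, 8))   # semantic-stream features
spatial = rng.normal(size=(4, 8, 8))    # spatial-stream features
lang_vec = rng.normal(size=4)           # sentence embedding

fused = fuse_streams(semantic, spatial, lang_vec)
logits = fused.sum(axis=0)              # stand-in for a learned 1x1 conv head
loss = cross_entropy_at_pixel(logits, (2, 5))
pick = best_pixel(logits)
```

Here `loss` is the training signal for a single demonstrated pick location and `pick` is the pixel the policy would act on. A place step would follow the same pattern with its own score map.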

Experimental results

Research questions

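RQ1: How effective is the language-conditioned two-stream architecture for fine-grained manipulation compared with single-stream and baseline methods?
RQ2: Can a single multi-task model generalize across multiple language-conditioned tasks, including ones with unseen attributes?
RQ3: How well do semantic attributes (color, shape, object category) generalize in both seen and unseen settings?
RQ4: How well does transfer from simulation to real-world robotic manipulation work with limited data?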

Key findings

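The two-stream CLIPort outperforms Transporter-only and CLIP-only baselines and reaches high success rates even with few demonstrations (e.g., single-task CLIPort exceeds 90% with 100 demonstrations).
A multi-task CLIPort model trained on 10 tasks matches or exceeds single-task models on many tasks, indicating effective cross-task generalization.
On seen attributes, CLIPort (single) performs well. Grounding unseen attributes is harder, but explicit transfer in the multi-task setting (CLIPort multi-attr) substantially improves performance.
In real-robot experiments, a multi-task model trained on about 179 image-action pairs achieves meaningful success on 9 tasks, with performance of about 70% on simple tasks.
Performance drops overall on unseen attributes, but semantic transfer across tasks brings benefits (e.g., exposure to pink blocks helps solve tasks with unseen colors).
The framework is data efficient in few-shot settings and supports training a single policy for multiple tasks that is comparable to or better than single-task policies.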

This review was generated by AI and checked by a human editor.