QUICK REVIEW

[Paper Review] ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation

Wenlong Huang, Chen Wang|arXiv (Cornell University)|Sep 3, 2024

Semantic Web and Ontologies · Citations: 8

ONE-LINE SUMMARY

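ReKep represents manipulation tasks as Relational Keypoint Constraints (ReKep), which ground relations among keypoints in 3D space. The constraints are generated automatically from language and RGB-D observations and solved in real time via hierarchical optimization, enabling multi-stage, in-the-wild robot manipulation.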

ABSTRACT

Representing robotic manipulation tasks as constraints that associate the robot and the environment is a promising way to encode desired robot behaviors. However, it remains unclear how to formulate the constraints such that they are 1) versatile to diverse tasks, 2) free of manual labeling, and 3) optimizable by off-the-shelf solvers to produce robot actions in real-time. In this work, we introduce Relational Keypoint Constraints (ReKep), a visually-grounded representation for constraints in robotic manipulation. Specifically, ReKep is expressed as Python functions mapping a set of 3D keypoints in the environment to a numerical cost. We demonstrate that by representing a manipulation task as a sequence of Relational Keypoint Constraints, we can employ a hierarchical optimization procedure to solve for robot actions (represented by a sequence of end-effector poses in SE(3)) with a perception-action loop at a real-time frequency. Furthermore, in order to circumvent the need for manual specification of ReKep for each new task, we devise an automated procedure that leverages large vision models and vision-language models to produce ReKep from free-form language instructions and RGB-D observations. We present system implementations on a wheeled single-arm platform and a stationary dual-arm platform that can perform a large variety of manipulation tasks, featuring multi-stage, in-the-wild, bimanual, and reactive behaviors, all without task-specific data or environment models. Website at https://rekep-robot.github.io/.

MOTIVATION AND GOALS

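Provide a versatile, scalable constraint-based representation for robot manipulation that does not rely on task-specific data or environment models.
Automate constraint specification from RGB-D input and natural-language instructions using large vision models (LVMs) and vision-language models (VLMs).
Achieve real-time hierarchical optimization that generates SE(3) end-effector trajectories within a perception-action loop.
Demonstrate multi-stage, in-the-wild, bimanual, and reactive manipulation on real robots without task-specific data.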

PROPOSED METHOD

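Relational Keypoint Constraints (ReKep) are defined as Python functions that map a set of 3D keypoints to a numerical cost, where f(k) ≤ 0 indicates that the constraint is satisfied.
A task is decomposed into stages, each with sub-goal constraints and path constraints, which enables hierarchical optimization over SE(3) end-effector poses.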
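The per-stage sub-goal and path problems are solved as constrained optimization with auxiliary costs (e.g., collision avoidance and reachability) using SciPy (Dual Annealing + SLSQP), requiring about 1 s for the initial warm start and re-planning at roughly 10 Hz thereafter.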
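Under a rigidity assumption, a forward keypoint model relates end-effector motion to keypoint displacement over a short time window (0.1 s), enabling high-frequency closed-loop control.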
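To automate ReKep generation from RGB-D observations and free-form language, DINOv2 is used to propose keypoints and GPT-4o writes the ReKep constraints as Python code, expressed as arithmetic relations among keypoints (distances, dot products, rotations).
Keypoint candidates are obtained from SAM masks and clustering, projected into world coordinates, and tracked at 20 Hz to provide real-time feedback.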
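To make "Python functions mapping keypoints to a cost, with f(k) ≤ 0 meaning satisfied" concrete, here is a minimal sketch of one sub-goal constraint (a distance relation) and one path constraint (a dot-product relation) for a hypothetical tea-pouring stage. The keypoint indices, the `(end_effector, keypoints)` signature, the function names, and the thresholds are all illustrative assumptions, not code from the paper.

```python
import numpy as np

# Hypothetical keypoint indices for a tea-pouring scene (illustrative only).
BASE, LID, SPOUT, CUP_RIM = 0, 1, 2, 3


def stage2_subgoal_constraint(end_effector, keypoints):
    """Distance relation: the spout should be within 2 cm of a point 10 cm above the cup rim.

    Returns a cost; a value <= 0 means the constraint is satisfied.
    """
    target = keypoints[CUP_RIM] + np.array([0.0, 0.0, 0.10])
    return np.linalg.norm(keypoints[SPOUT] - target) - 0.02


def stage2_path_constraint(end_effector, keypoints):
    """Dot-product relation: keep the teapot upright (base->lid axis within 25 degrees of vertical)."""
    axis = keypoints[LID] - keypoints[BASE]
    cos_tilt = np.dot(axis, [0.0, 0.0, 1.0]) / np.linalg.norm(axis)
    return np.cos(np.deg2rad(25.0)) - cos_tilt


def satisfied(constraint, end_effector, keypoints, tol=0.0):
    # A constraint holds when its cost is non-positive (up to an optional tolerance).
    return constraint(end_effector, keypoints) <= tol
```

The `end_effector` argument is unused here; it is kept only so that every constraint shares one hypothetical signature, which lets a solver call all of them uniformly.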
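The forward keypoint model under the rigidity assumption can be sketched as follows: keypoints attached to the grasped object keep fixed coordinates in the end-effector frame, so a candidate end-effector pose determines where they will be, while all other keypoints stay where they are. This is a minimal illustration under that assumption; `forward_keypoints`, `rot_z`, and the rotation-matrix/position pose representation are hypothetical choices, and the 0.1 s time window mentioned above is not modeled.

```python
import numpy as np


def rot_z(theta):
    # Rotation matrix about the world z-axis.
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def forward_keypoints(keypoints, attached, R_cur, p_cur, R_new, p_new):
    """Predict keypoint positions if the end-effector moves from (R_cur, p_cur) to (R_new, p_new).

    Rigidity assumption: keypoints listed in `attached` move with the gripper;
    all other keypoints are assumed static.
    """
    keypoints = np.asarray(keypoints, dtype=float)
    out = keypoints.copy()
    idx = np.asarray(attached, dtype=int)
    # World -> end-effector frame: R_cur^T (k - p_cur), written for row vectors.
    local = (keypoints[idx] - p_cur) @ R_cur
    # End-effector frame -> world at the new pose: R_new k_local + p_new.
    out[idx] = local @ R_new.T + p_new
    return out
```

Because the prediction is a closed-form function of the candidate pose, an optimizer can evaluate ReKep costs on predicted keypoints without re-observing the scene.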
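The solver pattern described above (a global Dual Annealing search followed by SLSQP refinement, with warm-started re-planning) can be sketched with SciPy's `dual_annealing` and `minimize(method="SLSQP")`. This is a heavily simplified sketch, not the paper's solver: it optimizes end-effector position only (the paper optimizes SE(3) poses), handles a single sub-goal constraint with no path constraints, and replaces collision and reachability terms with a simple motion penalty. All scene values and function names are hypothetical.

```python
import numpy as np
from scipy.optimize import dual_annealing, minimize

# Hypothetical scene in metres; none of these values come from the paper.
CUP_RIM = np.array([0.50, 0.10, 0.05])          # keypoint on the cup opening
SPOUT_OFFSET = np.array([0.00, 0.12, -0.03])    # spout relative to the gripper after grasping
CURRENT_EE = np.array([0.30, -0.20, 0.30])      # current end-effector position
BOUNDS = [(0.0, 0.8), (-0.5, 0.5), (0.0, 0.6)]  # assumed reachable workspace box


def spout_keypoint(ee_pos):
    # Rigidity assumption, position only: the grasped spout translates with the gripper.
    return ee_pos + SPOUT_OFFSET


def subgoal_constraint(ee_pos):
    # ReKep-style cost (<= 0 means satisfied): spout within 2 cm of a point 10 cm above the rim.
    target = CUP_RIM + np.array([0.0, 0.0, 0.10])
    return np.linalg.norm(spout_keypoint(ee_pos) - target) - 0.02


def auxiliary_cost(ee_pos):
    # Stand-in for auxiliary costs: prefer short motions from the current pose.
    return float(np.sum((ee_pos - CURRENT_EE) ** 2))


def penalized_cost(ee_pos, weight=100.0):
    # Exact penalty, used only for the global search, which takes no explicit constraints.
    return auxiliary_cost(ee_pos) + weight * max(subgoal_constraint(ee_pos), 0.0)


def solve_subgoal(warm_start=None):
    if warm_start is None:
        # Cold start: global search with dual annealing.
        x0 = dual_annealing(penalized_cost, bounds=BOUNDS, maxiter=200).x
    else:
        # Re-planning: reuse the previous solution and skip the global search.
        x0 = np.asarray(warm_start, dtype=float)
    # Local refinement with SLSQP; SciPy's "ineq" means fun(x) >= 0, so negate the ReKep cost.
    result = minimize(
        auxiliary_cost,
        x0,
        method="SLSQP",
        bounds=BOUNDS,
        constraints=[{"type": "ineq", "fun": lambda x: -subgoal_constraint(x)}],
    )
    return result.x


first = solve_subgoal()            # slower, cold-started solve
replanned = solve_subgoal(first)   # fast, warm-started solve
```

Only the cold start pays for the global search; subsequent solves start from the previous answer, which is one plausible reading of how a ~1 s initial solve can coexist with ~10 Hz re-planning.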
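The keypoint-proposal step (cluster per-pixel features inside an object mask, pick one representative pixel per cluster, and back-project it to world coordinates) might look roughly like this. It is an illustrative sketch only: the mask stands in for a SAM output, the feature array stands in for DINOv2 descriptors, clustering is a small hand-written k-means in feature space alone, and the pinhole intrinsics `K` and camera-to-world transform `T_world_cam` are assumed inputs. The paper's exact clustering and filtering details may differ.

```python
import numpy as np


def pixel_to_world(u, v, z, K, T_world_cam):
    # Pinhole back-projection: x = (u - cx) z / fx, y = (v - cy) z / fy, then camera -> world.
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    p_cam = np.array([(u - cx) * z / fx, (v - cy) * z / fy, z, 1.0])
    return (T_world_cam @ p_cam)[:3]


def kmeans(x, k, iters=20):
    # Deterministic farthest-point initialisation, then Lloyd iterations.
    centers = [x[0]]
    for _ in range(1, k):
        d = np.min([np.sum((x - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(x[np.argmax(d)])
    centers = np.array(centers, dtype=float)
    for _ in range(iters):
        labels = np.argmin(((x[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = x[labels == j].mean(axis=0)
    labels = np.argmin(((x[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
    return centers, labels


def propose_keypoints(features, mask, depth, K, T_world_cam, k=3):
    """Return up to k world-frame keypoints for one masked object."""
    vs, us = np.nonzero(mask)          # pixel coordinates inside the object mask
    feats = features[vs, us]           # (N, D) per-pixel descriptors
    centers, labels = kmeans(feats, k)
    keypoints = []
    for j in range(k):
        members = np.flatnonzero(labels == j)
        if members.size == 0:
            continue
        # Representative pixel: the member whose feature is closest to the cluster centre.
        best = members[np.argmin(((feats[members] - centers[j]) ** 2).sum(-1))]
        u, v = us[best], vs[best]
        keypoints.append(pixel_to_world(u, v, depth[v, u], K, T_world_cam))
    return np.array(keypoints)
```

Once proposed, such world-frame keypoints are what the ReKep functions consume and what the tracker updates at runtime.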

EXPERIMENTAL RESULTS

RESEARCH QUESTIONS

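RQ1: Can ReKep automatically formulate and synthesize manipulation behaviors from language and RGB-D input, without task-specific data?
RQ2: How well does it generalize to novel objects and manipulation strategies in the wild?
RQ3: What are the failure modes of each system module, and how does each contribute to overall performance?
RQ4: Can the approach handle multi-stage, bimanual, and reactive manipulation with real-time re-planning?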

KEY FINDINGS

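Success rate (successes out of 10 trials per task)
Task	VoxPoser	ReKep (auto-generated)	ReKep (annotated)
Pour Tea	0/10	3/10	8/10
Recycle Can	3/10	6/10	8/10
Stow Book	0/10	3/10	6/10
Tape Box	4/10	7/10	8/10
Fold Garment (bimanual)	0/10	5/10	6/10
Pack Shoes	0/10	3/10	5/10
Collaborative Folding	0/10	4/10	7/10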
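Because every task in the table was run 10 times, the per-task counts can be pooled into one overall success rate per method. The snippet below only does that arithmetic on the numbers above; the pooled rates are a derived summary, not figures stated in the review or the paper, and the method labels are the ones used in the table.

```python
# Successes out of 10 trials per task, copied from the table above:
# (VoxPoser, ReKep auto-generated, ReKep annotated)
RESULTS = {
    "Pour Tea": (0, 3, 8),
    "Recycle Can": (3, 6, 8),
    "Stow Book": (0, 3, 6),
    "Tape Box": (4, 7, 8),
    "Fold Garment (bimanual)": (0, 5, 6),
    "Pack Shoes": (0, 3, 5),
    "Collaborative Folding": (0, 4, 7),
}
METHODS = ("VoxPoser", "ReKep (auto-generated)", "ReKep (annotated)")
TRIALS_PER_TASK = 10


def overall_success_rate(results, method_index):
    # Pooled rate: total successes divided by total trials across all tasks.
    successes = sum(row[method_index] for row in results.values())
    return successes / (TRIALS_PER_TASK * len(results))


for i, name in enumerate(METHODS):
    print(f"{name}: {overall_success_rate(RESULTS, i):.1%}")
```

This prints 10.0% for VoxPoser, 44.3% for ReKep (auto-generated), and 68.6% for ReKep (annotated). Since every task has the same number of trials, the pooled rate equals the unweighted mean of the per-task rates.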

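On two real-world robot platforms, the framework achieves multi-stage, in-the-wild, bimanual, and reactive manipulation without task-specific data or environment models.
Automatic ReKep generation with large vision models and vision-language models enables open-world task specification from language and RGB-D observations, with constraints grounded in semantic keypoints.
Hierarchical optimization solves the per-stage sub-goal and path constraints, enabling real-time closed-loop control at about 10 Hz.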
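The approach is robust and agile under disturbances, although success rates vary by task and condition; the identifiable failure modes stem mainly from point tracking and from keypoint-proposal/VLM accuracy.
Ablations indicate that keypoint tracking and the keypoint-proposal/VLM modules are the main sources of failure, whereas the optimizer is relatively robust within its time budget.
A garment-folding study shows GPT-4o producing diverse, category-specific folding strategies for different kinds of garments, suggesting open-ended strategic behavior.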


This review was generated by AI and checked by a human editor.