QUICK REVIEW

[Paper Review] VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models

Wenlong Huang, Chen Wang | arXiv (Cornell University) | Jul 12, 2023

Multimodal Machine Learning Applications | Cited by 87

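One-line summary

VoxPoser uses LLMs and VLMs to compose 3D value maps that ground open-set language instructions in 3D perception, enabling zero-shot synthesis of dense 6-DoF trajectories through model-based planning and MPC.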

ABSTRACT

Large language models (LLMs) are shown to possess a wealth of actionable knowledge that can be extracted for robot manipulation in the form of reasoning and planning. Despite the progress, most still rely on pre-defined motion primitives to carry out the physical interactions with the environment, which remains a major bottleneck. In this work, we aim to synthesize robot trajectories, i.e., a dense sequence of 6-DoF end-effector waypoints, for a large variety of manipulation tasks given an open-set of instructions and an open-set of objects. We achieve this by first observing that LLMs excel at inferring affordances and constraints given a free-form language instruction. More importantly, by leveraging their code-writing capabilities, they can interact with a vision-language model (VLM) to compose 3D value maps to ground the knowledge into the observation space of the agent. The composed value maps are then used in a model-based planning framework to zero-shot synthesize closed-loop robot trajectories with robustness to dynamic perturbations. We further demonstrate how the proposed framework can benefit from online experiences by efficiently learning a dynamics model for scenes that involve contact-rich interactions. We present a large-scale study of the proposed method in both simulated and real-robot environments, showcasing the ability to perform a large variety of everyday manipulation tasks specified in free-form natural language. Videos and code at https://voxposer.github.io

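Motivation and objectives

Use large language models to infer language-conditioned affordances and constraints for manipulation.
Ground these affordances in the 3D observation space using vision-language grounding maps.
Synthesize dense robot trajectories zero-shot, without task-specific training.
Demonstrate robustness to perturbations and explore online dynamics learning to improve performance on contact-rich tasks.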

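Proposed method

LLMs generate Python code that queries perception APIs and manipulates 3D voxel maps.
VLMs ground object parts in RGB-D space, producing spatial maps anchored to the observation.
3D value maps (affordance, avoidance, and additional maps) are combined to form the task cost through a voxel-based objective.
Each sub-task's optimization is solved in a model predictive control (MPC) loop using zeroth-order trajectory sampling.
Optionally, a dynamics model is refined online, using VoxPoser-generated trajectories as a prior for exploration.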
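
To make the value-map and sampling steps above concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the Gaussian map shape, the `avoid_weight` parameter, the function names, and the random-shooting planner are all assumptions chosen for illustration. It only shows how an affordance map and an avoidance map might be combined into one voxel cost, and how a zeroth-order sampler could pick a low-cost waypoint sequence from it.

```python
import numpy as np

def gaussian_map(shape, center, sigma):
    """Dense 3D map that peaks at `center` and decays with distance (illustrative)."""
    axes = [np.arange(s) for s in shape]
    grid = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1)
    dist_sq = np.sum((grid - np.asarray(center)) ** 2, axis=-1)
    return np.exp(-dist_sq / (2.0 * sigma ** 2))

def compose_cost(affordance, avoidance, avoid_weight=2.0):
    """Low cost near the target, high cost near obstacles.

    The linear weighting is a hypothetical choice, not taken from the paper.
    """
    return -affordance + avoid_weight * avoidance

def plan(cost, start, num_samples=256, horizon=5, seed=0):
    """Zeroth-order (random-shooting) planning over voxel waypoints.

    Samples random unit-step sequences, sums the voxel cost along each
    sequence, and returns the cheapest one. No gradients are used.
    """
    rng = np.random.default_rng(seed)
    steps = rng.integers(-1, 2, size=(num_samples, horizon, 3))  # each in {-1, 0, 1}
    positions = np.asarray(start) + np.cumsum(steps, axis=1)
    positions = np.clip(positions, 0, np.array(cost.shape) - 1)  # stay inside the grid
    costs = cost[positions[..., 0], positions[..., 1], positions[..., 2]].sum(axis=1)
    return positions[np.argmin(costs)]

# Example scene: a target at (15, 10, 10) and an obstacle at (8, 10, 10).
affordance = gaussian_map((20, 20, 20), center=(15, 10, 10), sigma=4.0)
avoidance = gaussian_map((20, 20, 20), center=(8, 10, 10), sigma=2.0)
cost = compose_cost(affordance, avoidance)
waypoints = plan(cost, start=(3, 10, 10))
print(waypoints)  # one row per planned voxel waypoint
```

In an MPC setting, only the first waypoint would typically be executed before planning again from the new state.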

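Experimental results

Research questions

RQ1: Can zero-shot composition of perception-grounded 3D value maps handle open-set instructions and objects in manipulation?
RQ2: How effectively can LLMs infer affordances and constraints and translate them into groundable 3D value maps?
RQ3: How does grounding language in the 3D observation space affect planning robustness and generalization?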

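Key findings

VoxPoser achieves high success rates on everyday real-world manipulation tasks and is robust to perturbations.
It generalizes to unseen instructions and attributes better than baselines that rely on primitives or learned cost maps.
Zero-shot trajectories serve as a prior that accelerates online dynamics learning for contact-rich tasks.
Grounding costs directly in the 3D observation space, with real-time visual feedback, enables closed-loop MPC that can cope with dynamic environments.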
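
The closed-loop behavior in the last finding can be illustrated with a toy replanning loop. This is a simplified stand-in, not the paper's controller: it uses a one-step lookahead rather than a full MPC horizon, a plain Euclidean-distance cost, and a hypothetical `observe_target` callback in place of real perception. The point is only that rebuilding the cost map from the latest observation at every step lets the agent follow a target that moves mid-execution.

```python
import itertools
import numpy as np

# 26-connected neighbourhood plus "stay in place".
OFFSETS = np.array(list(itertools.product((-1, 0, 1), repeat=3)))

def distance_cost(shape, target):
    """Cost map equal to the Euclidean distance from each voxel to `target`."""
    axes = [np.arange(s) for s in shape]
    grid = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1)
    return np.linalg.norm(grid - np.asarray(target), axis=-1)

def closed_loop_rollout(shape, start, observe_target, max_steps=50):
    """Re-observe the target each step, rebuild the cost map, move one voxel."""
    pos = np.asarray(start)
    path = [tuple(int(v) for v in pos)]
    for t in range(max_steps):
        target = tuple(observe_target(t))    # hypothetical perception callback
        cost = distance_cost(shape, target)  # cost grounded in the latest observation
        candidates = np.clip(pos + OFFSETS, 0, np.array(shape) - 1)
        scores = cost[candidates[:, 0], candidates[:, 1], candidates[:, 2]]
        pos = candidates[np.argmin(scores)]
        path.append(tuple(int(v) for v in pos))
        if path[-1] == target:
            break
    return path

def moving_target(t):
    """The target jumps to a new location at step 5, simulating a perturbation."""
    return (10, 2, 2) if t < 5 else (2, 10, 2)

path = closed_loop_rollout((12, 12, 12), start=(2, 2, 2), observe_target=moving_target)
print(path[-1])  # final voxel reached by the agent
```

An open-loop plan computed once at step 0 would keep heading for the original target; here the agent changes course as soon as the observation changes.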

This review was generated by AI and checked by a human editor.