QUICK REVIEW

[Paper Review] VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models

Wenlong Huang, Chen Wang | arXiv (Cornell University) | 2023-07-12

Multimodal Machine Learning Applications | Citations: 87

One-Line Summary

VoxPoser uses an LLM and a VLM to compose 3D value maps that ground open-set language instructions in 3D perception, enabling zero-shot synthesis of dense 6-DoF trajectories through model-based planning and MPC.

ABSTRACT

Large language models (LLMs) are shown to possess a wealth of actionable knowledge that can be extracted for robot manipulation in the form of reasoning and planning. Despite the progress, most still rely on pre-defined motion primitives to carry out the physical interactions with the environment, which remains a major bottleneck. In this work, we aim to synthesize robot trajectories, i.e., a dense sequence of 6-DoF end-effector waypoints, for a large variety of manipulation tasks given an open-set of instructions and an open-set of objects. We achieve this by first observing that LLMs excel at inferring affordances and constraints given a free-form language instruction. More importantly, by leveraging their code-writing capabilities, they can interact with a vision-language model (VLM) to compose 3D value maps to ground the knowledge into the observation space of the agent. The composed value maps are then used in a model-based planning framework to zero-shot synthesize closed-loop robot trajectories with robustness to dynamic perturbations. We further demonstrate how the proposed framework can benefit from online experiences by efficiently learning a dynamics model for scenes that involve contact-rich interactions. We present a large-scale study of the proposed method in both simulated and real-robot environments, showcasing the ability to perform a large variety of everyday manipulation tasks specified in free-form natural language. Videos and code at https://voxposer.github.io

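Motivation and Objectives

Use large language models to infer language-conditioned affordances and constraints for manipulation.
Ground these affordances in the 3D observation space using vision-language-based maps.
Synthesize dense robot trajectories zero-shot, without task-specific training.
Demonstrate robustness to perturbations and study online dynamics learning to improve performance on contact-rich tasks.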

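Proposed Method

The LLM generates Python code that queries perception APIs and manipulates 3D voxel maps.
The VLM grounds object parts in RGB-D observations, producing spatial maps anchored to the observation.
3D value maps (affordance, avoidance, and additional maps) are composed to shape the task cost through voxel-based objectives.
The optimization for each sub-task is solved inside a model predictive control loop using zeroth-order trajectory sampling.
Optionally, a dynamics model is learned online, using trajectories generated by VoxPoser as a prior for exploration.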
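
The composition and sampling steps above can be illustrated with a toy sketch. This is not the authors' implementation: the grid size, the voxel coordinates, the weight, and every function name (`distance_map`, `avoidance_map`, `compose`, `sample_step`, `plan`) are hypothetical, and the hard-coded target and obstacle stand in for positions a VLM would return. It assumes a cost convention (lower is better) and a greedy one-step lookahead in place of the paper's trajectory optimizer.

```python
import math
import random

Voxel = tuple[int, int, int]


def all_voxels(size: int) -> list[Voxel]:
    """Enumerate every voxel of a size x size x size grid."""
    return [(x, y, z) for x in range(size) for y in range(size) for z in range(size)]


def distance_map(size: int, target: Voxel) -> dict[Voxel, float]:
    """Affordance-style cost: Euclidean distance to the target, 0 at the target."""
    return {v: math.dist(v, target) for v in all_voxels(size)}


def avoidance_map(size: int, obstacle: Voxel, radius: float) -> dict[Voxel, float]:
    """Penalty that is 1 at the obstacle and falls linearly to 0 at `radius`."""
    return {v: max(0.0, 1.0 - math.dist(v, obstacle) / radius) for v in all_voxels(size)}


def compose(affordance: dict[Voxel, float], avoidance: dict[Voxel, float],
            w_avoid: float) -> dict[Voxel, float]:
    """Weighted sum of the two maps (hypothetical composition rule)."""
    return {v: affordance[v] + w_avoid * avoidance[v] for v in affordance}


def sample_step(cost: dict[Voxel, float], pos: Voxel, size: int,
                rng: random.Random, n_samples: int = 32) -> Voxel:
    """Zeroth-order step: sample random neighbouring voxels, keep the cheapest.

    No gradients are used; a candidate replaces the current best only if its
    cost is strictly lower.
    """
    best = pos
    for _ in range(n_samples):
        cand = tuple(min(size - 1, max(0, c + rng.choice((-1, 0, 1)))) for c in pos)
        if cost[cand] < cost[best]:
            best = cand
    return best


def plan(cost: dict[Voxel, float], start: Voxel, size: int,
         seed: int = 0, max_steps: int = 100) -> list[Voxel]:
    """Repeat sampled steps until no sampled neighbour improves the cost."""
    rng = random.Random(seed)
    path = [start]
    for _ in range(max_steps):
        nxt = sample_step(cost, path[-1], size, rng)
        if nxt == path[-1]:
            break
        path.append(nxt)
    return path


if __name__ == "__main__":
    size = 8
    cost = compose(distance_map(size, (7, 7, 7)),
                   avoidance_map(size, (4, 4, 3), radius=2.0),
                   w_avoid=10.0)
    print(plan(cost, (0, 0, 0), size))
```

In a closed-loop setting the maps would be recomputed from the latest observation before each step; the sketch keeps them static for brevity. Because it only accepts strictly cheaper voxels, it can stop in a local minimum, which a real planner would need to handle.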

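Experimental Results

Research Questions

RQ1: Can zero-shot synthesis of perception-grounded 3D value maps enable manipulation with open-set instructions and objects?
RQ2: How effectively can LLMs infer affordances and constraints and translate them into a form that can be grounded in 3D?
RQ3: How does grounding language in the 3D observation space affect the robustness and generalization of planning?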

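Key Findings

VoxPoser achieves high success rates on everyday real-world manipulation tasks and is robust to perturbations.
It generalizes to held-out (unseen) instructions and attributes better than baselines that rely on primitives or learned cost maps.
Zero-shot trajectories can serve as a prior that accelerates online dynamics learning for contact-rich tasks.
Grounding costs directly in the 3D observation space with real-time visual feedback enables closed-loop MPC that can handle dynamic environments.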

This review was generated by AI and checked by a human editor.