QUICK REVIEW

[Paper Review] MOKA: Open-World Robotic Manipulation through Mark-Based Visual Prompting

Fangchen Liu, Kuan Fang|arXiv (Cornell University)|2024. 03. 05.

Multimodal Machine Learning Applications | Citations: 5

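One-Line Summary

MOKA uses a vision-language model with mark-based visual prompts to predict 2D keypoints and waypoints, converting tasks described in language into executable robot motions in an open-vocabulary manipulation setting.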

ABSTRACT

Open-world generalization requires robotic systems to have a profound understanding of the physical world and the user command to solve diverse and complex tasks. While the recent advancement in vision-language models (VLMs) has offered unprecedented opportunities to solve open-world problems, how to leverage their capabilities to control robots remains a grand challenge. In this paper, we introduce Marking Open-world Keypoint Affordances (MOKA), an approach that employs VLMs to solve robotic manipulation tasks specified by free-form language instructions. Central to our approach is a compact point-based representation of affordance, which bridges the VLM's predictions on observed images and the robot's actions in the physical world. By prompting the pre-trained VLM, our approach utilizes the VLM's commonsense knowledge and concept understanding acquired from broad data sources to predict affordances and generate motions. To facilitate the VLM's reasoning in zero-shot and few-shot manners, we propose a visual prompting technique that annotates marks on images, converting affordance reasoning into a series of visual question-answering problems that are solvable by the VLM. We further explore methods to enhance performance with robot experiences collected by MOKA through in-context learning and policy distillation. We evaluate and analyze MOKA's performance on various table-top manipulation tasks including tool use, deformable body manipulation, and object rearrangement.

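Research Motivation and Goals

Enable open-vocabulary robotic manipulation for tasks described in free-form language.
Connect the VLM's visual predictions to robot motion through a compact point-based affordance representation.
Use mark-based prompts to convert affordance reasoning into visual question answering, supporting zero-shot and bootstrapped learning.
Demonstrate tasks with open-ended goals that cover tool use, deformable object handling, and object rearrangement.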

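Proposed Method

Define a point-based affordance representation consisting of keypoints (grasp, function, target) and manipulation waypoints.
Use high-level visual prompting to decompose the language instruction into subtasks and generate affordance outputs.
Apply mark-based prompts (points, grids, captions) to RGB images to convert continuous outputs into multiple-choice VLM responses.
Deproject the 2D VLM outputs into 3D space using depth and camera parameters, then generate SE(3) trajectories for grasping and manipulation.
Bootstrap through in-context learning (adding examples of successful trajectories) and policy distillation (training a student policy on MOKA rollouts).
Evaluate the zero-shot and in-context variants against the Code as Policies and VoxPoser baselines on open-vocabulary tabletop tasks.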
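
The marking and deprojection steps listed above can be illustrated with a minimal sketch. It is not the authors' implementation: the `PointAffordance` container, the letter-plus-number grid labels (such as `b3`), the function names, and the camera intrinsics are all assumptions chosen for illustration. The sketch shows how a multiple-choice answer from the VLM (a grid tile label) could be turned back into a pixel, and how that pixel could be lifted to a 3D point in the camera frame with a standard pinhole model.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Pixel = Tuple[float, float]            # (u, v) image coordinates in pixels
Point3D = Tuple[float, float, float]   # (x, y, z) in the camera frame, metres


@dataclass
class PointAffordance:
    """Hypothetical container for the point-based affordance of one subtask."""
    grasp: Optional[Pixel] = None      # where the gripper holds the object
    function: Optional[Pixel] = None   # part of the tool that makes contact
    target: Optional[Pixel] = None     # where the contact should happen
    waypoints: List[Pixel] = field(default_factory=list)


def grid_mark_to_pixel(label: str, image_size: Tuple[int, int],
                       grid_shape: Tuple[int, int]) -> Pixel:
    """Map a chosen grid-tile label to the centre pixel of that tile.

    Assumed labelling: columns are lettered a, b, ... from the left and
    rows are numbered 1, 2, ... from the top, so 'b3' is column 1, row 2.
    """
    width, height = image_size
    cols, rows = grid_shape
    col = ord(label[0].lower()) - ord("a")
    row = int(label[1:]) - 1
    if not (0 <= col < cols and 0 <= row < rows):
        raise ValueError(f"label {label!r} is outside a {cols}x{rows} grid")
    u = (col + 0.5) * width / cols
    v = (row + 0.5) * height / rows
    return (u, v)


def deproject(pixel: Pixel, depth_m: float,
              intrinsics: Tuple[float, float, float, float]) -> Point3D:
    """Back-project a pixel with known depth using the pinhole camera model.

    intrinsics is (fx, fy, cx, cy) in pixels; depth_m is the depth reading
    at the pixel in metres.
    """
    u, v = pixel
    fx, fy, cx, cy = intrinsics
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)


# Usage with made-up values: the VLM picked tile 'b3' on a 5x5 grid
# drawn over a 640x480 image, and the depth sensor reads 0.5 m there.
waypoint_px = grid_mark_to_pixel("b3", (640, 480), (5, 5))
affordance = PointAffordance(waypoints=[waypoint_px])
waypoint_3d = deproject(waypoint_px, 0.5, (600.0, 600.0, 320.0, 240.0))
```

Restricting the answer to a tile label is what makes the query multiple-choice; the resolution of the recovered waypoint is then bounded by the tile size, so a finer grid trades answer reliability for spatial precision.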

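Experimental Results

Research Questions

RQ1: Can MOKA solve open-vocabulary manipulation tasks by reasoning about affordances and motion from 2D images?
RQ2: How well does the conversion of VLM outputs into low-level motion work across diverse tasks and objects?
RQ3: Can MOKA be improved through in-context learning or policy distillation from real-world interaction?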

Key Results

Wiping Subtask I	Wiping Subtask II	Watch Cleaning Subtask I	Watch Cleaning Subtask II	Gift Preparation Subtask I	Gift Preparation Subtask II	Laptop Packing Subtask I	Laptop Packing Subtask II
0.7	0.6	0.6	1.0	1.0	0.7	0.4	0.8
0.6	0.0	0.6	0.8	1.0	0.6	0.5	0.8
0.6	0.6	0.7	1.0	1.0	0.7	0.5	0.8
1.0	0.7	0.8	0.8	1.0	0.7	1.0	1.0
0.9	0.9	0.9	1.0	1.0	0.9	1.0	0.9
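
To compare the rows at a glance, the per-subtask values above can be averaged row by row. The snippet below is a small arithmetic sketch over the table exactly as it appears here; the table carries no row labels, so rows are referred to only by their position, and no mapping from rows to methods is assumed.

```python
# Values copied from the table above, one list per row, in table order.
# The source table has no row labels, so rows are identified by index only.
rows = [
    [0.7, 0.6, 0.6, 1.0, 1.0, 0.7, 0.4, 0.8],
    [0.6, 0.0, 0.6, 0.8, 1.0, 0.6, 0.5, 0.8],
    [0.6, 0.6, 0.7, 1.0, 1.0, 0.7, 0.5, 0.8],
    [1.0, 0.7, 0.8, 0.8, 1.0, 0.7, 1.0, 1.0],
    [0.9, 0.9, 0.9, 1.0, 1.0, 0.9, 1.0, 0.9],
]

# Mean over the eight subtasks for each row, rounded to four decimals.
row_means = [round(sum(values) / len(values), 4) for values in rows]

for index, mean in enumerate(row_means, start=1):
    print(f"row {index}: mean over 8 subtasks = {mean}")
```

The row means come out to 0.725, 0.6125, 0.7375, 0.875, and 0.9375, so the last two rows are the strongest on average.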

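MOKA achieves state-of-the-art performance on four open-vocabulary manipulation tasks in the zero-shot setting, and its performance improves further with in-context examples.
Zero-shot MOKA and VoxPoser show similar results on many subtasks, with MOKA showing an advantage in tool use.
Bootstrapping with in-context examples or a distilled policy further improves success rates across subtasks.
The predicted keypoints and motions can be visualized and executed as SE(3) trajectories in tabletop scenes.
Successful trajectories can be collected as demonstrations for imitation learning or policy training (e.g., Octo).
The error analysis separates reasoning failures from execution failures, pointing to future improvements in VLM-based affordance prediction and low-level control.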

This review was generated by AI and reviewed by a human editor.
