QUICK REVIEW

[Paper Review] MOKA: Open-World Robotic Manipulation through Mark-Based Visual Prompting

Fangchen Liu, Kuan Fang|arXiv (Cornell University)|Mar 5, 2024

Multimodal Machine Learning Applications | Cited by 5

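One-Sentence Summary

MOKA uses a vision-language model (VLM) with mark-based visual prompting to predict 2D keypoints and waypoints, turning tasks described in language into executable robot actions in open-world, out-of-the-box manipulation settings.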

ABSTRACT

Open-world generalization requires robotic systems to have a profound understanding of the physical world and the user command to solve diverse and complex tasks. While the recent advancement in vision-language models (VLMs) has offered unprecedented opportunities to solve open-world problems, how to leverage their capabilities to control robots remains a grand challenge. In this paper, we introduce Marking Open-world Keypoint Affordances (MOKA), an approach that employs VLMs to solve robotic manipulation tasks specified by free-form language instructions. Central to our approach is a compact point-based representation of affordance, which bridges the VLM's predictions on observed images and the robot's actions in the physical world. By prompting the pre-trained VLM, our approach utilizes the VLM's commonsense knowledge and concept understanding acquired from broad data sources to predict affordances and generate motions. To facilitate the VLM's reasoning in zero-shot and few-shot manners, we propose a visual prompting technique that annotates marks on images, converting affordance reasoning into a series of visual question-answering problems that are solvable by the VLM. We further explore methods to enhance performance with robot experiences collected by MOKA through in-context learning and policy distillation. We evaluate and analyze MOKA's performance on various table-top manipulation tasks including tool use, deformable body manipulation, and object rearrangement.

Motivation and Goals

Enable open-vocabulary robotic manipulation where tasks are described in free-form language.
Bridge visual predictions from VLMs to robot motions using a compact, point-based affordance representation.
Convert affordance reasoning into visual question-answering via mark-based prompting to support zero-shot and bootstrapped learning.
Demonstrate task coverage across tool use, deformable object handling, and object rearrangement with open-ended goals.

Proposed Method

Define a point-based affordance representation with keypoints (grasp, function, target) and manipulation waypoints.
Use hierarchical visual prompting to decompose language instructions into subtasks and generate affordance outputs.
Apply mark-based prompts (dots, grids, captions) on RGB images to turn continuous outputs into multiple-choice VLM responses.
Deproject 2D VLM outputs to 3D space using depth and camera parameters; generate SE(3) trajectories for grasping and manipulation.
Bootstrap performance via in-context learning (adding successful trajectories as examples) and policy distillation (training a student policy from MOKA rollouts).
Compare zero-shot and in-context variants against Code-as-Policies and VoxPoser baselines on open-vocabulary tabletop tasks.
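
The mark-to-motion step in the list above can be illustrated with a minimal sketch: a VLM picks a labeled grid tile (a multiple-choice answer), the label is mapped back to a pixel, and the pixel is deprojected into the camera frame with a standard pinhole model. Everything here is an assumption made for illustration and is not taken from the paper: the `Intrinsics` container, the function names `tile_center` and `deproject`, the letter-plus-number tile labels, and the intrinsics and depth values.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Intrinsics:
    """Pinhole camera intrinsics in pixels (hypothetical container)."""
    fx: float
    fy: float
    cx: float
    cy: float


def tile_center(label: str, width: int, height: int,
                rows: int, cols: int) -> tuple[float, float]:
    """Map a grid-tile label such as 'b3' to the pixel at the tile's center.

    Assumed labeling scheme: rows are lettered from the top ('a', 'b', ...)
    and columns are numbered from the left, starting at 1.
    """
    row = ord(label[0].lower()) - ord("a")
    col = int(label[1:]) - 1
    if not (0 <= row < rows and 0 <= col < cols):
        raise ValueError(f"label {label!r} is outside a {rows}x{cols} grid")
    u = (col + 0.5) * width / cols
    v = (row + 0.5) * height / rows
    return u, v


def deproject(u: float, v: float, depth_m: float,
              k: Intrinsics) -> tuple[float, float, float]:
    """Back-project pixel (u, v) with metric depth into the camera frame."""
    if depth_m <= 0:
        raise ValueError("depth must be positive")
    x = (u - k.cx) * depth_m / k.fx
    y = (v - k.cy) * depth_m / k.fy
    return x, y, depth_m


# Illustrative values only: a 640x480 image with a 4x5 grid overlay,
# a VLM answer of "b3", and a depth reading of 0.8 m at that pixel.
k = Intrinsics(fx=600.0, fy=600.0, cx=320.0, cy=240.0)
u, v = tile_center("b3", width=640, height=480, rows=4, cols=5)
point = deproject(u, v, depth_m=0.8, k=k)
print(f"pixel=({u:.1f}, {v:.1f}) point={point}")
```

Restricting the VLM to a discrete set of marks keeps its answer easy to parse and verify, at the cost of spatial resolution bounded by the tile size. A real system would additionally transform the camera-frame point into the robot base frame before planning a trajectory.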

Experimental Results

Research Questions

RQ1: Can MOKA perform affordance and motion reasoning on 2D images to solve open-vocabulary manipulation tasks?
RQ2: How well does translating VLM outputs into low-level motions perform on varied tasks and objects?
RQ3: Can MOKA improve through real-world interactions via in-context learning or policy distillation?

Key Findings

Wiping Subtask I	Wiping Subtask II	Watch Cleaning Subtask I	Watch Cleaning Subtask II	Gift Preparation Subtask I	Gift Preparation Subtask II	Laptop Packing Subtask I	Laptop Packing Subtask II
0.7	0.6	0.6	1.0	1.0	0.7	0.4	0.8
0.6	0.0	0.6	0.8	1.0	0.6	0.5	0.8
0.6	0.6	0.7	1.0	1.0	0.7	0.5	0.8
1.0	0.7	0.8	0.8	1.0	0.7	1.0	1.0
0.9	0.9	0.9	1.0	1.0	0.9	1.0	0.9

MOKA achieves state-of-the-art performance on four open-vocabulary manipulation tasks in a zero-shot setting and improves with in-context examples.
Zero-shot MOKA and VoxPoser have comparable results on many subtasks, with MOKA showing strengths in tool-use scenarios.
Bootstrapping with in-context examples or distilled policies further boosts success rates across subtasks.
Predicted keypoints and motions can be visually represented and executed as SE(3) trajectories in tabletop scenes.
The approach supports collecting successful trajectories as demonstrations for imitation learning or policy training (e.g., with Octo).
Failure analysis separates reasoning failures from execution failures, guiding future improvements in VLM-based affordance prediction and low-level control.

This review was generated by AI and checked by human editors.
