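[Paper Explained] MOKA: Open-World Robotic Manipulation through Mark-Based Visual Prompting
MOKA uses a vision-language model (VLM) with mark-based visual prompting to predict 2D keypoints and waypoints, turning tasks described in free-form language into executable robot actions in open-world manipulation settings.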
Open-world generalization requires robotic systems to have a profound understanding of the physical world and the user command to solve diverse and complex tasks. While the recent advancement in vision-language models (VLMs) has offered unprecedented opportunities to solve open-world problems, how to leverage their capabilities to control robots remains a grand challenge. In this paper, we introduce Marking Open-world Keypoint Affordances (MOKA), an approach that employs VLMs to solve robotic manipulation tasks specified by free-form language instructions. Central to our approach is a compact point-based representation of affordance, which bridges the VLM's predictions on observed images and the robot's actions in the physical world. By prompting the pre-trained VLM, our approach utilizes the VLM's commonsense knowledge and concept understanding acquired from broad data sources to predict affordances and generate motions. To facilitate the VLM's reasoning in zero-shot and few-shot manners, we propose a visual prompting technique that annotates marks on images, converting affordance reasoning into a series of visual question-answering problems that are solvable by the VLM. We further explore methods to enhance performance with robot experiences collected by MOKA through in-context learning and policy distillation. We evaluate and analyze MOKA's performance on various table-top manipulation tasks including tool use, deformable body manipulation, and object rearrangement.
Research Motivation and Goals
- Enable open-vocabulary robotic manipulation where tasks are described in free-form language.
- Bridge visual predictions from VLMs to robot motions using a compact, point-based affordance representation.
- Convert affordance reasoning into visual question-answering via mark-based prompting to support zero-shot and bootstrapped learning.
- Demonstrate task coverage across tool use, deformable object handling, and object rearrangement with open-ended goals.
Proposed Method
- Define a point-based affordance representation with keypoints (grasp, function, target) and manipulation waypoints.
- Use hierarchical visual prompting to decompose language instructions into subtasks and generate affordance outputs.
- Apply mark-based prompts (dots, grids, captions) on RGB images to turn continuous outputs into multiple-choice VLM responses.
- Deproject 2D VLM outputs to 3D space using depth and camera parameters; generate SE(3) trajectories for grasping and manipulation.
- Bootstrap performance via in-context learning (adding successful trajectories to the prompt as examples) and policy distillation (training a student policy on MOKA rollouts).
- Compare zero-shot and in-context variants against Code-as-Policies and VoxPoser baselines on open-vocabulary tabletop tasks.
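The mark-selection and deprojection steps in the list above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the grid labeling scheme (`a1`, `b2`, ...), the function names, the camera intrinsics, the depth value, and the camera-to-world transform are all hypothetical, and the VLM's answer is hard-coded where a real system would query the model. It uses the standard pinhole camera model to lift the selected pixel into 3D.

```python
import numpy as np

def grid_marks(width, height, rows, cols):
    """Return {label: (u, v)} for the tile centers of a rows x cols grid.

    Labels such as "a1" or "c2" are a hypothetical scheme: the letter is the
    column, the number is the row. The VLM would be asked to pick one label,
    which turns a continuous pixel prediction into a multiple-choice answer.
    """
    marks = {}
    for r in range(rows):
        for c in range(cols):
            label = f"{chr(ord('a') + c)}{r + 1}"
            u = (c + 0.5) * width / cols   # horizontal pixel coordinate
            v = (r + 0.5) * height / rows  # vertical pixel coordinate
            marks[label] = (u, v)
    return marks

def deproject(u, v, depth_m, fx, fy, cx, cy):
    """Pinhole deprojection of pixel (u, v) at metric depth to the camera frame."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return np.array([x, y, depth_m])

def to_world(p_cam, T_world_cam):
    """Apply a 4x4 homogeneous camera-to-world transform to a 3D point."""
    return (T_world_cam @ np.append(p_cam, 1.0))[:3]

# Assumed values for illustration only.
marks = grid_marks(width=640, height=480, rows=4, cols=5)
vlm_choice = "c2"                      # stand-in for the label the VLM returns
u, v = marks[vlm_choice]
p_cam = deproject(u, v, depth_m=0.9, fx=600.0, fy=600.0, cx=320.0, cy=240.0)

T_world_cam = np.eye(4)
T_world_cam[:3, 3] = [0.5, 0.0, 0.2]   # assumed camera offset, no rotation
p_world = to_world(p_cam, T_world_cam)
print(vlm_choice, (u, v), p_cam, p_world)
```

In practice the depth would be read from the depth image at the selected pixel, and the intrinsics and extrinsics would come from camera calibration. The resulting 3D points would then serve as the keypoints and waypoints from which a trajectory is built.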
Experimental Results
Research Questions
- RQ1: Can MOKA perform affordance and motion reasoning on 2D images to solve open-vocabulary manipulation tasks?
- RQ2: How well does translating VLM outputs into low-level motions perform on varied tasks and objects?
- RQ3: Can MOKA improve through real-world interactions via in-context learning or policy distillation?
Key Findings
| Wiping Subtask I | Wiping Subtask II | Watch Cleaning Subtask I | Watch Cleaning Subtask II | Gift Preparation Subtask I | Gift Preparation Subtask II | Laptop Packing Subtask I | Laptop Packing Subtask II |
|---|---|---|---|---|---|---|---|
| 0.7 | 0.6 | 0.6 | 1.0 | 1.0 | 0.7 | 0.4 | 0.8 |
| 0.6 | 0.0 | 0.6 | 0.8 | 1.0 | 0.6 | 0.5 | 0.8 |
| 0.6 | 0.6 | 0.7 | 1.0 | 1.0 | 0.7 | 0.5 | 0.8 |
| 1.0 | 0.7 | 0.8 | 0.8 | 1.0 | 0.7 | 1.0 | 1.0 |
| 0.9 | 0.9 | 0.9 | 1.0 | 1.0 | 0.9 | 1.0 | 0.9 |
- MOKA achieves state-of-the-art performance on four open-vocabulary manipulation tasks in a zero-shot setting and improves with in-context examples.
- Zero-shot MOKA and VoxPoser have comparable results on many subtasks, with MOKA showing strengths in tool-use scenarios.
- Bootstrapping with in-context examples or distilled policies further boosts success rates across subtasks.
- Predicted keypoints and motions can be visually represented and executed as SE(3) trajectories in tabletop scenes.
- The approach supports collecting successful trajectories as demonstrations for imitation learning or policy training (e.g., with Octo).
- Failure analysis separates reasoning failures from execution failures, guiding future improvements in VLM-based affordance prediction and low-level control.
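To compare the rows of the success-rate table in this section at a glance, the per-row mean over the eight subtasks can be computed as in the sketch below. The table does not label its rows, so they are referred to by position only. The variable names are illustrative.

```python
# Success rates copied from the table, one list per row, in table order.
success_rates = [
    [0.7, 0.6, 0.6, 1.0, 1.0, 0.7, 0.4, 0.8],
    [0.6, 0.0, 0.6, 0.8, 1.0, 0.6, 0.5, 0.8],
    [0.6, 0.6, 0.7, 1.0, 1.0, 0.7, 0.5, 0.8],
    [1.0, 0.7, 0.8, 0.8, 1.0, 0.7, 1.0, 1.0],
    [0.9, 0.9, 0.9, 1.0, 1.0, 0.9, 1.0, 0.9],
]

# Unweighted mean across the eight subtasks for each row.
row_means = [round(sum(row) / len(row), 4) for row in success_rates]

for index, mean in enumerate(row_means, start=1):
    print(f"row {index}: mean success rate = {mean}")
```

The means come out to 0.725, 0.6125, 0.7375, 0.875, and 0.9375 for rows 1 through 5, so the last two rows have the highest average success rate.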
This explainer was generated by AI and reviewed by a human editor.