QUICK REVIEW

[Paper Review] TiPToP: A Modular Open-Vocabulary Planning System for Robotic Manipulation

William Shen, Nishanth Kumar | arXiv (Cornell University) | Mar 10, 2026

Robot Manipulation and Learning | Cited by 0

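One-Sentence Summary

TiPToP combines pretrained vision-language foundation models with GPU-accelerated TAMP to plan open-vocabulary robot manipulation tasks directly from images and language. It requires no robot data and is released open-source for use across embodiments.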

ABSTRACT

We present TiPToP, an extensible modular system that combines pretrained vision foundation models with an existing Task and Motion Planner (TAMP) to solve multi-step manipulation tasks directly from input RGB images and natural-language instructions. Our system aims to be simple and easy-to-use: it can be installed and run on a standard DROID setup in under one hour and adapted to new embodiments with minimal effort. We evaluate TiPToP -- which requires zero robot data -- over 28 tabletop manipulation tasks in simulation and the real world and find it matches or outperforms $\pi_{0.5}$-DROID, a vision-language-action (VLA) model fine-tuned on 350 hours of embodiment-specific demonstrations. TiPToP's modular architecture enables us to analyze the system's failure modes at the component level. We analyze results from an evaluation of 173 trials and identify directions for improvement. We release TiPToP open-source to further research on modular manipulation systems and tighter integration between learning and planning. Project website and code: https://tiptop-robot.github.io

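Motivation and Goals

Advance a general-purpose, out-of-the-box manipulation system that works across embodiments without task-specific data collection.
Ground open-vocabulary language instructions to objects and relations in a 3D scene.
Reason jointly over discrete task structure and continuous motion planning so that both geometric and symbolic constraints are satisfied.
Deploy on a variety of robot embodiments with minimal setup and calibration.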

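Proposed Method

Three-module architecture: perception builds an object-centric 3D scene with a mesh and candidate grasps for each object; planning uses cuTAMP to search over plan skeletons and optimize continuous parameters; execution tracks the planned trajectory with an impedance controller.
Foundation-model perception: FoundationStereo for depth, M2T2 for 6-DoF grasps, SAM-2 for segmentation, and a Gemini VLM for object labeling and grounding of symbolic goals.
Plan grounding: cuTAMP enumerates PDDL-style skeletons, initializes particles over the continuous parameters, optimizes the particles with differentiable optimization to satisfy the constraints, and then generates trajectories with GPU-accelerated cuRobo.
Single-view execution: execution is open-loop, with no online replanning or visual feedback during execution.
Extensibility: the modular design makes it possible to add new predicates and new tasks (such as wiping with a new primitive) and to adapt to new embodiments with lightweight integration.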
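
To make the plan-grounding idea more concrete, here is a minimal sketch of particle-based constraint satisfaction by gradient descent. It is not cuTAMP's implementation or API. The toy problem (placing an object inside a rectangular goal region while keeping clear of a circular obstacle), the constants, and the names `cost_and_grad` and `optimize_particles` are all assumptions made up for illustration. The sketch also runs serially in plain Python, whereas the system described above is GPU-accelerated.

```python
import math
import random

# Hypothetical toy constraints, chosen only for illustration (not from the paper).
REGION = (0.0, 1.0, 0.0, 1.0)   # x_min, x_max, y_min, y_max of the goal region
OBSTACLE = (0.5, 0.5)           # centre of a circular obstacle
CLEARANCE = 0.3                 # minimum allowed distance to the obstacle centre


def cost_and_grad(x, y):
    """Return the sum of squared constraint violations at (x, y) and its gradient."""
    x_min, x_max, y_min, y_max = REGION
    cost = gx = gy = 0.0

    # "Inside the region" constraints: one hinge penalty per bound.
    for lo, hi, v, axis in ((x_min, x_max, x, 0), (y_min, y_max, y, 1)):
        below = max(lo - v, 0.0)
        above = max(v - hi, 0.0)
        cost += below ** 2 + above ** 2
        g = -2.0 * below + 2.0 * above
        if axis == 0:
            gx += g
        else:
            gy += g

    # "Clear of the obstacle" constraint: penalise being closer than CLEARANCE.
    dx, dy = x - OBSTACLE[0], y - OBSTACLE[1]
    dist = math.hypot(dx, dy)
    pen = max(CLEARANCE - dist, 0.0)
    if pen > 0.0:
        # Unit vector pointing away from the obstacle (arbitrary if at the centre).
        if dist < 1e-12:
            ux, uy = 1.0, 0.0
        else:
            ux, uy = dx / dist, dy / dist
        cost += pen ** 2
        gx -= 2.0 * pen * ux
        gy -= 2.0 * pen * uy
    return cost, gx, gy


def optimize_particles(num_particles=32, steps=100, lr=0.25, seed=0):
    """Run gradient descent on every particle; return (best_cost, (x, y))."""
    rng = random.Random(seed)
    # Each particle is one candidate value for the continuous parameters.
    particles = [(rng.uniform(-1.0, 2.0), rng.uniform(-1.0, 2.0))
                 for _ in range(num_particles)]
    for _ in range(steps):
        updated = []
        for x, y in particles:
            _, gx, gy = cost_and_grad(x, y)
            updated.append((x - lr * gx, y - lr * gy))
        particles = updated
    best = min(particles, key=lambda p: cost_and_grad(*p)[0])
    return cost_and_grad(*best)[0], best


if __name__ == "__main__":
    best_cost, best_xy = optimize_particles()
    print(f"best cost {best_cost:.2e} at {best_xy}")
```

Starting from many particles rather than one gives the optimizer several independent initial guesses, so a poor initialization of one candidate does not doom the whole search. In this toy each particle is updated in a Python loop; a batched tensor formulation would be the natural way to run the same updates in parallel.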

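Experimental Results

Research Questions

RQ1: Can TiPToP match or exceed state-of-the-art vision-language-action models on open-ended manipulation tasks?
RQ2: How do TiPToP's task success rate and speed compare with a VLA model fine-tuned on embodiment-specific demonstrations?
RQ3: What are the main failure modes of the modular planning approach, and how can they be mitigated?
RQ4: How well does TiPToP generalize across robot embodiments and tasks without robot-specific training?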

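Key Findings

Across 28 evaluation scenarios, TiPToP achieves success rates comparable to or better than π0.5-DROID, with an advantage on semantic and multi-step tasks.
Open-loop planning usually yields faster execution, typically by planning a single time-optimized trajectory and executing it directly.
TiPToP makes better use of grounding from a large VLM to identify task-relevant objects and relations, which improves performance on tasks with many distractors and complex semantics.
Common failure modes include grasp failures, scene-completion errors caused by convex-hull meshes, VLM detection errors, and cuTAMP planning failures; these point to targeted directions for improvement.
Modular deployment on UR5e and WidowX arms demonstrates generalization across embodiments at a moderate setup cost.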


This review was generated by AI and checked by a human editor.