QUICK REVIEW

[Paper Review] Visual Prompt Guided Unified Pushing Policy

Hieu Bui, Ziyan Gao | arXiv (Cornell University) | Feb 22, 2026
Robotic Path Planning Algorithms | Cited by: 0
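One-Sentence Summary

Introduces a unified, prompt-guided pushing policy that uses flow matching and visual prompts to learn multimodal non-prehensile pushing skills (displacement, grouping, and separation), validated on a real robot and integrated with a vision-language model (VLM) planner.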

ABSTRACT

As one of the simplest non-prehensile manipulation skills, pushing has been widely studied as an effective means to rearrange objects. Existing approaches, however, typically rely on multi-step push plans composed of pre-defined pushing primitives with limited application scopes, which restrict their efficiency and versatility across different scenarios. In this work, we propose a unified pushing policy that incorporates a lightweight prompting mechanism into a flow matching policy to guide the generation of reactive, multimodal pushing actions. The visual prompt can be specified by a high-level planner, enabling the reuse of the pushing policy across a wide range of planning problems. Experimental results demonstrate that the proposed unified pushing policy not only outperforms existing baselines but also effectively serves as a low-level primitive within a VLM-guided planning framework to solve table-cleaning tasks efficiently.

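Motivation and Objectives

  • Provide a flexible, reusable pushing policy for multi-object scenes, rather than task-specific, hand-tuned primitives.
  • Learn a single goal-conditioned policy from demonstrations that can perform three tasks: displacement, grouping, and separation.
  • Guide pushing actions with visual prompts and a task specifier, making the policy compatible with high-level planning.
  • Demonstrate the policy's generalization to unseen objects and its role as a low-level primitive in a VLM-based planner.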

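Proposed Method

  • Proposes a conditional flow matching policy that models a time-dependent vector field, learned from demonstrations, to generate executable action chunks.
  • Uses a prompting mechanism in which visual prompts (u1, u2) placed on the tabletop image, together with a task specifier (displacement, grouping, or separation), define the goal g.
  • Trains with conditional flow matching on a dataset of expert demonstrations to learn p(A_t | O_t, g).
  • Applies classifier-free guidance at inference to steer generated actions toward trajectories consistent with the prompt.
  • Uses a shared ResNet34 backbone to extract visual features, a Transformer encoder to integrate temporal context, and a DiT-based vector field network with AdaLN conditioning.
  • Evaluates on a real ROBOTIS OpenManipulator-Y setup with 550 demonstrations covering the three task types, comparing against single-task and goal-image baselines.
  • Demonstrates integration with a vision-language model planner, solving table-cleaning tasks with the policy as a low-level primitive.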
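The training target and the guided sampling step from the method list can be sketched in a few lines of NumPy. This is a minimal sketch under assumptions the summary does not confirm: a linear interpolation path between Gaussian noise and expert actions, plain Euler integration, and the hypothetical names `cfm_training_pair`, `sample_actions_cfg`, `toy_velocity_fn`, `guidance_scale`, and `num_steps`. The toy field stands in for the DiT-based network and is not the authors' model.

```python
import numpy as np


def cfm_training_pair(expert_actions, rng):
    """Build one conditional flow matching regression target.

    Assumed path (not confirmed by the source):
        x_t = (1 - t) * noise + t * expert_actions
    whose time derivative, the regression target, is expert_actions - noise.
    """
    noise = rng.standard_normal(expert_actions.shape)
    t = rng.uniform()
    x_t = (1.0 - t) * noise + t * expert_actions
    target_velocity = expert_actions - noise
    return x_t, t, target_velocity


def sample_actions_cfg(velocity_fn, obs, goal, shape, rng,
                       guidance_scale=2.0, num_steps=10):
    """Euler-integrate a classifier-free-guided vector field from t=0 to t=1."""
    x = rng.standard_normal(shape)  # start from Gaussian noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v_cond = velocity_fn(x, t, obs, goal)    # conditioned on the prompt
        v_uncond = velocity_fn(x, t, obs, None)  # prompt dropped
        # Classifier-free guidance: extrapolate away from the unconditional field.
        v = v_uncond + guidance_scale * (v_cond - v_uncond)
        x = x + dt * v
    return x


def toy_velocity_fn(x, t, obs, goal):
    """Hypothetical stand-in for the learned network.

    Pushes every action toward `goal` (or toward zero when the goal is dropped);
    `obs` is ignored in this toy field.
    """
    target = np.zeros_like(x) if goal is None else np.broadcast_to(goal, x.shape)
    return (target - x) / (1.0 - t)


rng = np.random.default_rng(0)
x_t, t, v_target = cfm_training_pair(np.ones((4, 2)), rng)
actions = sample_actions_cfg(toy_velocity_fn, None, np.array([0.3, -0.2]),
                             shape=(4, 2), rng=rng, guidance_scale=1.0)
```

With `guidance_scale=1.0` the guided field reduces to the conditional one. Larger values weight the prompt more strongly, at the cost of moving samples away from what the conditional field alone would produce.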
Figure 1 : Illustration of a specific table-cleaning task in which all red blocks must be placed in the left staging area, while blue blocks are placed in the right staging area. The numbered annotations indicate one possible sequence of actions considering the feasibility and efficiency.
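One way to picture the goal g (two visual prompts plus a task specifier) and a planner-issued sequence of pushes like the one annotated in Figure 1 is the data structure below. It is only an illustrative sketch: the names `Task`, `Goal`, and `render_prompt`, the square markers, the colors, and all coordinates are hypothetical, and the summary does not state what u1 and u2 each denote.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Tuple

import numpy as np


class Task(Enum):
    """Task specifiers named in the summary."""
    DISPLACEMENT = "displacement"
    GROUPING = "grouping"
    SEPARATION = "separation"


@dataclass(frozen=True)
class Goal:
    """Goal g: a task specifier plus two visual prompts in pixel coordinates."""
    task: Task
    u1: Tuple[int, int]  # (row, col); the exact meaning of u1 is an assumption
    u2: Tuple[int, int]  # (row, col); the exact meaning of u2 is an assumption


def render_prompt(image: np.ndarray, goal: Goal, radius: int = 3) -> np.ndarray:
    """Return a copy of an RGB image with u1 drawn in red and u2 in green."""
    out = image.copy()
    for (row, col), color in ((goal.u1, (255, 0, 0)), (goal.u2, (0, 255, 0))):
        r0, r1 = max(row - radius, 0), min(row + radius + 1, out.shape[0])
        c0, c1 = max(col - radius, 0), min(col + radius + 1, out.shape[1])
        out[r0:r1, c0:c1] = color  # filled square marker, clipped to the image
    return out


# Hypothetical plan with placeholder coordinates, one goal per push.
plan: List[Goal] = [
    Goal(Task.GROUPING, u1=(40, 30), u2=(40, 60)),
    Goal(Task.DISPLACEMENT, u1=(40, 45), u2=(40, 10)),
]
table_image = np.zeros((96, 96, 3), dtype=np.uint8)  # stand-in for a camera frame
prompted_frames = [render_prompt(table_image, g) for g in plan]
```

Keeping the goal as a small immutable record means a high-level planner only has to emit a task label and two image points per step, which is what lets the same low-level policy be reused across planning problems.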

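Experimental Results

Research Questions

  • RQ1: Can multimodal pushing skills be consolidated into a unified policy with competitive performance on each task?
  • RQ2: Does prompting with visual prompts and a task specifier guide pushing actions better than goal-image conditioning?
  • RQ3: Does the policy generalize to unseen objects outside the training set?
  • RQ4: Can the learned pushing policy serve as an effective low-level primitive within a VLM-based planning framework?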

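Key Findings

  • The unified pushing policy achieves success rates higher than or close to the baselines on each task: 85% on displacement, 70% on grouping, and 65% on separation.
  • Compared with single-task policies, the unified policy improves by 10 percentage points on displacement and 10 percentage points on grouping, while matching them on separation.
  • The prompting mechanism outperforms the goal-image conditioning baseline on all tasks, most notably on separation (65% vs. 30%).
  • In cluttered scenes (3 objects), the unified policy maintains strong performance (70% on displacement, 70% on grouping), while the goal-image baseline drops (40% each); with 5 objects, performance declines somewhat, but the unified policy remains higher (50% on displacement, 70% on grouping).
  • Generalization to unseen rectangular objects yields success rates of 70% on both displacement and grouping and 80% on separation.
  • The method enables a VLM planner to solve table-cleaning tasks using Group and Grasp primitives, reaching a 50% task completion rate with an average of 1.52 objects per grasp.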
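The comparisons in the findings can be made explicit with a little arithmetic. The snippet below uses only the percentages quoted in the list. The single-task success rates are not reported in this summary and are derived here from the stated percentage-point gains, so treat them as implied values rather than reported ones. All variable names are illustrative.

```python
# Unified-policy success rates (percent), as quoted in the findings.
unified = {"displacement": 85, "grouping": 70, "separation": 65}

# Stated gains over the single-task policies, in percentage points.
gain_over_single_task = {"displacement": 10, "grouping": 10, "separation": 0}

# Implied single-task rates (derived, not reported in the summary).
single_task_implied = {
    task: unified[task] - gain_over_single_task[task] for task in unified
}

# Gap to the goal-image baseline on separation (65% vs. 30%).
goal_image_separation = 30
separation_gap = unified["separation"] - goal_image_separation

# Gap in the 3-object cluttered scenes (70% vs. 40% on both reported tasks).
clutter3_unified = {"displacement": 70, "grouping": 70}
clutter3_goal_image = {"displacement": 40, "grouping": 40}
clutter3_gap = {
    task: clutter3_unified[task] - clutter3_goal_image[task]
    for task in clutter3_unified
}

print(single_task_implied, separation_gap, clutter3_gap)
```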
Figure 2 : Model Architecture. The input consists of the visual prompt and the latest $T_{\text{obs}}$ steps of image data and robot proprioception. The policy is parameterized by a Diffusion Transformer with alternating self-attention and cross-attention DiT blocks to denoise action tokens $\mathbf


This review was generated by AI and reviewed by human editors.