QUICK REVIEW

[Paper Review] Visual Prompt Guided Unified Pushing Policy

Hieu Bui, Ziyan Gao|arXiv (Cornell University)|Feb 22, 2026

Robotic Path Planning Algorithms | Citations: 0

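One-line summary

Introduces a unified, prompt-guided pushing policy. It learns multimodal non-prehensile pushing skills (displacement, grouping, and separation) via flow matching and visual prompts, is validated on a real robot, and is integrated with a Vision-Language Model planner.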

ABSTRACT

As one of the simplest non-prehensile manipulation skills, pushing has been widely studied as an effective means to rearrange objects. Existing approaches, however, typically rely on multi-step push plans composed of pre-defined pushing primitives with limited application scopes, which restrict their efficiency and versatility across different scenarios. In this work, we propose a unified pushing policy that incorporates a lightweight prompting mechanism into a flow matching policy to guide the generation of reactive, multimodal pushing actions. The visual prompt can be specified by a high-level planner, enabling the reuse of the pushing policy across a wide range of planning problems. Experimental results demonstrate that the proposed unified pushing policy not only outperforms existing baselines but also effectively serves as a low-level primitive within a VLM-guided planning framework to solve table-cleaning tasks efficiently.

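Motivation and objectives

Motivate a flexible, reusable pushing policy for multi-object scenes that goes beyond task-specific, hand-tuned primitives.
Learn, from demonstrations, a single goal-directed policy that can perform displacement, grouping, and separation.
Ensure compatibility with high-level planning by guiding push actions with a visual prompt and a task specifier.
Show that the learned pushing policy generalizes to unseen objects and is useful as a low-level primitive for a Vision-Language Model based planner.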

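Proposed method

Proposes a goal-conditioned flow matching policy that models a time-dependent vector field, learned from demonstrations, which produces executable action chunks.
Uses a prompting mechanism that defines the goal g by combining a visual prompt (u1, u2) on the table image with a task specifier (displacement, grouping, or separation).
Learns p(A_t | O_t, g) from a dataset of expert demonstrations using Conditional Flow Matching.
At inference time, applies Classifier-Free Guidance to steer the generated actions toward prompt-consistent trajectories.
Processes visual inputs with a shared ResNet34 backbone, integrates temporal context with a Transformer Encoder, and uses a DiT-based vector field network with AdaLN conditioning.
Evaluates on a real ROBOTIS OpenManipulator-Y using 550 demonstrations covering the three tasks, comparing against single-task and goal-image-based baselines.
Demonstrates integration with a Vision-Language Model planner, solving table-cleaning tasks with the policy as a low-level primitive.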
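
To make the training and inference steps above more concrete, here is a minimal Python sketch of a conditional flow matching target and of classifier-free guidance during sampling. It is an illustration under stated assumptions, not the authors' implementation: the goal encoding (normalized prompt coordinates plus a one-hot task id), the straight-line probability path, the Euler integrator, the guidance weight `w`, and all function names are hypothetical. In the paper the vector field is a DiT-based network; here `v_fn` is any callable with the same role.

```python
import random

TASKS = ("displacement", "grouping", "separation")

def encode_goal(u1, u2, task):
    """Goal vector g: two prompt points (assumed normalized to [0, 1])
    followed by a one-hot task specifier. The layout is an assumption."""
    one_hot = [1.0 if name == task else 0.0 for name in TASKS]
    return [float(u1[0]), float(u1[1]), float(u2[0]), float(u2[1])] + one_hot

def cfm_training_pair(expert_chunk, rng):
    """Sample (tau, a_tau, target) for a straight-line path from noise to
    the expert action chunk (a common CFM choice, assumed here)."""
    noise = [rng.gauss(0.0, 1.0) for _ in expert_chunk]
    tau = rng.random()  # flow time in [0, 1)
    a_tau = [(1.0 - tau) * n + tau * x for n, x in zip(noise, expert_chunk)]
    target = [x - n for n, x in zip(noise, expert_chunk)]  # velocity to regress
    return tau, a_tau, target

def guided_velocity(v_fn, a, tau, obs, goal, w):
    """Classifier-free guidance: v_uncond + w * (v_cond - v_uncond)."""
    v_uncond = v_fn(a, tau, obs, None)  # goal dropped
    v_cond = v_fn(a, tau, obs, goal)    # goal provided
    return [u + w * (c - u) for u, c in zip(v_uncond, v_cond)]

def sample_action_chunk(v_fn, obs, goal, dim, w=2.0, steps=10, rng=None):
    """Euler-integrate the guided field from noise (tau=0) toward tau=1."""
    rng = rng if rng is not None else random.Random(0)
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    dt = 1.0 / steps
    for k in range(steps):
        v = guided_velocity(v_fn, a, k * dt, obs, goal, w)
        a = [x + dt * vx for x, vx in zip(a, v)]
    return a
```

With `w = 1.0` the sampler follows the conditional field only; larger values extrapolate away from the unconditional prediction, which is the usual way guidance strengthens adherence to the prompt. During training, the network output at `(a_tau, tau, O_t, g)` would be regressed onto `target`, with `g` randomly dropped so that the same network can also serve as the unconditional field.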

Figure 1 : Illustration of a specific table-cleaning task in which all red blocks must be placed in the left staging area, while blue blocks are placed in the right staging area. The numbered annotations indicate one possible sequence of actions considering the feasibility and efficiency.

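Experimental results

Research questions

RQ1: Can multimodal pushing skills be integrated into a unified policy while achieving competitive performance across tasks?
RQ2: Is prompting with a visual prompt and a task specifier better than goal-image conditioning for guiding pushing actions?
RQ3: Does the policy generalize to unseen objects outside the training set?
RQ4: Is the learned pushing policy effective as a low-level primitive within a VLM-based planning framework?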

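Key findings

The unified policy achieves success rates higher than or comparable to the baselines across tasks: 85% for displacement, 70% for grouping, and 65% for separation.
Compared with single-task policies, the unified policy improves displacement by 10 percentage points and grouping by 10 percentage points, and is comparable on separation.
The prompting mechanism outperforms the goal-image-conditioned baseline on all tasks, most notably on separation (65% vs. 30%).
In cluttered scenes with three objects, the unified policy's performance remains high (displacement 70%, grouping 70%), whereas the goal-image baseline drops to 40% on each; with five objects, performance degrades, but the unified policy still reaches 50% on displacement and 70% on grouping.
Generalization to unseen rectangular objects yields success rates of 70% for displacement, 70% for grouping, and 80% for separation.
The approach enables a VLM planner to solve table-cleaning tasks using grouping and grasp primitives, achieving 50% task completion with an average of 1.52 objects per action.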
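
The comparisons in the findings above reduce to simple percentage-point arithmetic, sketched here in Python. All input numbers are the ones reported in this review. The single-task success rates are only implied by the stated 10-point gains, not reported directly, and the variable names are hypothetical.

```python
# Success rates (%) of the unified policy, as reported in this review.
unified = {"displacement": 85, "grouping": 70, "separation": 65}

# Reported gains over single-task policies, in percentage points.
gain_over_single_task = {"displacement": 10, "grouping": 10}

# Reported goal-image baseline success rate (%) on separation.
goal_image_separation = 30

# Implied single-task success rates (derived here, not reported in the source).
implied_single_task = {
    task: unified[task] - gain for task, gain in gain_over_single_task.items()
}

# Gap between prompting and goal-image conditioning on separation.
separation_gap = unified["separation"] - goal_image_separation

print(implied_single_task)  # {'displacement': 75, 'grouping': 60}
print(separation_gap)       # 35
```

Separation is left out of the implied baselines because the review only describes it as comparable, without giving a number.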

Figure 2 : Model Architecture. The input consists of the visual prompt and the latest $T_{\text{obs}}$ steps of image data and robot proprioception. The policy is parameterized by a Diffusion Transformer with alternating self-attention and cross-attention DiT blocks to denoise action tokens $\mathbf

This review was generated by AI and checked by a human editor.