QUICK REVIEW

[Paper Review] Imitating What Works: Simulation-Filtered Modular Policy Learning from Human Videos

Albert J. Zhai, Kuo-Hao Zeng | arXiv (Cornell University) | Feb 13, 2026

Robot Manipulation and Learning | Cited by 0

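One-Sentence Summary

PSI learns modular manipulation from human videos by filtering trajectory data in simulation, yielding task-oriented grasping and post-grasp policies that run on a real robot without any robot data.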

ABSTRACT

The ability to learn manipulation skills by watching videos of humans has the potential to unlock a new source of highly scalable data for robot learning. Here, we tackle prehensile manipulation, in which tasks involve grasping an object before performing various post-grasp motions. Human videos offer strong signals for learning the post-grasp motions, but they are less useful for learning the prerequisite grasping behaviors, especially for robots without human-like hands. A promising way forward is to use a modular policy design, leveraging a dedicated grasp generator to produce stable grasps. However, arbitrary stable grasps are often not task-compatible, hindering the robot's ability to perform the desired downstream motion. To address this challenge, we present Perceive-Simulate-Imitate (PSI), a framework for training a modular manipulation policy using human video motion data processed by paired grasp-trajectory filtering in simulation. This simulation step extends the trajectory data with grasp suitability labels, which allows for supervised learning of task-oriented grasping capabilities. We show through real-world experiments that our framework can be used to learn precise manipulation skills efficiently without any robot data, resulting in significantly more robust performance than using a grasp generator naively.

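Motivation and Objectives

Learn manipulation skills from human videos to reduce the need for robot data.
Bridge the embodiment gap by splitting grasping and post-grasp motion into separate modules.
Introduce simulation-based filtering to ensure that grasps are task-compatible.
Learn a policy that predicts post-grasp trajectories and grasp scores from RGB-D input.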
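The modular split rests on a simple geometric fact: once an object is held rigidly, its pose trajectory determines the gripper trajectory for any given grasp. The following is a minimal sketch of that retargeting, assuming 4x4 homogeneous transforms and a rigid grasp. The function names (`compose`, `translation`, `retarget`) are illustrative and do not come from the paper.

```python
from typing import List

Transform = List[List[float]]  # 4x4 homogeneous transform, row-major


def compose(a: Transform, b: Transform) -> Transform:
    """Matrix product a @ b for 4x4 transforms."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def translation(x: float, y: float, z: float) -> Transform:
    """Pure translation, used here only to build a toy example."""
    return [[1.0, 0.0, 0.0, x],
            [0.0, 1.0, 0.0, y],
            [0.0, 0.0, 1.0, z],
            [0.0, 0.0, 0.0, 1.0]]


def retarget(object_poses: List[Transform],
             object_T_gripper: Transform) -> List[Transform]:
    """Gripper pose per step: world_T_gripper = world_T_object @ object_T_gripper.

    Assumes the grasp is rigid, so object_T_gripper stays constant.
    """
    return [compose(world_T_object, object_T_gripper)
            for world_T_object in object_poses]


# Toy demonstration: the object is lifted by 10 cm while the gripper
# holds it at a 5 cm offset along the object's x axis.
object_poses = [translation(0.0, 0.0, 0.0), translation(0.0, 0.0, 0.1)]
gripper_poses = retarget(object_poses, translation(0.05, 0.0, 0.0))
```

Because the same object trajectory can be composed with a different `object_T_gripper`, the demonstration is not tied to the human hand that produced it.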

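Proposed Method

Represent demonstrations as 6-DoF object poses, which serve as embodiment-agnostic motion trajectories.
Use a simulation step to filter trajectories and assign grasp suitability labels to each one.
Train a behavior cloning policy that takes an RGB image, an object mask, and a 2D goal point as input and outputs a 6-DoF post-grasp trajectory and K grasp scores.
Combine the learned grasp scorer with any external grasp generator in a modular execution pipeline.
Evaluate two pose tracking pipelines (model-based FoundationPose and model-free ICP), and compare trajectory flow against direct 6D pose targets.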
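The filtering and execution steps listed here can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: `is_feasible` is a hypothetical callback standing in for a simulated rollout of one grasp along one trajectory, and discarding trajectories for which no candidate grasp succeeds is an assumed filtering rule.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence


@dataclass
class LabeledTrajectory:
    object_poses: Any          # the 6-DoF object pose trajectory
    grasp_labels: List[bool]   # one suitability label per candidate grasp


def label_and_filter(
    trajectories: Sequence[Any],
    candidate_grasps: Sequence[Any],
    is_feasible: Callable[[Any, Any], bool],  # hypothetical simulation check
) -> List[LabeledTrajectory]:
    """Attach grasp suitability labels and drop unusable trajectories."""
    kept = []
    for trajectory in trajectories:
        labels = [is_feasible(grasp, trajectory) for grasp in candidate_grasps]
        # Assumed rule: a trajectory that no grasp can execute is discarded.
        if any(labels):
            kept.append(LabeledTrajectory(trajectory, labels))
    return kept


def select_grasp(candidates: Sequence[Any], scores: Sequence[float]) -> Any:
    """At execution time, keep the externally generated grasp with the top score."""
    if len(candidates) != len(scores):
        raise ValueError("need exactly one score per candidate grasp")
    best_index = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best_index]


# Toy usage with string placeholders instead of real poses and grasps.
demo_trajectories = ["pour_demo", "corrupted_demo"]
demo_grasps = ["top_grasp", "side_grasp"]
dataset = label_and_filter(
    demo_trajectories,
    demo_grasps,
    is_feasible=lambda g, t: t == "pour_demo" and g == "side_grasp",
)
chosen = select_grasp(demo_grasps, scores=[0.2, 0.9])
```

The labels produced by `label_and_filter` play the role of supervised targets for the K grasp scores, while `select_grasp` shows how a learned scorer could rank proposals from any grasp generator.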

Figure 1 : Modular prehensile imitation learning. Human videos are well-suited for learning post-grasp motions but are not suitable for learning grasping for non-anthropomorphic end-effectors. Separating these subtasks via a modular policy design allows for dedicated post-grasp learning. However, ex

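Experimental Results

Research Questions

RQ1: Can cross-embodiment imitation learn precise prehensile manipulation from human videos alone?
RQ2: Does simulation-based filtering produce task-compatible grasps and improve policy performance?
RQ3: Does a 6-DoF pose representation outperform flow as a representation for learning from human videos?
RQ4: How well does PSI generalize across different robot embodiments?
RQ5: How does pretraining on HOI4D data affect sample efficiency?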

Key Findings

Method	P&P	Pour	Stir	Draw
No trajectory filtering (FP)	6/20	12/20	16/20	12/20
Naive grasp (FP)	5/20	8/20	10/20	1/20
Ours (FP)	16/20	13/20	20/20	12/20
No trajectory filtering (ICP)	10/20	8/20	8/20	0/20
Naive grasp (ICP)	4/20	7/20	11/20	0/20
Ours (ICP)	15/20	13/20	18/20	0/20
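To compare methods at a glance, the per-task counts in the table can be pooled into one success rate per method. The sketch below copies the counts from the table and assumes each cell is successes out of 20 trials, as the "x/20" entries indicate. The pooled average is an illustrative aggregate, not a metric reported in the source.

```python
TRIALS_PER_TASK = 20
TASKS = ["P&P", "Pour", "Stir", "Draw"]

# Successes per task, in the order of TASKS, copied from the table.
SUCCESSES = {
    "No trajectory filtering (FP)": [6, 12, 16, 12],
    "Naive grasp (FP)": [5, 8, 10, 1],
    "Ours (FP)": [16, 13, 20, 12],
    "No trajectory filtering (ICP)": [10, 8, 8, 0],
    "Naive grasp (ICP)": [4, 7, 11, 0],
    "Ours (ICP)": [15, 13, 18, 0],
}


def pooled_success_rate(successes, trials_per_task=TRIALS_PER_TASK):
    """Total successes divided by total trials across all tasks."""
    return sum(successes) / (trials_per_task * len(successes))


rates = {method: pooled_success_rate(counts) for method, counts in SUCCESSES.items()}

# Print methods from highest to lowest pooled success rate.
for method, rate in sorted(rates.items(), key=lambda item: item[1], reverse=True):
    print(f"{method}: {rate:.1%}")
```

With these counts, Ours (FP) totals 61/80 successes, compared with 46/80 for No trajectory filtering (FP) and 24/80 for Naive grasp (FP).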

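PSI trains real-world manipulation policies without any robot data and outperforms the naive-grasp baseline.
Trajectory filtering combined with task-oriented grasp scoring improves success rates across the four tasks; in the table above, the exceptions are Draw with FP, which matches the no-filtering baseline at 12/20, and Draw with ICP, where every variant scores 0/20.
Directly predicting 6-DoF post-grasp poses outperforms the flow-based approach for post-grasp motion.
Pretraining on HOI4D yields strong gains on most tasks; the pour task is comparatively more rotation-heavy.
PSI generalizes across multiple robot embodiments (xArm7, Franka Panda, Kinova Gen3, UR5e) with robust results.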

Figure 2 : Task-compatibility for grasps. Even though a grasp may be stable, it may not be compatible with the downstream task. With a firm right hand underhand grip on the door handle (right), it becomes very difficult to turn the handle clockwise. Task-agnostic grasp generators fall short in solvi

This review was generated by AI and reviewed by human editors.