QUICK REVIEW

[Paper Review] Imitating What Works: Simulation-Filtered Modular Policy Learning from Human Videos

Albert J. Zhai, Kuo-Hao Zeng | arXiv (Cornell University) | 2026-02-13

Robot Manipulation and Learning | Citations: 0

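One-Line Summary

PSI learns modular manipulation from human videos: it filters trajectory data in simulation to train task-oriented grasping and post-grasp policies, which enables real-world robot manipulation without any robot data.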

ABSTRACT

The ability to learn manipulation skills by watching videos of humans has the potential to unlock a new source of highly scalable data for robot learning. Here, we tackle prehensile manipulation, in which tasks involve grasping an object before performing various post-grasp motions. Human videos offer strong signals for learning the post-grasp motions, but they are less useful for learning the prerequisite grasping behaviors, especially for robots without human-like hands. A promising way forward is to use a modular policy design, leveraging a dedicated grasp generator to produce stable grasps. However, arbitrary stable grasps are often not task-compatible, hindering the robot's ability to perform the desired downstream motion. To address this challenge, we present Perceive-Simulate-Imitate (PSI), a framework for training a modular manipulation policy using human video motion data processed by paired grasp-trajectory filtering in simulation. This simulation step extends the trajectory data with grasp suitability labels, which allows for supervised learning of task-oriented grasping capabilities. We show through real-world experiments that our framework can be used to learn precise manipulation skills efficiently without any robot data, resulting in significantly more robust performance than using a grasp generator naively.

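Research Motivation and Goals

Learn manipulation skills from human videos to reduce the need for robot data.
Separate grasping from post-grasp motion in a modular design to bridge the embodiment gap.
Introduce simulation-based filtering to ensure that grasps are task-compatible.
Train a policy that predicts post-grasp trajectories and grasp scores from RGB-D input.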
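
The last goal, one policy that predicts both a post-grasp trajectory and per-grasp scores, can be illustrated with a joint supervised loss. The sketch below is an assumption for illustration only: the review does not state the loss, so the mean-squared-error trajectory term, the binary cross-entropy grasp term, the `grasp_weight` parameter, and the function name `bc_loss` are all hypothetical.

```python
import math
from typing import Sequence


def bc_loss(
    pred_traj: Sequence[Sequence[float]],
    target_traj: Sequence[Sequence[float]],
    grasp_logits: Sequence[float],
    grasp_labels: Sequence[int],
    grasp_weight: float = 1.0,  # hypothetical weighting, not from the paper
) -> float:
    """Trajectory regression plus per-grasp binary classification (assumed form)."""
    # Mean squared error over every component of every waypoint.
    squared = [
        (p - t) ** 2
        for wp_pred, wp_true in zip(pred_traj, target_traj)
        for p, t in zip(wp_pred, wp_true)
    ]
    traj_term = sum(squared) / len(squared)

    # Numerically stable binary cross-entropy on raw scores (logits):
    # max(z, 0) - z * y + log(1 + exp(-|z|)).
    bce = [
        max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
        for z, y in zip(grasp_logits, grasp_labels)
    ]
    grasp_term = sum(bce) / len(bce)

    return traj_term + grasp_weight * grasp_term


# Perfect trajectory, uninformative grasp scores: only the grasp term remains.
print(bc_loss([[0.0, 1.0]], [[0.0, 1.0]], [0.0, 0.0], [1, 0]))
```

With all-zero logits the grasp term equals log 2, the cross-entropy of a coin flip, so the printed value shows how much loss is attributable to grasp scoring alone.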

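Proposed Method

Represent demonstrations as embodiment-agnostic motion trajectories, expressed as 6-DoF object poses.
Use a simulation step to filter the trajectories and assign grasp suitability labels to each one.
Train a behavior cloning policy that takes an RGB image, an object mask, and a 2D goal point, and outputs a 6-DoF post-grasp trajectory and K grasp scores.
Combine the learned grasp scorer with an external grasp generator in a modular execution pipeline.
Evaluate two pose tracking pipelines, model-based FoundationPose and model-free ICP, and compare 6D pose targets against flow.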
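
The filtering and execution steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `can_execute` stands in for the simulated rollout, the rule of dropping trajectories that no candidate grasp can execute is an assumed filtering criterion, and `filter_and_label`, `select_grasp`, and the toy height check are hypothetical names and logic.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence


@dataclass
class LabeledTrajectory:
    trajectory: Any
    grasp_labels: List[int]  # one 0/1 suitability label per candidate grasp


def filter_and_label(
    trajectories: Sequence[Any],
    grasps: Sequence[Any],
    can_execute: Callable[[Any, Any], bool],  # stand-in for a simulator rollout
) -> List[LabeledTrajectory]:
    """Label every (grasp, trajectory) pair and keep only usable trajectories."""
    kept = []
    for traj in trajectories:
        labels = [int(can_execute(g, traj)) for g in grasps]
        if any(labels):  # assumed rule: drop trajectories no grasp can execute
            kept.append(LabeledTrajectory(traj, labels))
    return kept


def select_grasp(scores: Sequence[float], is_stable: Sequence[bool]) -> int:
    """Pick the best-scored grasp among those the external generator deems stable."""
    candidates = [i for i, ok in enumerate(is_stable) if ok]
    if not candidates:
        raise ValueError("no stable grasp available")
    return max(candidates, key=lambda i: scores[i])


# Toy stand-in for simulation: a trajectory is a list of object heights, a grasp
# is a vertical gripper offset, and a pair fails if the gripper dips below 0.
def toy_can_execute(grasp_offset: float, heights: Sequence[float]) -> bool:
    return all(z + grasp_offset >= 0.0 for z in heights)


grasps = [-0.05, 0.0, 0.10]
trajs = [[0.10, 0.20, 0.10], [0.02, 0.01, 0.03], [-0.20, -0.30]]
data = filter_and_label(trajs, grasps, toy_can_execute)
print([d.grasp_labels for d in data])
print(select_grasp(scores=[0.9, 0.4, 0.7], is_stable=[False, True, True]))
```

In the toy data the third trajectory is discarded because every grasp fails, while the second keeps a label of 0 for the grasp that fails, which is the kind of signal a grasp scorer could be trained on. At execution time the highest-scored grasp overall (index 0) is skipped because the generator marks it unstable.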

Figure 1 : Modular prehensile imitation learning. Human videos are well-suited for learning post-grasp motions but are not suitable for learning grasping for non-anthropomorphic end-effectors. Separating these subtasks via a modular policy design allows for dedicated post-grasp learning. However, ex

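Experimental Results

Research Questions

RQ1: Can cross-embodiment imitation learn precise prehensile manipulation from human videos alone?
RQ2: Does simulation-based filtering produce task-compatible grasps and improve policy performance?
RQ3: Is 6-DoF pose a better representation than flow when learning from human videos?
RQ4: Does PSI generalize across different robot embodiments?
RQ5: How does pretraining on HOI4D data affect sample efficiency?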

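Key Results

Successes out of 20 trials per task (FP: FoundationPose tracking, ICP: ICP tracking).
Method	P&P	Pour	Stir	Draw
No trajectory filtering (FP)	6/20	12/20	16/20	12/20
Naive grasp (FP)	5/20	8/20	10/20	1/20
Proposed method (FP)	16/20	13/20	20/20	12/20
No trajectory filtering (ICP)	10/20	8/20	8/20	0/20
Naive grasp (ICP)	4/20	7/20	11/20	0/20
Proposed method (ICP)	15/20	13/20	18/20	0/20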
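
To compare rows of the table at a glance, the per-task counts can be pooled into a single total per method. The short script below does only that arithmetic; the pooled totals are a convenience aggregate computed here, not figures reported in the source, and the English task names are translations of the column headers.

```python
TRIALS_PER_TASK = 20
TASKS = ("P&P", "Pour", "Stir", "Draw")

# Success counts copied from the table above; keys are (method, tracker).
SUCCESSES = {
    ("No trajectory filtering", "FP"): (6, 12, 16, 12),
    ("Naive grasp", "FP"): (5, 8, 10, 1),
    ("Proposed method", "FP"): (16, 13, 20, 12),
    ("No trajectory filtering", "ICP"): (10, 8, 8, 0),
    ("Naive grasp", "ICP"): (4, 7, 11, 0),
    ("Proposed method", "ICP"): (15, 13, 18, 0),
}


def pooled(counts):
    """Return (total successes, total trials) pooled over all tasks."""
    return sum(counts), TRIALS_PER_TASK * len(counts)


for (method, tracker), counts in SUCCESSES.items():
    wins, trials = pooled(counts)
    print(f"{method} ({tracker}): {wins}/{trials}")
```

Pooled this way, the proposed method reaches 61/80 with FoundationPose tracking against 24/80 for the naive grasp baseline, and 46/80 against 22/80 with ICP tracking.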

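PSI enables real-world manipulation policies trained without any robot data and outperforms the naive grasp baseline.
Trajectory filtering and task-oriented grasp scoring substantially improve success rates over the naive grasp baseline: with FoundationPose tracking, P&P rises from 5/20 to 16/20 and Draw from 1/20 to 12/20. With ICP tracking, Draw fails for every method (0/20).
For post-grasp actions, directly predicting 6-DoF poses outperforms flow-based approaches.
Pretraining PSI on HOI4D gives strong benefits on most tasks; the Pour task is comparatively more rotation-centric.
PSI generalizes with robust results across multiple robot embodiments, including xArm7, Franka Panda, Kinova Gen3, and UR5e.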

Figure 2 : Task-compatibility for grasps. Even though a grasp may be stable, it may not be compatible with the downstream task. With a firm right hand underhand grip on the door handle (right), it becomes very difficult to turn the handle clockwise. Task-agnostic grasp generators fall short in solvi

This review was generated by AI and reviewed by a human editor.