QUICK REVIEW

[Paper Review] VIP: Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training

Yecheng Jason Ma, Shagun Sodhani|arXiv (Cornell University)|2022. 09. 30.

Citations: 35

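One-Line Summary

VIP learns, from diverse human videos, a generalizable visual representation and a dense reward function for unseen robot tasks, enabling reward-based control and few-shot offline RL without fine-tuning the representation on task-specific data.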

ABSTRACT

Reward and representation learning are two long-standing challenges for learning an expanding set of robot manipulation skills from sensory observations. Given the inherent cost and scarcity of in-domain, task-specific robot data, learning from large, diverse, offline human videos has emerged as a promising path towards acquiring a generally useful visual representation for control; however, how these human videos can be used for general-purpose reward learning remains an open question. We introduce $\textbf{V}$alue-$\textbf{I}$mplicit $\textbf{P}$re-training (VIP), a self-supervised pre-trained visual representation capable of generating dense and smooth reward functions for unseen robotic tasks. VIP casts representation learning from human videos as an offline goal-conditioned reinforcement learning problem and derives a self-supervised dual goal-conditioned value-function objective that does not depend on actions, enabling pre-training on unlabeled human videos. Theoretically, VIP can be understood as a novel implicit time contrastive objective that generates a temporally smooth embedding, enabling the value function to be implicitly defined via the embedding distance, which can then be used to construct the reward for any goal-image specified downstream task. Trained on large-scale Ego4D human videos and without any fine-tuning on in-domain, task-specific data, VIP's frozen representation can provide dense visual reward for an extensive set of simulated and $\textbf{real-robot}$ tasks, enabling diverse reward-based visual control methods and significantly outperforming all prior pre-trained representations. Notably, VIP can enable simple, $\textbf{few-shot}$ offline RL on a suite of real-world robot tasks with as few as 20 trajectories.
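The abstract's idea of a value function "implicitly defined via the embedding distance" can be illustrated with a short sketch. This is not the authors' code: it assumes the value is the negative L2 distance between an observation embedding and the goal embedding, and that the reward for a transition is the change in that value. The function names and the toy 2-D embeddings are hypothetical stand-ins for the outputs of a frozen encoder.

```python
import numpy as np

def implicit_value(phi_o, phi_g):
    """Assumed value: negative L2 distance from an embedding to the goal embedding."""
    return -np.linalg.norm(phi_o - phi_g, axis=-1)

def embedding_reward(phi_o, phi_next, phi_g):
    """Assumed dense reward: how much closer the next embedding is to the goal."""
    return implicit_value(phi_next, phi_g) - implicit_value(phi_o, phi_g)

# Toy 2-D embeddings (hypothetical values) standing in for encoder outputs.
goal = np.array([0.0, 0.0])
traj = np.array([[3.0, 4.0], [3.0, 0.0], [1.0, 0.0], [0.0, 0.0]])

# Distances to the goal are 5, 3, 1, 0, so the per-step rewards are 2, 2, 1.
rewards = embedding_reward(traj[:-1], traj[1:], goal)
print(rewards)
```

Under these assumptions the rewards along a trajectory telescope: their sum equals the total reduction in distance to the goal, so a trajectory that approaches the goal steadily receives positive reward at every step.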

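Motivation and Goals

Motivates the need for generalizable perception and reward learning across diverse robot manipulation tasks.
Proposes a self-supervised pre-training objective that yields both a visual representation and a dense reward for unseen tasks.
Shows that offline human video data can yield smooth, goal-directed reward functions through a dual RL formulation.
Shows that VIP enables few-shot offline RL on real robots and improves performance across simulated and real-world tasks.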

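Proposed Method

Formulates representation learning from out-of-domain human videos as an offline goal-conditioned RL problem.
Derives, via Fenchel duality, a self-supervised dual objective for the value function that does not require robot actions.
Interprets the dual objective as an implicit time-contrastive learning objective that yields a temporally smooth embedding.
Instantiates VIP with a simple, implementable objective (the value is defined by embedding distance, from which the reward is constructed) and a practical training loop based on sub-trajectory sampling.
Trains a ResNet50 backbone on Ego4D data to produce a frozen representation, which serves as both the reward and the perception backbone for downstream tasks.
Provides a minimal PyTorch implementation of the training objective (a few lines of code) to aid reproducibility.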
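The steps above can be sketched roughly as follows. This is a forward-only NumPy illustration, not the authors' PyTorch implementation, and it rests on assumptions this review does not state: the value is the negative L2 embedding distance, every non-goal step has reward -1, the discount `gamma` is 0.98, and the loss combines an attractive term on initial frames with a log-mean-exp of the one-step Bellman residual. `sample_subtrajectory` and `vip_loss` are hypothetical names; actual training would need an autodiff framework and a real encoder.

```python
import numpy as np

def sample_subtrajectory(num_frames, rng):
    """Pick frame indices (start, t, t+1, goal) from one video, in temporal order."""
    start = int(rng.integers(0, num_frames - 2))      # leave room for at least 3 frames
    goal = int(rng.integers(start + 2, num_frames))   # goal lies at least 2 frames later
    t = int(rng.integers(start, goal))                # t in [start, goal - 1]
    return start, t, t + 1, goal

def vip_loss(phi_start, phi_t, phi_next, phi_goal, gamma=0.98):
    """Assumed VIP-style loss on batches of embeddings, each of shape (batch, dim)."""
    v_start = -np.linalg.norm(phi_start - phi_goal, axis=-1)
    v_t = -np.linalg.norm(phi_t - phi_goal, axis=-1)
    v_next = -np.linalg.norm(phi_next - phi_goal, axis=-1)
    # Attractive term: pull each initial frame toward its goal frame.
    attract = (1.0 - gamma) * (-v_start).mean()
    # Repulsive term: log-mean-exp of V(t) - r - gamma * V(t+1) with r = -1,
    # computed with the max subtracted for numerical stability.
    z = v_t + 1.0 - gamma * v_next
    z_max = z.max()
    repel = z_max + np.log(np.mean(np.exp(z - z_max)))
    return float(attract + repel)

rng = np.random.default_rng(0)
start, t, t_next, goal = sample_subtrajectory(num_frames=50, rng=rng)
# Hypothetical batch: 4 frame roles x 8 samples x 16-dim embeddings.
batch = rng.normal(size=(4, 8, 16))
loss = vip_loss(*batch)
print(start, t, t_next, goal, loss)
```

No robot actions appear anywhere in the sketch: only frame indices and embeddings are needed, which is what makes pre-training on unlabeled human video possible.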

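Experimental Results

Research Questions

RQ1: Can a universal visual reward be learned from entirely out-of-domain human videos?
RQ2: How does offline, action-free human video data yield a goal-conditioned value function that is useful for robot tasks?
RQ3: Does the embedding space induced by VIP provide dense, smooth rewards that enable effective downstream visuomotor control?
RQ4: With minimal task-specific data, how few demonstrations does VIP need to enable offline RL on a real robot?

Key Results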

Environment	VIP-RWR (Pre-Trained)	VIP-BC (Pre-Trained)	R3M-RWR (Pre-Trained)	R3M-BC (Pre-Trained)	Scratch-BC
CloseDrawer	100 ± 0	50 ± 50	80 ± 40	10 ± 30	30 ± 46
PushBottle	90 ± 30	50 ± 50	70 ± 46	50 ± 50	40 ± 48
PlaceMelon	60 ± 48	10 ± 30	0 ± 0	0 ± 0	0 ± 0
FoldTowel	90 ± 30	20 ± 40	0 ± 0	0 ± 0	0 ± 0
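The RWR and BC columns in the table differ in how demonstration transitions are weighted during policy learning. As a rough illustration, assuming the common form of reward-weighted regression in which each transition is weighted by the exponentiated reward, behavior cloning corresponds to uniform weights. The squared-error action loss, the `temperature` parameter, and the function names are hypothetical choices for this sketch, not details taken from the paper.

```python
import numpy as np

def rwr_weights(rewards, temperature=1.0):
    """Weights proportional to exp(temperature * reward), normalised to mean 1."""
    scaled = temperature * np.asarray(rewards, dtype=float)
    w = np.exp(scaled - scaled.max())  # subtract the max for numerical stability
    return w / w.mean()

def weighted_bc_loss(pred_actions, demo_actions, weights):
    """Weighted mean squared error between predicted and demonstrated actions."""
    per_sample = np.sum((pred_actions - demo_actions) ** 2, axis=-1)
    return float(np.mean(weights * per_sample))

# Toy data (hypothetical): three transitions with 2-D actions.
demo = np.zeros((3, 2))
pred = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
rewards = np.array([0.0, 1.0, 2.0])  # e.g. embedding-distance rewards

bc = weighted_bc_loss(pred, demo, np.ones(3))             # plain behavior cloning
rwr = weighted_bc_loss(pred, demo, rwr_weights(rewards))  # reward-weighted variant
print(bc, rwr)
```

With uniform weights the two losses coincide, so in this sketch any difference between an RWR and a BC policy comes entirely from the reward signal used for weighting.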

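Trained on Ego4D human videos, VIP provides dense visual rewards for unseen robot tasks and outperforms prior representations in reward-based settings.
With MPPI trajectory optimization, VIP makes progress on hard tasks, reaching a cumulative success rate of up to about 44% under a larger compute budget.
In online RL, VIP-based representations achieve higher cumulative success rates than the baselines.
In real-world settings, VIP enables offline RL with as few as 20 task-specific trajectories, outperforming in-domain VIP variants and several baselines.
Qualitative analysis shows that VIP embeddings are temporally smooth, that their reward landscapes contain fewer distractors than those of the baselines, and that they correlate with dense ground-truth state rewards (R² up to 0.95 for some camera views).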


This review was generated by AI and checked by a human editor.