QUICK REVIEW

[Paper Review] VIP: Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training

Yecheng Jason Ma, Shagun Sodhani|arXiv (Cornell University)|Sep 30, 2022

Citations: 35

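One-Line Summary

VIP learns, from diverse human videos, a universal visual representation and dense reward functions for unseen robot tasks. Its frozen representation supports reward-based control without fine-tuning on task-specific data and enables few-shot offline reinforcement learning.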

ABSTRACT

Reward and representation learning are two long-standing challenges for learning an expanding set of robot manipulation skills from sensory observations. Given the inherent cost and scarcity of in-domain, task-specific robot data, learning from large, diverse, offline human videos has emerged as a promising path towards acquiring a generally useful visual representation for control; however, how these human videos can be used for general-purpose reward learning remains an open question. We introduce $\textbf{V}$alue-$\textbf{I}$mplicit $\textbf{P}$re-training (VIP), a self-supervised pre-trained visual representation capable of generating dense and smooth reward functions for unseen robotic tasks. VIP casts representation learning from human videos as an offline goal-conditioned reinforcement learning problem and derives a self-supervised dual goal-conditioned value-function objective that does not depend on actions, enabling pre-training on unlabeled human videos. Theoretically, VIP can be understood as a novel implicit time contrastive objective that generates a temporally smooth embedding, enabling the value function to be implicitly defined via the embedding distance, which can then be used to construct the reward for any goal-image specified downstream task. Trained on large-scale Ego4D human videos and without any fine-tuning on in-domain, task-specific data, VIP's frozen representation can provide dense visual reward for an extensive set of simulated and $\textbf{real-robot}$ tasks, enabling diverse reward-based visual control methods and significantly outperforming all prior pre-trained representations. Notably, VIP can enable simple, $\textbf{few-shot}$ offline RL on a suite of real-world robot tasks with as few as 20 trajectories.

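Motivation and Objectives

Motivate the need for generalizable perception and reward learning across diverse robotic manipulation tasks.
Propose a self-supervised pre-training objective that yields both a visual representation and dense rewards for unseen tasks.
Show that offline human video data, cast in a dual RL formulation, can yield smooth, goal-directed reward functions.
Show that VIP enables few-shot offline RL on real-world robots and improves performance on both simulated and real-robot tasks.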

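Proposed Method

Formulate representation learning from out-of-domain human videos as an offline goal-conditioned RL problem.
Derive, via Fenchel duality, a self-supervised dual objective over the value function that requires no robot actions.
Interpret the dual objective as an implicit time-contrastive learning objective that produces a temporally smooth embedding.
Instantiate VIP with a simple, implementable objective (the value is defined implicitly by embedding distance, from which the reward is built) and a practical training loop that includes sub-trajectory sampling.
Train a ResNet50 on Ego4D data to produce a frozen representation that serves as both the reward and the perception backbone for downstream tasks.
Provide a minimal PyTorch implementation of the training objective (a few lines of code) to aid reproducibility.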
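
The two terms of the objective described above can be sketched as follows. This is a minimal illustration, not the authors' code. It assumes the value is the negative L2 distance between a frame embedding and the goal embedding, a constant reward of -1 for non-goal frames, and a discount `gamma` of 0.98. To stay runnable without PyTorch it operates on pre-computed embeddings given as plain Python lists. The function names (`neg_l2`, `log_mean_exp`, `vip_loss`) and the default values are assumptions made for this sketch, and the exact signs and sampling scheme should be checked against the paper.

```python
import math


def neg_l2(a, b):
    """Assumed value proxy: V(o; g) = -||phi(o) - phi(g)||_2."""
    return -math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def log_mean_exp(values):
    """Numerically stable log(mean(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def vip_loss(e_init, e_t, e_next, e_goal, gamma=0.98, reward=-1.0):
    """Loss for one batch of pre-computed embeddings.

    e_init, e_t, e_next, e_goal are equal-length lists of vectors holding the
    embeddings of the initial frame, a middle frame, its successor, and the
    goal frame of each sampled sub-trajectory (hypothetical sampling).
    """
    v_init = [neg_l2(o, g) for o, g in zip(e_init, e_goal)]
    v_t = [neg_l2(o, g) for o, g in zip(e_t, e_goal)]
    v_next = [neg_l2(o, g) for o, g in zip(e_next, e_goal)]

    # Term 1: pull initial frames toward their goal frames.
    attract = (1.0 - gamma) * (-sum(v_init) / len(v_init))

    # Term 2: log-mean-exp of the negated one-step temporal-difference error.
    neg_td = [-(reward + gamma * vn - vt) for vt, vn in zip(v_t, v_next)]
    return attract + log_mean_exp(neg_td)


# Toy batch of two sub-trajectories with 2-D embeddings (made-up numbers).
loss = vip_loss(
    e_init=[[3.0, 4.0], [1.0, 0.0]],
    e_t=[[2.0, 2.0], [0.5, 0.0]],
    e_next=[[1.0, 1.0], [0.2, 0.0]],
    e_goal=[[0.0, 0.0], [0.0, 0.0]],
)
print(loss)
```

In a real training loop the same two terms would be computed on tensors produced by the encoder, so that gradients can flow back into it.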

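Experimental Results

Research Questions

RQ1 Can a universal visual reward function be learned from out-of-domain human videos alone?
RQ2 Can offline, action-free human video data yield goal-conditioned value functions that are useful for robot tasks?
RQ3 Does the VIP embedding space provide dense, smooth rewards that make downstream visuomotor control effective?
RQ4 To what extent does VIP enable few-shot offline RL on real-world robots with minimal task-specific data?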
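
To make RQ3 concrete, the sketch below shows one way a dense reward could be read off an embedding space given a goal image: the reward for a step is the decrease in embedding distance to the goal. This is an illustrative assumption rather than the paper's exact definition. The embeddings are hand-written stand-ins for the output of a frozen encoder, and `embedding_distance` and `goal_rewards` are hypothetical helper names.

```python
import math


def embedding_distance(a, b):
    """L2 distance between two embedding vectors given as lists of floats."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def goal_rewards(embeddings, goal_embedding):
    """Per-step reward: how much closer to the goal the next frame is."""
    d = [embedding_distance(e, goal_embedding) for e in embeddings]
    return [d[t] - d[t + 1] for t in range(len(d) - 1)]


# Made-up 2-D embeddings of four consecutive frames and of the goal image.
trajectory = [[4.0, 0.0], [2.0, 0.0], [3.0, 0.0], [0.0, 0.0]]
goal = [0.0, 0.0]

print(goal_rewards(trajectory, goal))  # [2.0, -1.0, 3.0]
```

The rewards telescope: their sum equals the total reduction in distance to the goal, so a step that moves away from the goal (the second one here) is penalized and later progress is credited.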

Key Findings

Success rate (%) on real-robot tasks:
Environment	VIP-RWR (pre-trained)	VIP-BC (pre-trained)	R3M-RWR (pre-trained)	R3M-BC (pre-trained)	Scratch-BC (trained from scratch)
CloseDrawer	100 ± 0	50 ± 50	80 ± 40	10 ± 30	30 ± 46
PushBottle	90 ± 30	50 ± 50	70 ± 46	50 ± 50	40 ± 48
PlaceMelon	60 ± 48	10 ± 30	0 ± 0	0 ± 0	0 ± 0
FoldTowel	90 ± 30	20 ± 40	0 ± 0	0 ± 0	0 ± 0
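
As a quick reading aid for the table above, the snippet below averages each column over the four tasks. The unweighted per-method mean is a derived quantity computed here for illustration. It is not a number reported in the review, and it ignores the spread after each ± sign.

```python
# Mean values copied from the table, in the order
# CloseDrawer, PushBottle, PlaceMelon, FoldTowel.
results = {
    "VIP-RWR": [100, 90, 60, 90],
    "VIP-BC": [50, 50, 10, 20],
    "R3M-RWR": [80, 70, 0, 0],
    "R3M-BC": [10, 50, 0, 0],
    "Scratch-BC": [30, 40, 0, 0],
}

# Unweighted average over the four tasks for each method.
averages = {name: sum(vals) / len(vals) for name, vals in results.items()}

for name, avg in averages.items():
    print(f"{name}: {avg:.1f}")
# VIP-RWR: 85.0
# VIP-BC: 32.5
# R3M-RWR: 37.5
# R3M-BC: 15.0
# Scratch-BC: 17.5
```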

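Trained on Ego4D human videos, VIP provides dense visual rewards for unseen robot tasks and outperforms prior representations in reward-based settings.
With MPPI trajectory optimization, VIP makes non-trivial progress on difficult tasks, reaching an aggregate success rate of up to about 44% under a larger compute budget.
In online RL, VIP-based representations yield substantially higher aggregate success rates than the baselines.
VIP enables real-world few-shot offline RL with as few as 20 trajectories, outperforming in-domain VIP variants and several baselines.
Qualitative analysis shows that VIP embeddings are temporally smooth, with reward landscapes that have fewer bumps than the baselines, and that in some views the embedding reward correlates with the ground-truth dense state-based reward (R² up to 0.95).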
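
The R² figure in the last finding measures how well one reward signal tracks another. The review does not describe how the authors computed it, so the sketch below assumes the common choice of the squared Pearson correlation between an embedding-based reward and a ground-truth reward. The helper name `r_squared` and the sample numbers are made up for illustration.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("R^2 is undefined for a constant sequence")
    return (sxy * sxy) / (sxx * syy)


# Hypothetical per-frame rewards along one trajectory.
embedding_reward = [-4.1, -3.0, -2.2, -0.9, -0.1]
ground_truth_reward = [-4.0, -3.0, -2.0, -1.0, 0.0]

print(r_squared(embedding_reward, ground_truth_reward))
```

A value close to 1 means the embedding-based reward is nearly a linear rescaling of the ground-truth reward along that trajectory.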

This review was generated by AI and checked by a human editor.