QUICK REVIEW

[Paper Review] Offline Visual Representation Learning for Embodied Navigation

Karmesh Yadav, Ram Ramrakhya|arXiv (Cornell University)|Apr 27, 2022

Multimodal Machine Learning Applications | Citations: 24

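One-Line Summary

OVRL pretrains visual representations offline with self-supervised learning on large-scale indoor images, then finetunes visuomotor representations online, achieving state-of-the-art results on ImageNav and ObjectNav.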

ABSTRACT

How should we learn visual representations for embodied agents that must see and move? The status quo is tabula rasa in vivo, i.e. learning visual representations from scratch while also learning to move, potentially augmented with auxiliary tasks (e.g. predicting the action taken between two successive observations). In this paper, we show that an alternative 2-stage strategy is far more effective: (1) offline pretraining of visual representations with self-supervised learning (SSL) using large-scale pre-rendered images of indoor environments (Omnidata), and (2) online finetuning of visuomotor representations on specific tasks with image augmentations under long learning schedules. We call this method Offline Visual Representation Learning (OVRL). We conduct large-scale experiments - on 3 different 3D datasets (Gibson, HM3D, MP3D), 2 tasks (ImageNav, ObjectNav), and 2 policy learning algorithms (RL, IL) - and find that the OVRL representations lead to significant across-the-board improvements in state of art, on ImageNav from 29.2% to 54.2% (+25% absolute, 86% relative) and on ObjectNav from 18.1% to 23.2% (+5.1% absolute, 28% relative). Importantly, both results were achieved by the same visual encoder generalizing to datasets that were not seen during pretraining. While the benefits of pretraining sometimes diminish (or entirely disappear) with long finetuning schedules, we find that OVRL's performance gains continue to increase (not decrease) as the agent is trained for 2 billion frames of experience.
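The absolute and relative gains quoted in the abstract follow directly from the before and after success rates. A minimal sketch of that arithmetic is below; the helper name `gains` is hypothetical and used only for illustration.

```python
from typing import Tuple

def gains(before: float, after: float) -> Tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = after - before
    relative = 100.0 * absolute / before
    return absolute, relative

# Success rates (%) quoted in the abstract.
imagenav = gains(29.2, 54.2)
objectnav = gains(18.1, 23.2)

print(f"ImageNav:  +{imagenav[0]:.1f} points, {imagenav[1]:.0f}% relative")
print(f"ObjectNav: +{objectnav[0]:.1f} points, {objectnav[1]:.0f}% relative")
```

After rounding, this reproduces the abstract's figures: +25 points (86% relative) for ImageNav and +5.1 points (28% relative) for ObjectNav.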

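Motivation and Objectives

Motivate the need for better visual representations in embodied navigation, beyond tabula rasa (from-scratch) training.
Propose a two-stage strategy that combines offline SSL pretraining with online finetuning for visuomotor tasks.
Demonstrate the cross-dataset generalization and scalability of the pretrained representations across ImageNav and ObjectNav.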

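Proposed Method

Pretrain the visual encoder offline with DINO (self-supervised learning) on Omnidata, a large-scale dataset of pre-rendered indoor images.
Use a modified ResNet50 backbone with GroupNorm and reduced base layers for stable training of the SSL objective and projection head.
Finetune downstream on ImageNav and ObjectNav with image augmentations and task-specific architectures (DD-PPO for ImageNav; imitation learning for ObjectNav).
Explore data augmentations during finetuning (color jitter, translation, and others) to improve generalization and temporal consistency.
Evaluate across the Gibson, HM3D, and MP3D datasets and several camera configurations (1 RGB, 4 RGB, RGBD) to show that the encoder generalizes.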
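One way to read "translation augmentation with temporal consistency" is to sample a single random shift per episode and apply it to every frame, so the augmentation does not change from step to step. The sketch below illustrates that idea on tiny grayscale frames stored as nested lists. The function names, the zero padding, and the per-episode sampling are all assumptions made for illustration, not details confirmed by the paper.

```python
import random
from typing import List

Frame = List[List[int]]  # rows of pixel intensities (hypothetical toy format)

def shift_frame(frame: Frame, dx: int, dy: int, fill: int = 0) -> Frame:
    """Translate a frame by (dx, dy) pixels, padding exposed areas with `fill`."""
    h, w = len(frame), len(frame[0])
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            src_y, src_x = y - dy, x - dx
            if 0 <= src_y < h and 0 <= src_x < w:
                out[y][x] = frame[src_y][src_x]
    return out

def augment_episode(frames: List[Frame], max_shift: int,
                    rng: random.Random) -> List[Frame]:
    """Sample one shift per episode and apply it to all frames (assumed scheme)."""
    dx = rng.randint(-max_shift, max_shift)
    dy = rng.randint(-max_shift, max_shift)
    return [shift_frame(f, dx, dy) for f in frames]

frame = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
print(shift_frame(frame, dx=1, dy=0))  # content moves one pixel to the right

episode = augment_episode([frame, frame], max_shift=1, rng=random.Random(0))
print(episode[0] == episode[1])  # identical frames stay identical after the shared shift
```

A real pipeline would more likely operate on image tensors with a library transform; the nested-list version only keeps the example dependency-free.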

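Experimental Results

Research Questions

RQ1: Can offline SSL pretraining on a large-scale IID image corpus produce visuomotor representations that generalize to unseen environments and datasets?
RQ2: Do image augmentations and finetuning strategies significantly affect downstream embodied navigation performance?
RQ3: How do different SSL algorithms and model sizes affect ImageNav and ObjectNav performance when used as pretrained encoders?
RQ4: What are the limits of pretrained representations when the training schedule is extended to billions of steps?
RQ5: Is pretraining on a diverse indoor scene dataset (OSD) better for embodied tasks than conventional supervised pretraining (e.g., ImageNet)?

Key Findings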

Method	Pretraining Dataset	Test Split	Camera(s)	SPL (↑)	SR (↑)
Scratch	-	A	1 RGB	9.3 ± 1.1%	17.9 ± 2.0%
ZER (ResNet9) [2]	-	A	1 RGB	21.6%	29.2%
ZER (ResNet50) ∗	-	A	1 RGB	18.8 ± 2.3%	27.7 ± 1.7%
CRL [13]	MP3D	PointNav	1 RGB	3.2%	5.8%
CRL ∗	Gibson	A	1 RGB	10.2 ± 1.6%	20.4 ± 2.8%
OVRL (Ours)	OSD	A	1 RGB	26.9 ± 0.9%	41.3 ± 1.0%
OVRL+ZER-Reward (Ours)	OSD	A	1 RGB	27.0 ± 2.5%	54.2 ± 1.4%
Mem-Aug RL [30]	✗	A	4 RGB	56.0%	69.0%
OVRL (Ours)	OSD	A	4 RGB	62.5 ± 1.3%	79.8 ± 0.7%
NRNS [19]	✗	B	1 RGBD	12.4%	24.0%
OVRL (Ours)	OSD	B	1 RGB	28.4 ± 1.7%	45.5 ± 2.7%

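OVRL improves single-RGB ImageNav performance from 29.2% to 54.2% SR (+25% absolute, +86% relative); the 54.2% figure corresponds to the OVRL+ZER-Reward row in the table.
OVRL improves RGBD ObjectNav performance from 18.1% to 23.2% SR (+5.1% absolute, +28% relative).
The same pretrained encoder generalizes to unseen datasets, outperforming the IL baseline even though MP3D was not seen during pretraining.
The benefits of pretraining persist and grow even under very long finetuning (2 billion frames), challenging the notion that pretraining gains fade with long training.
Image augmentation during finetuning substantially improves performance when the encoder is finetuned, but its effect diminishes when the encoder is frozen.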
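The results above are reported as SR (success rate) and SPL (Success weighted by Path Length). As a reminder of how these metrics are usually defined in embodied navigation, SPL averages, over episodes, the success flag multiplied by the ratio of the shortest-path length to the longer of the agent's path and the shortest path. The sketch below assumes that standard definition; the episode values are invented for illustration and are not from the paper.

```python
from typing import List, Tuple

# Each episode: (success flag, shortest-path length, agent path length).
Episode = Tuple[bool, float, float]

def success_rate(episodes: List[Episode]) -> float:
    """Fraction of episodes in which the agent reached the goal."""
    return sum(1.0 for success, _, _ in episodes if success) / len(episodes)

def spl(episodes: List[Episode]) -> float:
    """Success weighted by Path Length; failed episodes contribute zero."""
    total = 0.0
    for success, shortest, taken in episodes:
        if success:
            total += shortest / max(taken, shortest)
    return total / len(episodes)

# Hypothetical episodes, made up for this example.
episodes = [
    (True, 5.0, 5.0),    # success along the optimal path
    (True, 5.0, 10.0),   # success, but the path is twice as long as needed
    (False, 5.0, 3.0),   # failure
    (False, 4.0, 12.0),  # failure
]
print(success_rate(episodes))  # 0.5
print(spl(episodes))           # 0.375
```

Because SPL penalizes inefficient paths, it can never exceed SR, which is consistent with every row of the table.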


This review was generated by AI and checked by a human editor.