QUICK REVIEW

[Paper Review] Offline Visual Representation Learning for Embodied Navigation

Karmesh Yadav, Ram Ramrakhya|arXiv (Cornell University)|Apr 27, 2022

Multimodal Machine Learning Applications | Cited by 24

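One-sentence summary

OVRL pretrains visual representations offline with self-supervised learning on large-scale indoor images, then finetunes the visuomotor representations online for ImageNav and ObjectNav, achieving state-of-the-art results.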

ABSTRACT

How should we learn visual representations for embodied agents that must see and move? The status quo is tabula rasa in vivo, i.e. learning visual representations from scratch while also learning to move, potentially augmented with auxiliary tasks (e.g. predicting the action taken between two successive observations). In this paper, we show that an alternative 2-stage strategy is far more effective: (1) offline pretraining of visual representations with self-supervised learning (SSL) using large-scale pre-rendered images of indoor environments (Omnidata), and (2) online finetuning of visuomotor representations on specific tasks with image augmentations under long learning schedules. We call this method Offline Visual Representation Learning (OVRL). We conduct large-scale experiments - on 3 different 3D datasets (Gibson, HM3D, MP3D), 2 tasks (ImageNav, ObjectNav), and 2 policy learning algorithms (RL, IL) - and find that the OVRL representations lead to significant across-the-board improvements in state of art, on ImageNav from 29.2% to 54.2% (+25% absolute, 86% relative) and on ObjectNav from 18.1% to 23.2% (+5.1% absolute, 28% relative). Importantly, both results were achieved by the same visual encoder generalizing to datasets that were not seen during pretraining. While the benefits of pretraining sometimes diminish (or entirely disappear) with long finetuning schedules, we find that OVRL's performance gains continue to increase (not decrease) as the agent is trained for 2 billion frames of experience.
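The headline numbers in the abstract mix absolute and relative gains. As a quick check, the arithmetic can be reproduced from the four success rates quoted above. The helper name `gains` is illustrative and does not come from the paper.

```python
def gains(baseline, improved):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = improved - baseline
    relative = 100.0 * absolute / baseline
    return absolute, relative


# Success rates (%) quoted in the abstract: previous state of the art vs. OVRL.
imagenav_abs, imagenav_rel = gains(29.2, 54.2)
objectnav_abs, objectnav_rel = gains(18.1, 23.2)

print(f"ImageNav:  +{imagenav_abs:.1f} points, {imagenav_rel:.0f}% relative")
print(f"ObjectNav: +{objectnav_abs:.1f} points, {objectnav_rel:.0f}% relative")
```

The relative figure divides the absolute gain by the baseline, which is why a 25-point gain on a 29.2% baseline reads as a much larger relative improvement than a 5.1-point gain on an 18.1% baseline.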

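Motivation and goals

Make the case that embodied navigation needs better visual representations than those learned from scratch.
Propose a two-stage strategy that combines offline SSL pretraining with online finetuning for visuomotor tasks.
Demonstrate that the pretrained representations generalize across datasets and scale across ImageNav and ObjectNav.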

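Proposed method

Pretrain the visual encoder offline with DINO (a self-supervised learning method) on Omnidata, a large-scale dataset of pre-rendered indoor images.
Use a modified ResNet50 backbone with GroupNorm and a reduced number of baseplanes, so that SSL and projection-head training remain stable.
Finetune downstream on ImageNav and ObjectNav with image augmentations and task-specific architectures (DD-PPO for ImageNav; an imitation-learning architecture for ObjectNav).
Explore data augmentations during finetuning (color jitter, translation, etc.) to improve generalization and temporal consistency.
Evaluate on the Gibson, HM3D, and MP3D datasets and with several camera configurations (1 RGB, 4 RGB, RGBD) to demonstrate the encoder's generalization.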
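For readers unfamiliar with DINO, the self-distillation objective behind the pretraining step can be sketched as follows. This is a minimal pure-Python illustration for a single pair of views, not the paper's training code. The function names are hypothetical, and the temperatures and momentum values are assumed from commonly cited DINO defaults; the settings OVRL actually used are not given in this review.

```python
import math


def log_softmax(logits, temperature):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_norm for x in scaled]


def dino_loss(student_logits, teacher_logits, center,
              student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy from the centered, sharpened teacher output to the student output."""
    centered = [t - c for t, c in zip(teacher_logits, center)]
    teacher_probs = [math.exp(v) for v in log_softmax(centered, teacher_temp)]
    student_log_probs = log_softmax(student_logits, student_temp)
    return -sum(p * lq for p, lq in zip(teacher_probs, student_log_probs))


def ema_update(teacher_params, student_params, momentum=0.996):
    """Teacher weights follow the student via an exponential moving average."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]


def update_center(center, teacher_logits, momentum=0.9):
    """Running mean of teacher outputs (a batch mean in practice), used to avoid collapse."""
    return [momentum * c + (1.0 - momentum) * t
            for c, t in zip(center, teacher_logits)]


# Toy example: the loss is small when the student agrees with the teacher.
center = [0.0, 0.0, 0.0]
teacher_out = [2.0, 0.5, -1.0]
agree = dino_loss([2.0, 0.5, -1.0], teacher_out, center)
disagree = dino_loss([-1.0, 0.5, 2.0], teacher_out, center)
print(agree < disagree)
```

Only the student receives gradients in DINO; the teacher is updated through `ema_update`, which is why no labels are needed for the pretraining stage.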

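Experimental results

Research questions

RQ1: Does offline SSL pretraining on a large IID image corpus produce visuomotor representations that generalize to unseen environments and datasets?
RQ2: Do image augmentations and the finetuning strategy significantly affect downstream embodied navigation performance?
RQ3: How do different SSL algorithms and model sizes, when used as the pretrained encoder, affect ImageNav and ObjectNav performance?
RQ4: Where do the limits of pretrained representations lie when the training schedule is extended to billions of steps?
RQ5: Does pretraining on a diverse indoor scene dataset (OSD) beat conventional supervised pretraining (e.g. ImageNet) for embodied tasks?

Key findings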

Method	Pretraining Dataset	Test Split	Camera(s)	SPL (↑)	SR (↑)
Scratch	-	A	1 RGB	9.3 ± 1.1%	17.9 ± 2.0%
ZER (ResNet9) [2]	-	A	1 RGB	21.6%	29.2%
ZER (ResNet50) ∗	-	A	1 RGB	18.8 ± 2.3%	27.7 ± 1.7%
CRL [13]	MP3D	PointNav	1 RGB	3.2%	5.8%
CRL ∗	Gibson	A	1 RGB	10.2 ± 1.6%	20.4 ± 2.8%
OVRL (Ours)	OSD	A	1 RGB	26.9 ± 0.9%	41.3 ± 1.0%
OVRL+ZER-Reward (Ours)	OSD	A	1 RGB	27.0 ± 2.5%	54.2 ± 1.4%
Mem-Aug RL [30]	✗	A	4 RGB	56.0%	69.0%
OVRL (Ours)	OSD	A	4 RGB	62.5 ± 1.3%	79.8 ± 0.7%
NRNS [19]	✗	B	1 RGBD	12.4%	24.0%
OVRL (Ours)	OSD	B	1 RGB	28.4 ± 1.7%	45.5 ± 2.7%
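The two metric columns in the table, SPL and success rate (SR), can be computed from per-episode outcomes. The sketch below uses the standard definition of SPL, in which each successful episode is weighted by the ratio of the shortest-path length to the length of the path the agent actually took. The `Episode` record and the sample values are made up for illustration and are not results from the paper; path lengths are assumed to be positive.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    success: bool
    shortest_path: float  # geodesic distance from start to goal
    agent_path: float     # length of the path the agent actually travelled


def success_rate(episodes):
    """Fraction of episodes in which the agent reached the goal."""
    return sum(e.success for e in episodes) / len(episodes)


def spl(episodes):
    """Success weighted by (normalized inverse) path length."""
    total = 0.0
    for e in episodes:
        if e.success:
            total += e.shortest_path / max(e.agent_path, e.shortest_path)
    return total / len(episodes)


# Hypothetical episodes: one optimal success, one inefficient success, two failures.
episodes = [
    Episode(success=True, shortest_path=5.0, agent_path=5.0),
    Episode(success=True, shortest_path=4.0, agent_path=8.0),
    Episode(success=False, shortest_path=6.0, agent_path=9.0),
    Episode(success=False, shortest_path=3.0, agent_path=1.0),
]
print(success_rate(episodes), spl(episodes))
```

Because failed episodes contribute zero and inefficient successes contribute less than one, SPL can never exceed SR, which matches the pattern in every row of the table.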

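OVRL raises ImageNav success rate (SR) with a single RGB camera from 29.2% to 54.2% (+25% absolute, +86% relative).
OVRL raises ObjectNav success rate (SR) with RGBD input from 18.1% to 23.2% (+5.1% absolute, +28% relative).
The same pretrained encoder generalizes to unseen datasets: it outperforms the IL baseline on MP3D even though MP3D was not seen during pretraining.
The benefit of pretraining persists and keeps growing under very long finetuning schedules (2 billion frames), challenging the view that pretraining gains fade with long training.
Image augmentation during finetuning significantly improves performance when the encoder is finetuned; with a frozen encoder, the benefit of augmentation shrinks.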
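To make the translation augmentation concrete, here is a minimal sketch of a random shift on a single-channel image stored as nested lists. It is one common way to implement translation (equivalent to replicate-padding followed by a random crop) and is an assumption here, not a confirmed description of the paper's implementation; `random_shift` and its parameters are hypothetical names.

```python
import random


def random_shift(image, pad, rng):
    """Shift a 2D image by up to `pad` pixels along each axis, replicating edge pixels."""
    height, width = len(image), len(image[0])
    dy = rng.randint(-pad, pad)  # vertical offset, inclusive bounds
    dx = rng.randint(-pad, pad)  # horizontal offset, inclusive bounds

    def clamp(value, upper):
        return max(0, min(upper, value))

    return [
        [image[clamp(y + dy, height - 1)][clamp(x + dx, width - 1)]
         for x in range(width)]
        for y in range(height)
    ]


rng = random.Random(0)  # seeded so the sketch is reproducible
frame = [[10 * row + col for col in range(4)] for row in range(4)]
shifted = random_shift(frame, pad=1, rng=rng)
unchanged = random_shift(frame, pad=0, rng=rng)
```

The output keeps the input resolution, so it can be fed to the encoder without other changes. Sampling `dy` and `dx` once per episode rather than once per frame would be one way to keep the augmentation consistent over time.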

This review was generated by AI and reviewed by a human editor.