QUICK REVIEW

[Paper Review] ZSON: Zero-Shot Object-Goal Navigation using Multimodal Goal Embeddings

Arjun Majumdar, Gunjan Aggarwal|arXiv (Cornell University)|Jun 24, 2022

Multimodal Machine Learning Applications | Cited by 41

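ONE-SENTENCE SUMMARY

This paper presents ZSON, a zero-shot, open-world ObjectNav method that learns semantic-goal navigation by embedding image goals and language goals in a shared CLIP-based space; agents are trained on ImageNav and evaluated on Gibson, HM3D, and MP3D.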

ABSTRACT

We present a scalable approach for learning open-world object-goal navigation (ObjectNav) -- the task of asking a virtual robot (agent) to find any instance of an object in an unexplored environment (e.g., "find a sink"). Our approach is entirely zero-shot -- i.e., it does not require ObjectNav rewards or demonstrations of any kind. Instead, we train on the image-goal navigation (ImageNav) task, in which agents find the location where a picture (i.e., goal image) was captured. Specifically, we encode goal images into a multimodal, semantic embedding space to enable training semantic-goal navigation (SemanticNav) agents at scale in unannotated 3D environments (e.g., HM3D). After training, SemanticNav agents can be instructed to find objects described in free-form natural language (e.g., "sink", "bathroom sink", etc.) by projecting language goals into the same multimodal, semantic embedding space. As a result, our approach enables open-world ObjectNav. We extensively evaluate our agents on three ObjectNav datasets (Gibson, HM3D, and MP3D) and observe absolute improvements in success of 4.2% - 20.0% over existing zero-shot methods. For reference, these gains are similar or better than the 5% improvement in success between the Habitat 2020 and 2021 ObjectNav challenge winners. In an open-world setting, we discover that our agents can generalize to compound instructions with a room explicitly mentioned (e.g., "Find a kitchen sink") and when the target room can be inferred (e.g., "Find a sink and a stove").

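MOTIVATION AND OBJECTIVES

Address open-world ObjectNav without ObjectNav rewards or demonstrations.
Use multimodal CLIP embeddings to unify image goals and text goals.
Train scalable SemanticNav agents in unannotated 3D environments via image-goal navigation.
Demonstrate zero-shot transfer to object goals described in language (e.g., "sink").
Analyze the factors that affect zero-shot performance and generalization.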

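PROPOSED METHOD

Project image goals and text goals into a common semantic embedding space using CLIP (CLIP_v for images, CLIP_t for text).
Train a SemanticNav agent on the ImageNav task in unannotated HM3D environments, using a ResNet-50 visual encoder and an LSTM-based policy, optimized with DD-PPO and a reward that encourages reaching the goal location and facing the goal orientation.
At deployment, encode the language goal with CLIP_t, which maps it into the same semantic space as the image-goal embeddings, and evaluate on ObjectNav.
Apply data augmentation (color jitter, random translation) during reinforcement learning training.
Study how visual encoder pretraining (OVRL) and training-environment diversity affect zero-shot ObjectNav performance.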
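
The goal-swapping idea above can be sketched in a few lines: the policy only ever sees a goal embedding, so an image-derived embedding (training) and a text-derived embedding (deployment) are interchangeable as long as both live in the same space. This is a minimal illustration, not the authors' implementation. The encoders here are toy stand-ins for the frozen CLIP_v and CLIP_t models, and every name (EMBED_DIM, encode_image_goal, encode_text_goal, policy_step) is hypothetical.

```python
import numpy as np

EMBED_DIM = 8  # hypothetical size; real CLIP embeddings are much larger
ACTIONS = ("stop", "move_forward", "turn_left", "turn_right")

rng = np.random.default_rng(0)
# Toy stand-in weights. In the paper's setup these roles are played by
# frozen CLIP encoders and a trained recurrent policy.
_W_IMAGE = rng.normal(size=(EMBED_DIM, 3 * 4 * 4))
_W_POLICY = rng.normal(size=(len(ACTIONS), 2 * EMBED_DIM))


def l2_normalize(v):
    """Scale a vector to unit length so both goal types share one scale."""
    return v / np.linalg.norm(v)


def encode_image_goal(image):
    """Stand-in for CLIP_v: maps a (3, 4, 4) toy image to a unit vector."""
    return l2_normalize(_W_IMAGE @ image.reshape(-1))


def encode_text_goal(text):
    """Stand-in for CLIP_t: deterministic toy embedding of a string."""
    seed = sum(text.encode("utf-8"))
    return l2_normalize(np.random.default_rng(seed).normal(size=EMBED_DIM))


def policy_step(obs_embedding, goal_embedding):
    """Toy policy: it cannot tell which encoder produced the goal."""
    logits = _W_POLICY @ np.concatenate([obs_embedding, goal_embedding])
    return ACTIONS[int(np.argmax(logits))]


obs = l2_normalize(rng.normal(size=EMBED_DIM))
image_goal = encode_image_goal(rng.random((3, 4, 4)))  # training: ImageNav
text_goal = encode_text_goal("bathroom sink")          # deployment: ObjectNav

train_action = policy_step(obs, image_goal)
test_action = policy_step(obs, text_goal)
```

The toy encoders are not aligned with each other, so the actions carry no meaning; the sketch only shows that the policy interface stays unchanged when the goal modality changes, which is what makes the transfer zero-shot.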

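EXPERIMENTAL RESULTS

Research Questions

RQ1: Can open-world ObjectNav be achieved in a zero-shot setting by learning a semantic-goal navigation policy from image goals?
RQ2: Does CLIP-based alignment enable effective transfer from image goals to object goals described in language?
RQ3: How do visual encoder pretraining and the diversity and number of training environments affect zero-shot ObjectNav performance?
RQ4: Can the agent exhibit room-aware navigation when given compound or room-specific instructions?
RQ5: What are the limits and biases of zero-shot SemanticNav across diverse indoor environments?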

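Key Findings

Zero-shot ObjectNav results, reported as success rate (SR) and success weighted by path length (SPL): Gibson SR 31.3%; HM3D SR 25.5% (SPL 12.6%); MP3D SR 15.3%.
ImageNav pretraining improves zero-shot ObjectNav SR by roughly 9.4%–10.4%, and broader pretraining combined with more environments yields substantial gains (e.g., HM3D SR 25.5%, MP3D SR 15.3%).
Compared with existing zero-shot methods, ZSON raises Gibson ImageNav SR to 36.9% (from 29.2%) and ObjectNav SR to 31.3% (from 11.3%).
On HM3D, zero-shot SPL is on par with the state-of-the-art supervised method (OVRL), despite using no ObjectNav training data.
Qualitative results show room awareness: instructed to find a "bathroom sink", the agent navigates to it while avoiding the kitchen, and it also shows room-inference ability for compound goals.
Training on 800 HM3D environments yields a 6.6% absolute gain in zero-shot ObjectNav SR over Gibson-only training, despite a slight drop in ImageNav SR.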
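
For readers unfamiliar with the two metrics quoted above, the sketch below computes SR and SPL using the standard definition of SPL (each successful episode contributes the ratio of shortest-path length to the longer of agent path and shortest path; failures contribute zero). The Episode class and the four toy episodes are hypothetical and are not data from the paper. The last line reproduces the arithmetic behind the 20.0-point Gibson ObjectNav gain reported above.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    success: bool
    shortest_path_length: float  # geodesic distance from start to goal
    agent_path_length: float     # distance the agent actually traveled


def success_rate(episodes):
    """Fraction of episodes in which the agent succeeded."""
    return sum(e.success for e in episodes) / len(episodes)


def spl(episodes):
    """Success weighted by path length; assumes positive path lengths."""
    total = 0.0
    for e in episodes:
        if e.success:
            longest = max(e.agent_path_length, e.shortest_path_length)
            total += e.shortest_path_length / longest
    return total / len(episodes)


# Toy episodes for illustration only (not results from the paper).
toy_episodes = [
    Episode(success=True, shortest_path_length=5.0, agent_path_length=10.0),
    Episode(success=True, shortest_path_length=4.0, agent_path_length=4.0),
    Episode(success=False, shortest_path_length=6.0, agent_path_length=9.0),
    Episode(success=False, shortest_path_length=3.0, agent_path_length=12.0),
]

toy_sr = success_rate(toy_episodes)   # 2 of 4 episodes succeed
toy_spl = spl(toy_episodes)           # (0.5 + 1.0 + 0 + 0) / 4

# Absolute gain in Gibson ObjectNav SR, in percentage points (31.3 vs 11.3).
gibson_objectnav_gain = round(31.3 - 11.3, 1)
```

Because SPL penalizes detours, it can never exceed SR, which is why the HM3D SPL of 12.6% sits below the corresponding SR of 25.5%.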


This review was generated by AI and reviewed by human editors.