QUICK REVIEW

[Paper Review] VTNet: Visual Transformer Network for Object Goal Navigation

Heming Du, Xin Yu | arXiv (Cornell University) | May 20, 2021

Multimodal Machine Learning Applications | References: 35 | Cited by: 36

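One-Sentence Summary

VTNet introduces a visual transformer that fuses spatial-enhanced local object descriptors with positional global region descriptors to learn spatially aware visual representations; it is pre-trained to align vision with navigation actions and then used in an end-to-end navigation policy that outperforms prior methods in unseen AI2-Thor environments.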

ABSTRACT

Object goal navigation aims to steer an agent towards a target object based on observations of the agent. It is of pivotal importance to design effective visual representations of the observed scene in determining navigation actions. In this paper, we introduce a Visual Transformer Network (VTNet) for learning informative visual representation in navigation. VTNet is a highly effective structure that embodies two key properties for visual representations: First, the relationships among all the object instances in a scene are exploited; Second, the spatial locations of objects and image regions are emphasized so that directional navigation signals can be learned. Furthermore, we also develop a pre-training scheme to associate the visual representations with navigation signals, and thus facilitate navigation policy learning. In a nutshell, VTNet embeds object and region features with their location cues as spatial-aware descriptors and then incorporates all the encoded descriptors through attention operations to achieve informative representation for navigation. Given such visual representations, agents are able to explore the correlations between visual observations and navigation actions. For example, an agent would prioritize "turning right" over "turning left" when the visual representation emphasizes on the right side of activation map. Experiments in the artificial environment AI2-Thor demonstrate that VTNet significantly outperforms state-of-the-art methods in unseen testing environments.

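Research Motivation and Objectives

Provide informative visual representations for object goal navigation, so that observations guide actions towards the target object.
Develop a Visual Transformer (VT) that exploits the relationships among detected objects and spatial regions to produce navigation-relevant features.
Introduce spatial-aware descriptors (spatial-enhanced local descriptors and positional global descriptors) to enable effective attention-based fusion.
Pre-train the VT to associate visual representations with directional navigation signals, which eases subsequent policy learning.
Demonstrate end-to-end training of VTNet and show improvements over state-of-the-art baselines in unseen environments.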

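Proposed Method

Use DETR to detect and encode all object instances in the scene, preserving the relationships among instances.
Form the spatial-enhanced local descriptors, which serve as the keys of the VT encoder, by concatenating the normalized bounding box, confidence, and semantic label of each detection and fusing them with a target indicator through an MLP.
Form the positional global descriptors, which serve as the queries of the VT decoder, by extracting global image features, reducing the number of channels, and adding region-level positional embeddings.
Use the visual transformer to let the positional global descriptors (queries) attend to the spatial-enhanced local descriptors (keys/values), producing the final visual representation for navigation.
Pre-train the VT by imitation learning to predict the optimal navigation action (instructions generated with Dijkstra's algorithm), which gives reinforcement-learning-based policy training a good initialization.
Train the navigation policy with A3C on top of the VT-derived representation, enabling end-to-end learning after pre-training.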
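
The descriptor construction and the query/key attention step described above can be illustrated with a small, dependency-free sketch. It is only a sketch under stated assumptions, not the authors' implementation: the helper names (`local_descriptor`, `cross_attention`), the 3-class label space, and the hand-set query vector are hypothetical, and the learned parts of the real model (DETR features, the fusing MLP, linear projections, multi-head attention) are omitted.

```python
import math

def local_descriptor(bbox, confidence, label_id, num_classes, is_target):
    """Concatenate a normalized box, confidence, one-hot label, and target flag.

    Assumption: the real model fuses these cues with DETR features through
    an MLP; here the raw concatenation is used directly.
    """
    one_hot = [1.0 if i == label_id else 0.0 for i in range(num_classes)]
    return list(bbox) + [confidence] + one_hot + [1.0 if is_target else 0.0]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention without learned projections.

    Each query (a global region descriptor) attends over all keys
    (local object descriptors) and returns a weighted sum of the values.
    """
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        out = [sum(wj * v[c] for wj, v in zip(w, values)) for c in range(len(values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights

# Two hypothetical detections: the target on the right, a distractor on the left.
detections = [
    local_descriptor((0.7, 0.4, 0.9, 0.8), 0.9, label_id=1, num_classes=3, is_target=True),
    local_descriptor((0.1, 0.2, 0.3, 0.5), 0.8, label_id=2, num_classes=3, is_target=False),
]
# A hand-set query that responds only to the target flag (last dimension).
region_queries = [[0.0] * 8 + [5.0]]
fused, attn = cross_attention(region_queries, detections, detections)
```

In this toy setup the query places more attention weight on the detection flagged as the target, so the fused vector is dominated by the target's box coordinates. That is the intuition behind a representation that can carry a directional cue such as "the target is on the right".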

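Experimental Results

Research Questions

RQ1: Does a visual transformer that reasons over all detected object instances and their spatial regions provide a more informative scene representation for object goal navigation?
RQ2: Does incorporating spatial-enhanced local descriptors and positional global descriptors improve directional signals and navigation efficiency?
RQ3: Does a pre-training scheme that aligns visual representations with navigation actions help learn a better navigation policy in unseen environments?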

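Key Findings

VTNet achieves a higher success rate and SPL on unseen AI2-Thor scenes than competing baselines and previous state-of-the-art methods.
DETR-based object features improve performance over Faster R-CNN features, highlighting the advantage of transformer-based object representations with global context.
Ablation studies show that the VT decoder, the global features, and the positional embeddings are all necessary for effective navigation.
The pre-training scheme is essential; without it, the VT fails to converge to a useful navigation policy.
VTNet and VTNet+TPN outperform competing methods such as SP and SAVN, demonstrating the effectiveness of visual-transformer-based representations for navigation.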
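
The findings above are reported in terms of success rate and SPL, which this review does not define. As a minimal sketch, assuming SPL refers to the standard "Success weighted by Path Length" metric from the embodied navigation literature, the two quantities could be computed as follows. The function names and the episode values are hypothetical and are not results from the paper.

```python
def success_rate(successes):
    """Fraction of episodes in which the agent reached the target."""
    return sum(1 for s in successes if s) / len(successes)

def spl(successes, shortest_lengths, path_lengths):
    """Success weighted by Path Length.

    For each episode i: S_i * L_i / max(P_i, L_i), averaged over episodes,
    where S_i is the success flag, L_i the shortest-path length, and
    P_i the length of the path the agent actually took.
    """
    total = 0.0
    for s, shortest, taken in zip(successes, shortest_lengths, path_lengths):
        denom = max(taken, shortest)
        if s and denom > 0:
            total += shortest / denom
    return total / len(successes)

# Three illustrative episodes: an inefficient success, a failure, an optimal success.
episode_success = [True, False, True]
episode_shortest = [2.0, 3.0, 4.0]
episode_taken = [4.0, 3.0, 4.0]
sr = success_rate(episode_success)
score = spl(episode_success, episode_shortest, episode_taken)
```

SPL penalizes successful but roundabout trajectories, so a method can raise its success rate without raising SPL by the same margin.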

This review was generated by AI and reviewed by a human editor.