QUICK REVIEW

[Paper Review] ViNT: A Foundation Model for Visual Navigation

Dhruv Shah, Ajay Sridhar|arXiv (Cornell University)|Jun 26, 2023

Multimodal Machine Learning Applications | Cited by 14

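One-Sentence Summary

ViNT is a Transformer-based foundation model for visual navigation, trained on diverse real-world datasets, that generalizes zero-shot across robots and environments; it can be guided by diffusion-based subgoal proposals and fine-tuned for new task modalities.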

ABSTRACT

General-purpose pre-trained models ("foundation models") have enabled practitioners to produce generalizable solutions for individual machine learning problems with datasets that are significantly smaller than those required for learning from scratch. Such models are typically trained on large and diverse datasets with weak supervision, consuming much more training data than is available for any individual downstream application. In this paper, we describe the Visual Navigation Transformer (ViNT), a foundation model that aims to bring the success of general-purpose pre-trained models to vision-based robotic navigation. ViNT is trained with a general goal-reaching objective that can be used with any navigation dataset, and employs a flexible Transformer-based architecture to learn navigational affordances and enable efficient adaptation to a variety of downstream navigational tasks. ViNT is trained on a number of existing navigation datasets, comprising hundreds of hours of robotic navigation from a variety of different robotic platforms, and exhibits positive transfer, outperforming specialist models trained on singular datasets. ViNT can be augmented with diffusion-based subgoal proposals to explore novel environments, and can solve kilometer-scale navigation problems when equipped with long-range heuristics. ViNT can also be adapted to novel task specifications with a technique inspired by prompt-tuning, where the goal encoder is replaced by an encoding of another task modality (e.g., GPS waypoints or routing commands) embedded into the same space of goal tokens. This flexibility and ability to accommodate a variety of downstream problem domains establishes ViNT as an effective foundation model for mobile robotics. For videos, code, and model checkpoints, see our project page at https://visualnav-transformer.github.io.

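Research Motivation and Objectives

Create a general-purpose, pre-trained visual navigation policy that transfers across robot embodiments and environments without task-specific training.
Learn to navigate from egocentric visual observations by reaching image-specified subgoals.
Enable zero-shot deployment and efficient fine-tuning for downstream navigation modalities (e.g., GPS, routing commands).
Leverage large-scale, heterogeneous real-world datasets to induce broad navigation priors and emergent behaviors.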

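Proposed Method

Uses a 31M-parameter Transformer-based architecture that tokenizes past observations and the goal image, with a dedicated goal-fusion encoder to obtain a relative goal representation.
Trained end-to-end with a maximum-likelihood objective to predict a sequence of future actions and the dynamical distance to the goal.
Adopts an embodiment-agnostic action space of relative waypoints normalized by the robot's top speed, executed with a PD controller.
Grounds diffusion-based subgoal proposals by using ViNT to compute temporal distances and actions for them, giving long-horizon exploration spatially grounded subgoals.
Integrates a topological graph planner as episodic memory to support long-horizon planning and exploration in unseen environments.
Demonstrates adaptation to new goal modalities through a lightweight, prompt-tuning-style mechanism that maps a new task modality into ViNT's goal-token space; optionally, the whole model is fine-tuned with a small amount of task-specific data.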
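
The embodiment-agnostic action space listed above can be illustrated with a minimal sketch. It assumes that normalization means dividing relative waypoints by the distance the robot covers at top speed in one time step, and that the PD controller acts on the heading error toward the next waypoint. The function names, the time step `dt`, and the gains `kp` and `kd` are hypothetical and are not taken from the ViNT codebase.

```python
import math


def normalize_waypoints(waypoints, top_speed, dt=1.0):
    """Scale relative (x, y) waypoints in metres to unitless values.

    Assumption: the scale is top_speed * dt, i.e. the distance the
    robot covers in one time step at full speed.
    """
    scale = top_speed * dt
    return [(x / scale, y / scale) for x, y in waypoints]


def unnormalize_waypoints(waypoints, top_speed, dt=1.0):
    """Map unitless waypoints back to metres for a specific robot."""
    scale = top_speed * dt
    return [(x * scale, y * scale) for x, y in waypoints]


def pd_heading_command(waypoint, prev_error, kp=1.0, kd=0.1, dt=1.0):
    """Return (angular_velocity, heading_error) toward a waypoint.

    The waypoint is expressed in the robot frame (x forward, y left),
    so the heading error is simply the bearing of the waypoint.
    """
    error = math.atan2(waypoint[1], waypoint[0])
    angular_velocity = kp * error + kd * (error - prev_error) / dt
    return angular_velocity, error


# The same normalized prediction drives two robots with different speeds.
predicted = [(0.5, 0.0), (1.0, 0.2)]
slow_robot = unnormalize_waypoints(predicted, top_speed=0.5)
fast_robot = unnormalize_waypoints(predicted, top_speed=2.0)
command, error = pd_heading_command(fast_robot[-1], prev_error=0.0)
```

The point of the design is that the policy output stays the same across platforms; only the un-normalization step and the low-level controller depend on the robot.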

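Experimental Results

Research Questions

RQ1: Can ViNT generalize zero-shot to new robots and environments for visual navigation?
RQ2: How well does ViNT work when integrated with diffusion-based subgoal proposals and topological planning for long-horizon exploration?
RQ3: How efficiently can ViNT be fine-tuned or adapted to new task modalities (e.g., GPS waypoints, routing commands) with limited data?
RQ4: Does ViNT exhibit robust emergent navigation behaviors and transfer navigation priors to unseen tasks?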

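Key Findings

ViNT achieves strong zero-shot generalization across a variety of robots and environments, including a Go 1 quadruped not seen during training.
Combined with diffusion-based subgoal proposals and a topological planner, ViNT outperforms baselines on indoor and outdoor goal-reaching tasks (Table 1).
In the indoor GPS and outdoor satellite settings, ViNT attains high success rates (0.90 indoors, 0.95–1.00 outdoors) and favorable path quality (e.g., 91 m indoors; 1270 m outdoors with SPL 0.84; 1040 m outdoors with SPL 0.94).
Fine-tuning ViNT with at most 1 hour of on-task data yields strong performance in new domains (e.g., autonomous driving in CARLA) and across goal modalities (Images, Positions, Routing), going beyond image goals alone.
ViNT can be adapted to new modalities through a lightweight mapping into the shared goal-token space, and can be fine-tuned end-to-end to improve task performance.
Emergent behaviors include implicit collision avoidance as a default behavior, inferred navigation preferences (e.g., following roads, staying within hallways), and robustness to dynamic pedestrians.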
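
For readers unfamiliar with the SPL figures quoted above, the sketch below computes Success weighted by Path Length. It assumes the standard definition from the embodied-navigation literature, SPL = (1/N) Σ S_i · l_i / max(p_i, l_i), where S_i is the success indicator, l_i the shortest path length, and p_i the path length actually traveled. The function name and the episode values in the usage lines are invented for illustration and are not the paper's data.

```python
def spl(episodes):
    """Compute Success weighted by Path Length.

    episodes: iterable of (success, shortest_path_m, actual_path_m).
    A failed episode contributes 0; a successful one contributes the
    ratio of the shortest path to the longer of the two paths.
    """
    episodes = list(episodes)
    if not episodes:
        raise ValueError("need at least one episode")
    total = 0.0
    for success, shortest, actual in episodes:
        if shortest <= 0:
            raise ValueError("shortest path length must be positive")
        if success:
            total += shortest / max(actual, shortest)
    return total / len(episodes)


# Illustrative episodes: a detour, an optimal run, and a failure.
example = [
    (True, 100.0, 125.0),   # contributes 0.8
    (True, 50.0, 50.0),     # contributes 1.0
    (False, 80.0, 30.0),    # contributes 0.0
]
score = spl(example)  # about 0.6
```

An SPL close to the success rate therefore indicates that successful runs followed near-shortest paths.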

This review was generated by AI and reviewed by a human editor.