QUICK REVIEW

[Paper Review] Chasing Ghosts: Instruction Following as Bayesian State Tracking

Peter Anderson, Ayush Shrivastava|arXiv (Cornell University)|Jul 3, 2019

Multimodal Machine Learning Applications | References: 41 | Cited by: 32

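One-Sentence Summary

This paper models instruction following as Bayesian state tracking over a semantic spatial map, shows that this approach outperforms LingUNet at goal prediction, and achieves credible Vision-and-Language Navigation results without over-relying on navigation constraints.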

ABSTRACT

A visually-grounded navigation instruction can be interpreted as a sequence of expected observations and actions an agent following the correct trajectory would encounter and perform. Based on this intuition, we formulate the problem of finding the goal location in Vision-and-Language Navigation (VLN) within the framework of Bayesian state tracking - learning observation and motion models conditioned on these expectable events. Together with a mapper that constructs a semantic spatial map on-the-fly during navigation, we formulate an end-to-end differentiable Bayes filter and train it to identify the goal by predicting the most likely trajectory through the map according to the instructions. The resulting navigation policy constitutes a new approach to instruction following that explicitly models a probability distribution over states, encoding strong geometric and algorithmic priors while enabling greater explainability. Our experiments show that our approach outperforms a strong LingUNet baseline when predicting the goal location on the map. On the full VLN task, i.e. navigating to the goal location, our approach achieves promising results with less reliance on navigation constraints.
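
The Bayesian state tracking framing in the abstract can be written as the standard two-step Bayes filter recursion. This is a sketch in textbook form, not an equation copied from the paper. The notation is an assumption chosen to match the symbols used later in this review: s_t is the agent state, a_t and o_t are the latent action and observation derived from the instruction, M is the semantic spatial map, and \eta is a normalizing constant.

```latex
\begin{align}
\overline{bel}(s_t) &= \sum_{s_{t-1}} p(s_t \mid s_{t-1}, a_t, M)\, bel(s_{t-1})
  && \text{(prediction with the motion model)} \\
bel(s_t) &= \eta \, p(o_t \mid s_t, M)\, \overline{bel}(s_t)
  && \text{(update with the observation model)}
\end{align}
```

Under this reading, the goal is taken from the belief after the last instruction step has been processed.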

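Research Motivation and Goals

Advance the development of agents that can ground language in vision and action within partially observable 3D environments.
Propose a Bayesian state tracking framework (mapping, filtering, policy) that predicts the goal location from instructions.
Use a semantic spatial map to represent environment geometry and enable reasoning over alternative trajectories.
Demonstrate better goal location prediction than a strong neural baseline, and credible full VLN performance without extensive navigation constraints.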

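Proposed Method

Extend Matterport3D with depth outputs to enable semantic mapping.
Build a temporal semantic spatial map M_t, updated from first-person views by depth-aware projection of CNN features.
Implement a differentiable histogram filter over map cells to track the latent trajectory corresponding to the observations and actions derived from the instruction.
Use a sequence-to-sequence model with attention to extract latent observations o_t and actions a_t from the instruction.
Model motion with convolution-based kernels that depend on the action and the map, ensuring locality and obstacle awareness.
Compute p(o_t | s_t, M) for the Bayesian update with a discriminative, learned observation model based on LingUNet.
Train end-to-end by minimizing the KL divergence between the predicted belief and the ground-truth trajectory; where needed, a reactive policy can be used to reach the predicted goal.
Provide a policy that operates on a global-view map and selects actions to navigate to the predicted goal.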
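
The filter, motion model, and training signal listed above can be illustrated with a small NumPy sketch. It is not the authors' implementation: the belief here is a plain 2D grid with no heading dimension, the kernel and likelihood are hand-supplied arrays rather than learned network outputs, and the function names (`motion_update`, `observation_update`, `kl_divergence`) and the KL direction are assumptions made for illustration.

```python
import numpy as np


def motion_update(belief, kernel, free_mask):
    """Prediction step: move belief mass according to a local motion kernel.

    kernel[i, j] is the (assumed) probability of a displacement of
    (i - kh // 2, j - kw // 2) cells. free_mask is 1 for traversable
    cells and 0 for obstacles, so mass cannot end up inside an obstacle.
    """
    kh, kw = kernel.shape
    if kh % 2 == 0 or kw % 2 == 0:
        raise ValueError("kernel dimensions must be odd")
    h, w = belief.shape
    padded = np.pad(belief, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    predicted = np.zeros_like(belief, dtype=float)
    for i in range(kh):
        for j in range(kw):
            # Flipped indexing makes this a true convolution, so the
            # kernel entry at (i, j) shifts mass by its own offset.
            window = padded[kh - 1 - i:kh - 1 - i + h, kw - 1 - j:kw - 1 - j + w]
            predicted += kernel[i, j] * window
    predicted *= free_mask
    total = predicted.sum()
    return predicted / total if total > 0 else predicted


def observation_update(predicted, likelihood):
    """Update step: reweight each cell by p(o_t | s_t, M) and renormalize."""
    posterior = predicted * likelihood
    total = posterior.sum()
    return posterior / total if total > 0 else posterior


def kl_divergence(target, belief, eps=1e-12):
    """KL(target || belief) over map cells (direction is an assumption)."""
    mask = target > 0
    return float(np.sum(target[mask] * (np.log(target[mask]) - np.log(belief[mask] + eps))))
```

A single filter step is `observation_update(motion_update(belief, kernel, free_mask), likelihood)`. In the method described above, the kernel and likelihood would come from the learned motion and observation models, and the operations would be written in a differentiable framework so the KL loss can be backpropagated.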

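Experimental Results

Research Questions

RQ1: Can instruction following be effectively framed as Bayesian state tracking over a semantic spatial map?
RQ2: Compared with neural baselines, does explicitly modeling a distribution over latent trajectories improve goal localization and VLN performance?
RQ3: How does including the agent's heading in the belief state affect instruction following?
RQ4: Can a differentiable Bayes filter with learnable motion and observation models achieve competitive VLN performance without relying on the navigation graph?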

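Main Findings

Goal prediction with the filter over (x, y, theta) is more robust than LingUNet in both seen and unseen environments, with average gains across the reported metrics.
Removing heading information degrades performance, which underlines the importance of orientation for following instructions.
In the reported settings, the goal prediction method outperforms the LingUNet baseline on unseen environments of the R2R dataset.
Full VLN results show that a new class of model trained with imitation learning alone, without data augmentation or specialized pre-training, achieves credible performance, reaching a meaningful success rate and SPL on the test server.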
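
For readers unfamiliar with SPL (Success weighted by Path Length), the metric averages, over episodes, the success flag times the ratio of the shortest-path length to the longer of the shortest and actual path lengths. The sketch below follows that standard definition; the function name `spl` and the input format are assumptions for illustration and are not taken from the paper or its evaluation code.

```python
def spl(successes, shortest_lengths, path_lengths):
    """Success weighted by Path Length, averaged over episodes.

    successes: iterable of bools, one per episode.
    shortest_lengths: shortest-path distance from start to goal per episode.
    path_lengths: length of the path the agent actually took per episode.
    """
    successes = list(successes)
    shortest_lengths = list(shortest_lengths)
    path_lengths = list(path_lengths)
    if not (len(successes) == len(shortest_lengths) == len(path_lengths)):
        raise ValueError("all inputs must have the same length")
    if not successes:
        raise ValueError("at least one episode is required")

    total = 0.0
    for success, shortest, taken in zip(successes, shortest_lengths, path_lengths):
        if not success:
            continue  # failed episodes contribute zero
        denom = max(taken, shortest)
        # Assumption: an episode that starts at the goal gets full credit.
        total += 1.0 if denom == 0 else shortest / denom
    return total / len(successes)
```

An agent that succeeds but takes a path twice as long as the shortest one earns half credit for that episode, so SPL penalizes inefficient navigation in a way that plain success rate does not.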

This review was generated by AI and reviewed by a human editor.