QUICK REVIEW

[Paper Review] Scene Transformer: A unified architecture for predicting multiple agent trajectories

Jiquan Ngiam, Benjamin Caine|arXiv (Cornell University)|Jun 15, 2021

Autonomous Vehicle Technology and Safety | Cited by 53

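One-sentence summary

Scene Transformer unifies marginal and joint multi-agent trajectory prediction through scene-centric masked sequence modeling and axis-factorized attention, and it can condition its predictions on a goal or on the behavior of other agents.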

ABSTRACT

Predicting the motion of multiple agents is necessary for planning in dynamic environments. This task is challenging for autonomous driving since agents (e.g. vehicles and pedestrians) and their associated behaviors may be diverse and influence one another. Most prior work has focused on predicting independent futures for each agent based on all past motion, and planning against these independent predictions. However, planning against independent predictions can make it challenging to represent the future interaction possibilities between different agents, leading to sub-optimal planning. In this work, we formulate a model for predicting the behavior of all agents jointly, producing consistent futures that account for interactions between agents. Inspired by recent language modeling approaches, we use a masking strategy as the query to our model, enabling one to invoke a single model to predict agent behavior in many ways, such as potentially conditioned on the goal or full future trajectory of the autonomous vehicle or the behavior of other agents in the environment. Our model architecture employs attention to combine features across road elements, agent interactions, and time steps. We evaluate our approach on autonomous driving datasets for both marginal and joint motion prediction, and achieve state of the art performance across two popular datasets. Through combining a scene-centric approach, agent permutation equivariant model, and a sequence masking strategy, we show that our model can unify a variety of motion prediction tasks from joint motion predictions to conditioned prediction.

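Research motivation and objectives

Motivate unified motion prediction and planning by modeling the interactions of all agents jointly rather than predicting each agent independently.
Develop a scene-centric, permutation-equivariant transformer architecture that scales to dense scenes with many agents.
Introduce a mask-based sequence modeling formulation so that, at inference time, predictions can be conditioned on the AV's goal or on its full future trajectory.
Demonstrate state-of-the-art performance on the marginal and joint prediction benchmarks of Argoverse and the Waymo Open Motion Dataset.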

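Proposed method

Represent all agents and road graph elements as scene-centric tensors of shape [A, T, D].
Use axis-factorized self-attention, alternating layers that attend across time and across agents, to capture temporal and inter-agent interactions efficiently.
Incorporate road graph information by applying cross-attention to a shared road embedding.
Train with a BERT-inspired masked sequence modeling objective that supports multiple tasks: motion prediction (MP), conditional motion prediction (CMP), and goal-conditioned prediction (GCP).
Decode multiple futures per scene, predicting each agent's trajectory together with its uncertainty and heading.
Compute the loss at the scene level for joint prediction or per agent for marginal prediction, so a single model can switch between the two tasks.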
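The axis-factorized attention described above can be illustrated with a minimal NumPy sketch. It is an illustration under simplifying assumptions, not the paper's implementation: attention is single-head with no learned query, key, or value projections, there is no layer normalization or feed-forward sublayer, and the function names (`self_attention`, `attend_across_time`, `attend_across_agents`, `factorized_encoder`) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # Single-head self-attention over the second-to-last axis of x: [..., N, D].
    # Assumption: no learned projections, purely for illustration.
    d = x.shape[-1]
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(d)  # [..., N, N]
    return softmax(scores, axis=-1) @ x               # [..., N, D]

def attend_across_time(x):
    # x: [A, T, D]. Each agent attends only over its own T time steps.
    return x + self_attention(x)

def attend_across_agents(x):
    # Swap to [T, A, D] so that, at each time step, agents attend to one another.
    xt = np.swapaxes(x, 0, 1)
    return x + np.swapaxes(self_attention(xt), 0, 1)

def factorized_encoder(x, num_blocks=2):
    # Alternate time-axis and agent-axis attention layers.
    for _ in range(num_blocks):
        x = attend_across_time(x)
        x = attend_across_agents(x)
    return x

rng = np.random.default_rng(0)
scene = rng.normal(size=(4, 5, 8))   # A=4 agents, T=5 steps, D=8 features
encoded = factorized_encoder(scene)
print(encoded.shape)
```

In this sketch the time-axis layer builds A score matrices of size T x T and the agent-axis layer builds T matrices of size A x A, whereas attention over the flattened sequence would build one matrix of size (A*T) x (A*T). Because no agent ordering is encoded, permuting the agents in the input permutes the output in the same way.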

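Experimental results

Research questions

RQ1: Can a single scene-centric, Transformer-based model produce both marginal and joint multi-agent predictions with consistent futures?
RQ2: Does axis-factorized attention improve efficiency and performance over full joint attention in multi-agent motion modeling?
RQ3: Can masked sequence modeling condition on the AV's goal or full future without relying on task-specific architecture?
RQ4: Does joint training with the masking strategy outperform marginal training on joint prediction metrics?
RQ5: How does the model perform on marginal and joint prediction on the standard benchmarks (Argoverse, Waymo Open Motion)?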
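The difference between marginal and joint training raised in RQ4 comes down to where the minimum over predicted futures is taken. The sketch below is a simplified illustration, not the paper's loss: it uses only a mean displacement error, omits any classification or uncertainty terms, and the function names and the [F, A, T, 2] layout are assumptions.

```python
import numpy as np

def per_future_agent_error(pred, truth):
    # pred: [F, A, T, 2] predicted xy positions for F candidate futures.
    # truth: [A, T, 2] ground-truth xy positions.
    # Returns the mean displacement per future and agent: [F, A].
    dist = np.linalg.norm(pred - truth[None], axis=-1)  # [F, A, T]
    return dist.mean(axis=-1)

def marginal_loss(pred, truth):
    # Each agent independently picks its own best future.
    err = per_future_agent_error(pred, truth)
    return err.min(axis=0).mean()

def joint_loss(pred, truth):
    # One future is picked for the whole scene, so all agents share it.
    err = per_future_agent_error(pred, truth)
    return err.mean(axis=1).min()

# Two futures, two agents, one time step. Each future is right for one agent only.
truth = np.zeros((2, 1, 2))
pred = np.zeros((2, 2, 1, 2))
pred[0, 1, 0] = [3.0, 4.0]   # future 0: agent 1 is off by 5
pred[1, 0, 0] = [3.0, 4.0]   # future 1: agent 0 is off by 5
print(marginal_loss(pred, truth), joint_loss(pred, truth))
```

In this toy scene the marginal loss is zero because each agent finds a matching future somewhere, while the joint loss is not, because no single future is right for both agents at once. That gap is what a scene-level loss penalizes.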

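Key findings

Achieves state-of-the-art results on the marginal motion prediction benchmarks of Argoverse and the Waymo Open Motion Dataset.
Outperforms the baselines on the joint (interactive) prediction task of the Waymo Open Motion Dataset when trained with the joint loss.
Attention factorized along the time and agent axes is computationally efficient and more accurate than non-factorized attention.
Masked sequence modeling enables flexible conditioning (CMP, GCP) and supports multi-task training without sacrificing standard MP performance.
A single model can perform motion prediction, conditional motion prediction, and goal-conditioned prediction with minimal loss in performance.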
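The way one model serves MP, CMP, and GCP can be sketched as a choice of visibility mask over the [A, T] grid of agents and time steps. This is a minimal illustration under assumptions: the helper names `build_mask` and `apply_mask`, the `av_index` argument, and the choice to zero hidden features and append a visibility flag are all hypothetical and not taken from the paper.

```python
import numpy as np

def build_mask(task, num_agents, num_steps, num_history, av_index=0):
    # Returns a boolean [A, T] array; True means the model may see that entry.
    visible = np.zeros((num_agents, num_steps), dtype=bool)
    visible[:, :num_history] = True        # every agent's history is visible
    if task == "MP":
        pass                               # all futures stay hidden
    elif task == "CMP":
        visible[av_index, :] = True        # the AV's full future is revealed
    elif task == "GCP":
        visible[av_index, -1] = True       # only the AV's final step (goal) is revealed
    else:
        raise ValueError(f"unknown task: {task}")
    return visible

def apply_mask(features, visible):
    # features: [A, T, D]. Hidden entries are zeroed and a visibility flag is
    # appended as an extra channel, giving [A, T, D + 1].
    hidden_zeroed = np.where(visible[..., None], features, 0.0)
    flag = visible[..., None].astype(features.dtype)
    return np.concatenate([hidden_zeroed, flag], axis=-1)

features = np.ones((3, 6, 4))              # A=3 agents, T=6 steps, D=4 features
for task in ("MP", "CMP", "GCP"):
    mask = build_mask(task, num_agents=3, num_steps=6, num_history=2)
    print(task, int(mask.sum()), apply_mask(features, mask).shape)
```

The model and its weights stay the same across the three tasks in this picture; only the mask passed in at inference time changes.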

This review was generated by AI and checked by a human editor.