QUICK REVIEW

[Paper Review] Latent Variable Sequential Set Transformers For Joint Multi-Agent Motion Prediction

Roger Girgis, Florian Golemo|arXiv (Cornell University)|Feb 19, 2021

Autonomous Vehicle Technology and Safety | Cited by 38

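One-Sentence Summary

AutoBots use a latent-variable, Transformer-based encoder/decoder to model sequences of sets for joint multi-agent motion prediction, producing fast, multimodal future trajectories and scene-consistent predictions. They achieve strong results on nuScenes and Argoverse while remaining trainable on a single GPU.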

ABSTRACT

Robust multi-agent trajectory prediction is essential for the safe control of robotic systems. A major challenge is to efficiently learn a representation that approximates the true joint distribution of contextual, social, and temporal information to enable planning. We propose Latent Variable Sequential Set Transformers which are encoder-decoder architectures that generate scene-consistent multi-agent trajectories. We refer to these architectures as "AutoBots". The encoder is a stack of interleaved temporal and social multi-head self-attention (MHSA) modules which alternately perform equivariant processing across the temporal and social dimensions. The decoder employs learnable seed parameters in combination with temporal and social MHSA modules allowing it to perform inference over the entire future scene in a single forward pass efficiently. AutoBots can produce either the trajectory of one ego-agent or a distribution over the future trajectories for all agents in the scene. For the single-agent prediction case, our model achieves top results on the global nuScenes vehicle motion prediction leaderboard, and produces strong results on the Argoverse vehicle prediction challenge. In the multi-agent setting, we evaluate on the synthetic partition of TrajNet++ dataset to showcase the model's socially-consistent predictions. We also demonstrate our model on general sequences of sets and provide illustrative experiments modelling the sequential structure of the multiple strokes that make up symbols in the Omniglot data. A distinguishing feature of AutoBots is that all models are trainable on a single desktop GPU (1080 Ti) in under 48h.

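Research Motivation and Objectives

Model multi-agent motion prediction as a sequence of sets with latent variables to capture multimodality.
Develop an encoder-decoder Transformer architecture with temporal and social attention.
Decode multiple future modes in a single pass through learnable seed parameters.
Ensure permutation equivariance with respect to agents and sets.
Demonstrate strong performance on the nuScenes, Argoverse, TrajNet++, and Omniglot datasets.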

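Proposed Method

Encode the sequence of agent sets with interleaved temporal and social multi-head self-attention blocks, producing a context tensor.
Decode multiple future modes in parallel using a matrix of learnable, mode-specific seed parameters and repeated MABD/MAB layers conditioned on the encoder context.
Supply additional environmental context through a CNN-derived vector M_i, replicated across agents and time steps.
Train with a latent-variable objective that uses a discrete Z and a variational-style approximate posterior Q, together with a mode-entropy regularizer that encourages diverse outputs.
Output the parameters of a distribution (e.g., a bivariate Gaussian) for each agent at each future time step.
Demonstrate permutation equivariance and compare inference speed against autoregressive baselines (a single forward pass per mode).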
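The interleaving of temporal and social attention described above can be illustrated with a small, dependency-free sketch. This is not the authors' implementation: it assumes a single attention head with identity query/key/value projections, omits positional encodings, masking, and the learned layers of the real model, and the function names (`self_attention`, `temporal_block`, `social_block`, `encode`) are hypothetical. Its only purpose is to show how attention alternates between the time axis and the agent axis, and why the result is equivariant to reordering the agents.

```python
import math


def self_attention(rows):
    """Single-head scaled dot-product self-attention.

    Assumption for illustration: queries, keys, and values are the inputs
    themselves (identity projections), so there are no learned weights.
    rows: list of d-dimensional feature vectors.
    """
    d = len(rows[0])
    out = []
    for q in rows:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in rows]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append(
            [sum(w * v[j] for w, v in zip(weights, rows)) / total for j in range(d)]
        )
    return out


def temporal_block(x):
    """Attend across time steps, independently for each agent.

    x[t][n] is the feature vector of agent n at time step t.
    """
    num_steps, num_agents = len(x), len(x[0])
    out = [[None] * num_agents for _ in range(num_steps)]
    for n in range(num_agents):
        attended = self_attention([x[t][n] for t in range(num_steps)])
        for t in range(num_steps):
            out[t][n] = attended[t]
    return out


def social_block(x):
    """Attend across agents, independently for each time step."""
    return [self_attention(agents_at_t) for agents_at_t in x]


def encode(x, num_layers=2):
    """Alternate temporal and social attention for num_layers rounds."""
    for _ in range(num_layers):
        x = social_block(temporal_block(x))
    return x
```

Because neither block depends on the order of the agents, reordering the agents in the input reorders the output rows in the same way (up to floating-point rounding), which is the equivariance property the list refers to.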

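Experimental Results

Research Questions

RQ1: Can a latent variable sequential set transformer (AutoBot) jointly model temporal and social interactions to generate consistent multi-agent futures?
RQ2: Can single-pass decoding with learnable seed parameters efficiently capture multimodal future modes without autoregressive sampling?
RQ3: How does AutoBot perform on real-world autonomous driving benchmarks (nuScenes, Argoverse) and on a synthetic multi-agent dataset (TrajNet++)?
RQ4: Can the model generate diverse, scene-consistent trajectories in tasks such as Omniglot stroke sequences?
RQ5: How computationally efficient is AutoBot compared with autoregressive or per-agent generation methods?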

Key Findings

Metric	AutoBot-Ego (c=10)	AutoBot-Ego (ensemble)	AutoBot-Ego (test) Min ADE (5)	AutoBot-Ego (test) Min ADE (10)	Miss Rate Top-5 (2m)	Miss Rate Top-10 (2m)	Min FDE (1)	Off-Road Rate
nuScenes - Min ADE (5)	1.43	1.37	-	-	0.66	0.45	8.66	0.03
nuScenes - Min ADE (10)	1.05	1.03	-	-	0.62	0.44	8.19	0.02

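AutoBot-Ego achieves strong results on nuScenes, with a low Min ADE (10) and a low off-road rate, and is competitive on the other metrics.
Ensembling three AutoBot-Ego models further improves performance on nuScenes.
On Argoverse, AutoBot-Ego (valid) achieves a Min ADE of 0.73, a Min FDE of 1.10, and a Miss Rate of 0.12; AutoBot-Ego (test) achieves a Min ADE of 0.89 (top-5) and a Min FDE of 1.41 (top-5).
On the synthetic partition of TrajNet++, social attention in the encoder/decoder reduces collisions and improves scene-level MinADE/MinFDE.
On Omniglot, AutoBot produces strokes that are more consistent and stylistically coherent than those of an LSTM baseline, including realistic completions under ambiguity.
AutoBot-Ego can be trained on nuScenes in about 3 hours on a single GPU (GTX 1080 Ti), and in the cited comparison its inference is about 2x faster than some autoregressive baselines.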
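For readers unfamiliar with the metrics quoted above, the following is a minimal sketch of how Min ADE, Min FDE, and a miss indicator are commonly computed for a set of predicted modes. It is not the official evaluation code of either benchmark. The function names are hypothetical, and the miss definition used here (every mode's maximum pointwise error exceeds a 2 m threshold) is an assumption modeled on the nuScenes convention; other benchmarks define a miss on the final position only.

```python
import math


def displacement_errors(pred, truth):
    """Per-step Euclidean distance between one predicted and the true trajectory."""
    return [math.dist(p, g) for p, g in zip(pred, truth)]


def min_ade(modes, truth):
    """Smallest average displacement error over the predicted modes."""
    return min(sum(displacement_errors(m, truth)) / len(truth) for m in modes)


def min_fde(modes, truth):
    """Smallest final-step displacement error over the predicted modes."""
    return min(math.dist(m[-1], truth[-1]) for m in modes)


def is_miss(modes, truth, threshold=2.0):
    """True if no mode stays within `threshold` of the truth at every step.

    Assumed definition (nuScenes-style); the threshold is in metres.
    """
    return min(max(displacement_errors(m, truth)) for m in modes) > threshold


# Toy example: two predicted modes for a 3-step ground-truth trajectory.
truth = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
modes = [
    [(1.0, 1.0), (2.0, 1.0), (3.0, 1.0)],  # constant 1 m lateral offset
    [(1.0, 0.0), (2.0, 1.0), (3.0, 3.0)],  # drifts away from the truth
]
print(min_ade(modes, truth))  # 1.0, from the first mode
print(min_fde(modes, truth))  # 1.0, from the first mode
print(is_miss(modes, truth))  # False: the first mode never deviates by more than 2 m
```

The miss rate reported for a dataset is then the fraction of evaluated samples for which `is_miss` is true, and the "(5)" or "(10)" suffix on a metric is the number of modes passed in.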


This review was generated by AI and reviewed by human editors.