QUICK REVIEW

[Paper Review] Scene Transformer: A unified multi-task model for behavior prediction and planning

Jiquan Ngiam, Benjamin Caine|arXiv (Cornell University)|Jun 15, 2021

Autonomous Vehicle Technology and Safety | Cited by 48

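Summary in brief

This paper proposes Scene Transformer, a unified multi-task model that uses a masking strategy within a Transformer architecture to jointly predict agent behavior and support planning. By attending across agents, road elements, and time steps, it models interactions dynamically and achieves state-of-the-art performance on behavior prediction benchmarks, showing that a single model can be effective across diverse motion prediction and planning tasks.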

ABSTRACT

Predicting the future motion of multiple agents is necessary for planning in dynamic environments. This task is challenging for autonomous driving since agents (e.g., vehicles and pedestrians) and their associated behaviors may be diverse and influence each other. Most prior work has focused on first predicting independent futures for each agent based on all past motion, and then planning against these independent predictions. However, planning against fixed predictions can suffer from the inability to represent the future interaction possibilities between different agents, leading to sub-optimal planning. In this work, we formulate a model for predicting the behavior of all agents jointly in real-world driving environments in a unified manner. Inspired by recent language modeling approaches, we use a masking strategy as the query to our model, enabling one to invoke a single model to predict agent behavior in many ways, such as potentially conditioned on the goal or full future trajectory of the autonomous vehicle or the behavior of other agents in the environment. Our model architecture fuses heterogeneous world state in a unified Transformer architecture by employing attention across road elements, agent interactions and time steps. We evaluate our approach on autonomous driving datasets for behavior prediction, and achieve state-of-the-art performance. Our work demonstrates that formulating the problem of behavior prediction in a unified architecture with a masking strategy may allow us to have a single model that can perform multiple motion prediction and planning related tasks effectively.

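Research motivation and goals

Address the limitations of predicting independent futures for each agent in multi-agent autonomous driving scenes.
Make planning more robust by jointly modeling agent behavior and interactions.
Unify multiple motion prediction and planning tasks under a single, flexible architecture.
Use attention across agents, road elements, and time steps to build a comprehensive understanding of the whole scene.
Show that a single model can handle diverse prediction and planning queries effectively through a masking strategy.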

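Proposed method

The model uses masked self-attention, with the mask acting as the query that produces diverse future predictions.
Cross-attention fuses heterogeneous inputs (agents, road elements, and temporal state) into a unified representation.
Conditioning the input on different future goals or trajectories enables end-to-end joint learning of behavior prediction and planning.
Learnable positional encodings model how the scene changes across time steps.
The model is trained end-to-end on autonomous driving datasets, with a multi-task loss that jointly optimizes the behavior prediction and planning objectives.
The masking strategy lets the same model generate predictions conditioned on different future scenarios, such as the autonomous vehicle's trajectory or an agent's goal.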
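
The idea of using a mask as the query can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the task labels (`motion_prediction`, `conditional_on_av_future`, `conditional_on_av_goal`), and the convention that the autonomous vehicle sits at index 0 are all assumptions made for illustration. The sketch only shows how one boolean mask over agents and time steps can select which states a single model is allowed to see, and therefore which task it is being asked to solve.

```python
from typing import Any, List, Optional


def build_visibility_mask(num_agents: int, num_past: int, num_future: int,
                          task: str, av_index: int = 0) -> List[List[bool]]:
    """Return mask[a][t], True where agent a's state at step t is visible.

    Task labels are hypothetical names chosen for this sketch.
    """
    total = num_past + num_future
    # History is visible for every agent; every future step starts hidden.
    mask = [[t < num_past for t in range(total)] for _ in range(num_agents)]

    if task == "motion_prediction":
        pass  # nothing beyond the history is revealed
    elif task == "conditional_on_av_future":
        # Reveal the autonomous vehicle's full future trajectory.
        for t in range(num_past, total):
            mask[av_index][t] = True
    elif task == "conditional_on_av_goal":
        # Reveal only the autonomous vehicle's final state (its goal).
        mask[av_index][total - 1] = True
    else:
        raise ValueError(f"unknown task: {task}")
    return mask


def apply_mask(states: List[List[Any]], mask: List[List[bool]],
               placeholder: Optional[Any] = None) -> List[List[Any]]:
    """Replace hidden entries with a placeholder that the model must fill in."""
    return [[s if visible else placeholder for s, visible in zip(row, mrow)]
            for row, mrow in zip(states, mask)]


if __name__ == "__main__":
    for task in ("motion_prediction", "conditional_on_av_future",
                 "conditional_on_av_goal"):
        print(task)
        for row in build_visibility_mask(3, 2, 3, task):
            # One line per agent: 1 = visible, 0 = hidden.
            print("  " + "".join("1" if v else "0" for v in row))
```

In a real model the hidden entries would typically be replaced by a learned or zeroed feature vector plus a flag marking them as hidden, rather than `None`; that choice is left open here because the summary does not specify it.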

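Experimental results

Research questions

RQ1: Can a unified deep learning model perform behavior prediction and planning together effectively in dynamic driving environments?
RQ2: Compared with independent per-agent prediction, how much does jointly modeling agent interactions improve prediction accuracy and planning quality?
RQ3: How well does a single model generalize across diverse motion prediction and planning tasks when driven by a masking strategy?
RQ4: Does attention over agents, road elements, and time steps improve representation learning for complex driving scenes?
RQ5: Can the model generate diverse, context-appropriate future trajectories for different planning goals?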

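Key findings

Scene Transformer achieves state-of-the-art performance on autonomous driving behavior prediction benchmarks.
By capturing the possible future interactions between agents, the model is more robust when used for planning.
The masking strategy lets a single model generate predictions conditioned on different future scenarios, such as the autonomous vehicle's trajectory or an agent's goal.
Jointly modeling agents and their environment through attention yields future motion predictions that are more coherent and realistic.
The unified architecture reduces the need for separate prediction and planning models, improving efficiency and consistency.
The model generalizes well across tasks, including behavior prediction, trajectory prediction, and planning, with only minimal architectural changes.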

This review was generated by AI and checked by human editors.