QUICK REVIEW

[Paper Review] Scene Transformer: A unified architecture for predicting multiple agent trajectories

Jiquan Ngiam, Benjamin Caine | arXiv (Cornell University) | Jun 15, 2021

Autonomous Vehicle Technology and Safety | Cited by 53

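One-line summary

Scene Transformer unifies marginal and joint multi-agent trajectory prediction using a scene-centric masked sequence modeling approach and axis-factored attention, and allows predictions to be conditioned on goals or on other agents.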

ABSTRACT

Predicting the motion of multiple agents is necessary for planning in dynamic environments. This task is challenging for autonomous driving since agents (e.g. vehicles and pedestrians) and their associated behaviors may be diverse and influence one another. Most prior work has focused on predicting independent futures for each agent based on all past motion, and planning against these independent predictions. However, planning against independent predictions can make it challenging to represent the future interaction possibilities between different agents, leading to sub-optimal planning. In this work, we formulate a model for predicting the behavior of all agents jointly, producing consistent futures that account for interactions between agents. Inspired by recent language modeling approaches, we use a masking strategy as the query to our model, enabling one to invoke a single model to predict agent behavior in many ways, such as potentially conditioned on the goal or full future trajectory of the autonomous vehicle or the behavior of other agents in the environment. Our model architecture employs attention to combine features across road elements, agent interactions, and time steps. We evaluate our approach on autonomous driving datasets for both marginal and joint motion prediction, and achieve state of the art performance across two popular datasets. Through combining a scene-centric approach, agent permutation equivariant model, and a sequence masking strategy, we show that our model can unify a variety of motion prediction tasks from joint motion predictions to conditioned prediction.

Motivation and objectives

Motivate unified motion prediction and planning by modeling interactions between all agents jointly rather than independently.
Develop a scene-centric, permutation-equivariant transformer architecture that scales to dense scenes with many agents.
Introduce a masking-based sequence modeling formulation that enables conditioning on AV goals or full futures at inference.
Demonstrate state-of-the-art performance on marginal and joint prediction benchmarks across Argoverse and Waymo Open Motion Dataset.

Proposed method

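Represent all agents in a scene-centric tensor of shape [A, T, D] (agents, time steps, feature dimension), with road graph elements encoded in the same scene-level coordinate frame.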
Use axis-factored self-attention (across time and across agents) with alternating layers to efficiently capture temporal and inter-agent interactions.
Apply cross-attention to incorporate road graph information via shared road embeddings.
Train with a masked sequence modeling objective inspired by BERT to support multiple tasks (motion prediction, conditional motion prediction, goal-conditioned prediction).
Decode multiple futures per scene; predict per-agent trajectories with associated uncertainties and headings.
Compute a scene-level or per-agent loss depending on marginal vs. joint prediction tasks, enabling a single model to switch between tasks.
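
The axis-factored attention step above can be illustrated with a minimal NumPy sketch. It assumes single-head attention with residual connections and omits layer normalization, masking, road-graph cross-attention, and multi-head projections; the function names and toy sizes are illustrative and are not taken from the paper. The point is only to show how attention alternates between the time axis and the agent axis of an [A, T, D] tensor.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, wq, wk, wv):
    """Single-head self-attention over the second-to-last axis of x.

    x has shape [..., N, D]. Attention mixes the N entries; every leading
    dimension is treated as a batch dimension.
    """
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def attend_over_time(x, params):
    # x: [A, T, D]. Each agent attends across its own T time steps.
    return x + self_attention(x, *params)

def attend_over_agents(x, params):
    # Swap to [T, A, D] so that, at each time step, agents attend to each other.
    xt = np.swapaxes(x, 0, 1)
    out = xt + self_attention(xt, *params)
    return np.swapaxes(out, 0, 1)

def factored_block(x, time_params, agent_params):
    # One time-axis layer followed by one agent-axis layer.
    return attend_over_agents(attend_over_time(x, time_params), agent_params)

def make_params(rng, d):
    # Hypothetical query/key/value projection matrices.
    return tuple(rng.normal(scale=0.1, size=(d, d)) for _ in range(3))

rng = np.random.default_rng(0)
A, T, D = 4, 6, 8  # toy sizes: agents, time steps, features
x = rng.normal(size=(A, T, D))
time_params = make_params(rng, D)
agent_params = make_params(rng, D)

y = factored_block(x, time_params, agent_params)
print(y.shape)  # (4, 6, 8)
```

Under these assumptions, one time-axis layer scores A·T² pairs and one agent-axis layer scores T·A² pairs, whereas attention over all A·T tokens at once would score (A·T)² pairs. Because no agent ordering is encoded, reordering the agents in the input reorders the output in the same way.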

Experimental results

Research questions

RQ1: Can a single, scene-centric, transformer-based model produce both marginal and joint multi-agent predictions with consistent futures?
RQ2: Does axis-factored attention improve efficiency and performance compared to full joint attention in multi-agent motion modeling?
RQ3: Can masked sequence modeling enable conditioning on AV goals or full futures without task-specific architectures?
RQ4: Does joint training with a masking strategy yield better joint prediction metrics than marginal training?
RQ5: How does the model perform on standard benchmarks (Argoverse, Waymo Open Motion) for both marginal and joint predictions?
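
The marginal vs. joint distinction behind RQ1 and RQ4 can be made concrete with a small sketch. It assumes a minADE-style metric (minimum average displacement error, a common choice in motion forecasting); the function names and the [F, A, T, 2] layout are illustrative and do not reproduce the paper's exact loss. Marginal scoring picks the best future separately for each agent, while joint scoring must pick one future for the whole scene.

```python
import numpy as np

def displacement(pred, gt):
    # pred: [F, A, T, 2] candidate futures, gt: [A, T, 2] ground truth.
    # Returns the average displacement error per future and agent, shape [F, A].
    return np.linalg.norm(pred - gt[None], axis=-1).mean(axis=-1)

def marginal_min_ade(pred, gt):
    # Best future chosen independently per agent, then averaged over agents.
    return displacement(pred, gt).min(axis=0).mean()

def joint_min_ade(pred, gt):
    # Error averaged over agents first, then the single best scene-level future.
    return displacement(pred, gt).mean(axis=1).min()

# Toy scene: 2 futures, 2 agents, 3 time steps, ground truth at the origin.
gt = np.zeros((2, 3, 2))
pred = np.zeros((2, 2, 3, 2))
pred[0, 1] = [3.0, 4.0]  # future 0 is right for agent 0, off by 5 for agent 1
pred[1, 0] = [3.0, 4.0]  # future 1 is right for agent 1, off by 5 for agent 0

print(marginal_min_ade(pred, gt))  # 0.0
print(joint_min_ade(pred, gt))     # 2.5
```

In this toy case every agent has a perfect candidate, so the marginal score is zero, yet no single future is consistent for both agents, which the joint score exposes.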

Key findings

Achieves state-of-the-art results on marginal motion prediction benchmarks on Argoverse and Waymo Open Motion Dataset.
Outperforms baselines on joint (interactive) prediction tasks in Waymo Open Motion Dataset when trained with a joint loss.
Factorized time- and agent-axis attention provides computational efficiency and improves accuracy over non-factorized attention.
Masked sequence modeling enables flexible conditioning (conditional motion prediction, CMP, and goal-conditioned prediction, GCP) and supports multi-task training without sacrificing standard motion prediction (MP) performance.
Demonstrates that a single model can perform motion prediction, conditional motion prediction, and goal-conditioned prediction with minimal performance degradation.
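
How one model can serve the three tasks named above can be sketched as a choice of visibility mask over the [A, T] grid of agents and time steps. This is a minimal illustration, assuming that motion prediction hides every future step, conditional motion prediction additionally reveals the autonomous vehicle's full future, and goal-conditioned prediction reveals only its final step. The task strings, `av_index`, and the extra visibility channel in `apply_mask` are hypothetical names chosen for this example.

```python
import numpy as np

def make_mask(task, num_agents, num_steps, num_history, av_index=0):
    """Return a boolean [A, T] mask where True means the step is visible."""
    visible = np.zeros((num_agents, num_steps), dtype=bool)
    visible[:, :num_history] = True      # every agent's past is observed
    if task == "cmp":
        visible[av_index, :] = True      # AV's full future is given
    elif task == "gcp":
        visible[av_index, -1] = True     # only the AV's final step (goal) is given
    elif task != "mp":
        raise ValueError(f"unknown task: {task}")
    return visible

def apply_mask(features, visible):
    # Zero out hidden entries and append the visibility bit as an extra channel
    # so the model can tell "hidden" apart from "zero-valued".
    kept = np.where(visible[..., None], features, 0.0)
    flag = visible[..., None].astype(features.dtype)
    return np.concatenate([kept, flag], axis=-1)

A, T, H, D = 3, 5, 2, 4  # toy sizes: agents, steps, history length, features
features = np.ones((A, T, D))

for task in ("mp", "cmp", "gcp"):
    mask = make_mask(task, A, T, H)
    print(task, int(mask.sum()), apply_mask(features, mask).shape)
```

Only the mask changes between tasks; the input tensor layout and the model that consumes it stay the same.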

This review was generated by AI and checked by a human editor.