QUICK REVIEW

[Paper Review] Scene Transformer: A unified multi-task model for behavior prediction and planning

Jiquan Ngiam, Benjamin Caine | arXiv (Cornell University) | Jun 15, 2021

Autonomous Vehicle Technology and Safety | Cited by 48

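Quick Summary

This paper proposes Scene Transformer, a unified multi-task model that uses a masking strategy within a Transformer architecture to jointly predict agent behavior and support planning. By applying attention across agents, lane elements, and time steps, it models dynamic, scene-wide interactions and achieves state-of-the-art performance on behavior prediction benchmarks. The result indicates that a single model can handle a range of motion prediction and planning tasks effectively.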

ABSTRACT

Predicting the future motion of multiple agents is necessary for planning in dynamic environments. This task is challenging for autonomous driving since agents (e.g., vehicles and pedestrians) and their associated behaviors may be diverse and influence each other. Most prior work has focused on first predicting independent futures for each agent based on all past motion, and then planning against these independent predictions. However, planning against fixed predictions can suffer from the inability to represent the future interaction possibilities between different agents, leading to sub-optimal planning. In this work, we formulate a model for predicting the behavior of all agents jointly in real-world driving environments in a unified manner. Inspired by recent language modeling approaches, we use a masking strategy as the query to our model, enabling one to invoke a single model to predict agent behavior in many ways, such as potentially conditioned on the goal or full future trajectory of the autonomous vehicle or the behavior of other agents in the environment. Our model architecture fuses heterogeneous world state in a unified Transformer architecture by employing attention across road elements, agent interactions and time steps. We evaluate our approach on autonomous driving datasets for behavior prediction, and achieve state-of-the-art performance. Our work demonstrates that formulating the problem of behavior prediction in a unified architecture with a masking strategy may allow us to have a single model that can perform multiple motion prediction and planning related tasks effectively.

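Research Motivation and Objectives

Address the limitations of predicting an independent future for each agent in multi-agent autonomous driving environments.
Model agent behavior and interactions jointly to make planning more robust.
Unify multiple motion prediction and planning tasks in a single flexible architecture.
Use attention across agents, lane elements, and time steps to build a comprehensive understanding of the scene.
Show that a masking strategy lets a single model handle diverse prediction and planning queries effectively.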

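Proposed Method

Adopts a masked self-attention mechanism and uses the mask as the query to generate diverse future predictions.
Fuses heterogeneous inputs (agents, lane elements, and temporal states) into a unified representation via cross-attention.
Conditions on various future goals or trajectories so that behavior prediction and planning can both be learned end to end.
Applies learnable positional encodings to model temporal dynamics across time steps.
Trains end to end on autonomous driving datasets with a multi-task loss that combines behavior prediction and planning objectives.
Through the masking strategy, the same model can generate predictions conditioned on different future scenarios, such as the AV trajectory or an agent's goal.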
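As a rough illustration of the "mask as query" idea described above, the sketch below builds a per-agent, per-time-step visibility mask for three kinds of query: plain motion prediction, prediction conditioned on the AV's full future trajectory, and prediction conditioned only on the AV's goal (its final step). The function names, the task strings, the zero fill value, and the appended visibility flag are all assumptions made for this example; the review does not specify how the authors encode masked inputs.

```python
from typing import List

# Hypothetical task names chosen for this sketch.
TASKS = (
    "motion_prediction",
    "conditional_motion_prediction",
    "goal_conditioned_prediction",
)


def build_visibility_mask(
    num_agents: int,
    num_steps: int,
    num_past: int,
    task: str,
    av_index: int = 0,
) -> List[List[bool]]:
    """Return mask[agent][step]; True means the state is shown to the model."""
    if not 0 < num_past < num_steps:
        raise ValueError("num_past must be between 1 and num_steps - 1")
    if not 0 <= av_index < num_agents:
        raise ValueError("av_index must refer to an existing agent")
    if task not in TASKS:
        raise ValueError(f"unknown task: {task}")

    # History is observed for every agent; every future step starts hidden.
    mask = [[t < num_past for t in range(num_steps)] for _ in range(num_agents)]

    if task == "conditional_motion_prediction":
        # Reveal the AV's full future trajectory.
        mask[av_index] = [True] * num_steps
    elif task == "goal_conditioned_prediction":
        # Reveal only the AV's final state (its goal).
        mask[av_index][num_steps - 1] = True
    return mask


def apply_mask(
    features: List[List[List[float]]],
    mask: List[List[bool]],
    fill: float = 0.0,
) -> List[List[List[float]]]:
    """Replace hidden states with `fill` and append a 1.0/0.0 visibility flag."""
    return [
        [
            (list(feat) if vis else [fill] * len(feat)) + [1.0 if vis else 0.0]
            for feat, vis in zip(agent_feats, agent_mask)
        ]
        for agent_feats, agent_mask in zip(features, mask)
    ]


if __name__ == "__main__":
    for task in TASKS:
        print(task)
        for row in build_visibility_mask(3, 6, 2, task):
            print("".join("1" if visible else "0" for visible in row))
```

In this sketch, switching the query type changes only the mask and leaves the model input format unchanged, which is the property that would let one trained network serve several prediction and planning queries.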

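Experimental Results

Research Questions

RQ1: Can a unified deep learning model effectively perform both behavior prediction and planning in dynamic driving environments?
RQ2: How much does jointly modeling agent interactions improve prediction accuracy and planning quality compared with independent predictions?
RQ3: How well does a single model generalize across diverse motion prediction and planning tasks when a masking strategy is used?
RQ4: How does attention across agents, lane elements, and time steps improve representation learning in complex driving scenes?
RQ5: Can the model generate diverse, contextually appropriate future trajectories when conditioned on different planning objectives?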

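Key Findings

Scene Transformer achieved state-of-the-art performance on autonomous driving behavior prediction benchmarks.
Capturing the possible future interactions between agents made planning more robust.
The masking strategy allowed the same model to generate predictions conditioned on different future scenarios, such as the AV trajectory or an agent's goal.
Modeling agents and the environment jointly through attention produced more consistent and realistic future motion predictions.
The unified architecture reduced the need for separate prediction and planning models, improving efficiency and consistency.
The model generalized to multiple tasks, including behavior prediction, trajectory prediction, and planning, with minimal architectural changes.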


This review was generated by AI and checked by a human editor.