QUICK REVIEW

[Paper Review] Latent Variable Sequential Set Transformers For Joint Multi-Agent Motion Prediction

Roger Girgis, Florian Golemo|arXiv (Cornell University)|Feb 19, 2021

Autonomous Vehicle Technology and Safety|Cited by 38

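One-line summary

AutoBots model sequences of sets with a transformer-based encoder/decoder that uses latent variables, performing joint multi-agent motion prediction and generating scene-consistent, multimodal future trajectories quickly. They achieve strong results on nuScenes and Argoverse while remaining trainable on a single GPU.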

ABSTRACT

Robust multi-agent trajectory prediction is essential for the safe control of robotic systems. A major challenge is to efficiently learn a representation that approximates the true joint distribution of contextual, social, and temporal information to enable planning. We propose Latent Variable Sequential Set Transformers which are encoder-decoder architectures that generate scene-consistent multi-agent trajectories. We refer to these architectures as "AutoBots". The encoder is a stack of interleaved temporal and social multi-head self-attention (MHSA) modules which alternately perform equivariant processing across the temporal and social dimensions. The decoder employs learnable seed parameters in combination with temporal and social MHSA modules allowing it to perform inference over the entire future scene in a single forward pass efficiently. AutoBots can produce either the trajectory of one ego-agent or a distribution over the future trajectories for all agents in the scene. For the single-agent prediction case, our model achieves top results on the global nuScenes vehicle motion prediction leaderboard, and produces strong results on the Argoverse vehicle prediction challenge. In the multi-agent setting, we evaluate on the synthetic partition of TrajNet++ dataset to showcase the model's socially-consistent predictions. We also demonstrate our model on general sequences of sets and provide illustrative experiments modelling the sequential structure of the multiple strokes that make up symbols in the Omniglot data. A distinguishing feature of AutoBots is that all models are trainable on a single desktop GPU (1080 Ti) in under 48h.

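Motivation and objectives

Model multi-agent motion prediction as a sequence of sets, using latent variables to capture multimodality.
Develop an encoder-decoder Transformer architecture built on temporal and social attention.
Use learnable seed parameters so that multiple future modes can be decoded in a single pass.
Guarantee permutation equivariance with respect to agents and sets.
Show strong performance on the nuScenes, Argoverse, TrajNet++, and Omniglot datasets.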
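
To make the equivariance objective concrete, here is a minimal sketch of one temporal attention pass followed by one social attention pass over a scene indexed as [time][agent][feature]. It is not the authors' implementation: it assumes a single attention head with identity projections and no positional encoding, and the names `self_attention` and `temporal_then_social` are hypothetical. The check at the end confirms that reordering the input agents reorders the output in the same way.

```python
import math
import random

def self_attention(rows):
    """Single-head dot-product self-attention with identity projections.

    `rows` is a list of equal-length feature vectors. No positional
    encoding is added, so reordering `rows` reorders the output identically.
    """
    d = len(rows[0])
    out = []
    for q in rows:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in rows]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        out.append([sum(w / z * v[j] for w, v in zip(weights, rows)) for j in range(d)])
    return out

def temporal_then_social(x):
    """x[t][n] is the feature vector of agent n at time step t."""
    T, N = len(x), len(x[0])
    # Temporal pass: each agent attends over its own time steps.
    per_agent = [self_attention([x[t][n] for t in range(T)]) for n in range(N)]
    x = [[per_agent[n][t] for n in range(N)] for t in range(T)]
    # Social pass: at each time step, the agents attend over one another.
    return [self_attention(x[t]) for t in range(T)]

random.seed(0)
T, N, D = 3, 4, 2
scene = [[[random.uniform(-1, 1) for _ in range(D)] for _ in range(N)] for _ in range(T)]
perm = [2, 0, 3, 1]  # an arbitrary reordering of the agents

out = temporal_then_social(scene)
out_perm = temporal_then_social([[scene[t][p] for p in perm] for t in range(T)])

# Equivariance: agent i of the permuted scene is agent perm[i] of the original.
max_gap = max(
    abs(out_perm[t][i][j] - out[t][perm[i]][j])
    for t in range(T) for i in range(N) for j in range(D)
)
print("agent-permutation equivariant:", max_gap < 1e-9)
```

A real model would add learned query/key/value projections, multiple heads, and a positional encoding along the time axis only, which keeps the agent axis equivariant while making the time axis order-aware.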

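Proposed method

Encodes the sequence of agent sets with alternating temporal and social multi-head self-attention blocks, producing a context tensor.
Decodes multiple future modes in parallel using a mode-specific learnable seed parameter matrix and repeated MABD/MAB layers conditioned on the encoder context.
Incorporates a CNN-derived vector M_i as environment context, replicated across agents and time steps.
Trains with a latent-variable objective that uses a discrete latent Z and a variational-style distribution Q to approximate the posterior, with an added mode-entropy regularization term that encourages diverse outputs.
Outputs, for each agent at each future time step, the parameters of a distribution (e.g., a bivariate Gaussian).
Demonstrates permutation equivariance and compares inference speed against autoregressive baselines (one forward pass per mode).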
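
As an illustration of the bivariate Gaussian output mentioned above, the following sketch evaluates the negative log-likelihood of an observed (x, y) position under predicted parameters. It uses the standard bivariate normal density. The parameterization (means, standard deviations, correlation `rho`) and the function names are assumptions made for this example, not details confirmed by the review.

```python
import math

def bivariate_gaussian_nll(x, y, mu_x, mu_y, sigma_x, sigma_y, rho):
    """Negative log-density of (x, y) under a bivariate normal.

    Requires sigma_x > 0, sigma_y > 0 and -1 < rho < 1.
    """
    dx = (x - mu_x) / sigma_x
    dy = (y - mu_y) / sigma_y
    one_minus_rho2 = 1.0 - rho * rho
    # Quadratic form of the density's exponent, before the 1/(2(1-rho^2)) factor.
    z = dx * dx + dy * dy - 2.0 * rho * dx * dy
    log_norm = math.log(2.0 * math.pi * sigma_x * sigma_y * math.sqrt(one_minus_rho2))
    return z / (2.0 * one_minus_rho2) + log_norm

def trajectory_nll(targets, params):
    """Sum of per-time-step NLLs for one agent and one mode.

    `targets` holds (x, y) positions; `params` holds one
    (mu_x, mu_y, sigma_x, sigma_y, rho) tuple per time step.
    """
    return sum(bivariate_gaussian_nll(x, y, *p) for (x, y), p in zip(targets, params))

targets = [(0.0, 0.0), (1.0, 0.5)]
on_target = [(0.0, 0.0, 1.0, 1.0, 0.0), (1.0, 0.5, 1.0, 1.0, 0.0)]
off_target = [(0.5, 0.0, 1.0, 1.0, 0.0), (1.0, 1.5, 1.0, 1.0, 0.0)]

nll_on = trajectory_nll(targets, on_target)
nll_off = trajectory_nll(targets, off_target)
print(nll_on, nll_off)  # predictions centered on the targets score lower
```

In practice a network would emit unconstrained values and map them through functions such as `exp` or `tanh` to keep the standard deviations positive and the correlation inside (-1, 1).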

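Experimental results

Research questions

RQ1: Can a latent-variable sequential set transformer (AutoBot) jointly model temporal and social interactions to generate consistent multi-agent futures?
RQ2: Can single-pass decoding with learnable seed parameters capture multimodal future modes efficiently, without autoregressive sampling?
RQ3: How does AutoBot perform on real-world autonomous-driving benchmarks (nuScenes, Argoverse) and on the synthetic multi-agent dataset (TrajNet++)?
RQ4: Can the model also generate diverse, realistic, scene-consistent trajectories on tasks such as Omniglot stroke sequences?
RQ5: How computationally efficient is AutoBot compared with autoregressive and per-agent generation methods?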

Key findings

Model (nuScenes)	Min ADE (5)	Min ADE (10)	Miss Rate Top-5 (2m)	Miss Rate Top-10 (2m)	Min FDE (1)	Off Road Rate
AutoBot-Ego (c=10)	1.43	1.05	0.66	0.45	8.66	0.03
AutoBot-Ego (ensemble)	1.37	1.03	0.62	0.44	8.19	0.02

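AutoBot-Ego achieves strong results on nuScenes, with the best Min ADE (10) and a low Off Road Rate; its other metrics are competitive.
Ensembling three AutoBot-Ego models further improves nuScenes performance.
On Argoverse, AutoBot-Ego (valid) achieves Min ADE 0.73, Min FDE 1.10, and Miss Rate 0.12; AutoBot-Ego (test) achieves 0.89 Min ADE (top-5) and 1.41 Min FDE (top-5).
On the synthetic TrajNet++ data, social attention in the encoder and decoder reduces collisions and improves scene-level MinADE/MinFDE.
On the Omniglot task, AutoBot generates more consistent, stylized strokes than the LSTM baseline, including realistic completions under uncertainty.
AutoBot-Ego trains on nuScenes in about 3 hours on a single GTX 1080 Ti GPU, and inference is roughly 2x faster than several of the autoregressive baselines cited for comparison.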
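
For readers unfamiliar with the Min ADE and Min FDE metrics reported above, here is a small sketch of how they are commonly computed for one agent with K predicted modes. Exact benchmark definitions differ in details such as K and how the best mode is selected, so treat this as an illustration under those assumptions; the function names and toy trajectories are made up for the example.

```python
import math

def displacement_errors(pred, truth):
    """Per-time-step Euclidean distance between two (x, y) trajectories."""
    return [math.hypot(px - tx, py - ty) for (px, py), (tx, ty) in zip(pred, truth)]

def min_ade_fde(modes, truth):
    """Return (min ADE, min FDE) over the predicted modes.

    ADE averages the displacement over all time steps; FDE takes only the
    final step. Each minimum is taken independently over the modes.
    """
    ades, fdes = [], []
    for pred in modes:
        errs = displacement_errors(pred, truth)
        ades.append(sum(errs) / len(errs))
        fdes.append(errs[-1])
    return min(ades), min(fdes)

truth = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
modes = [
    [(1.0, 0.0), (2.0, 0.0), (3.0, 1.0)],  # exact until the last step
    [(1.0, 1.0), (2.0, 1.0), (3.0, 1.0)],  # offset by 1 everywhere
    [(1.0, 0.5), (2.0, 0.5), (3.0, 0.5)],  # offset by 0.5 everywhere
]

min_ade, min_fde = min_ade_fde(modes, truth)
print(min_ade, min_fde)
```

In this toy case the best ADE and the best FDE come from different modes, which is why the two numbers are best read as separate summaries of a multimodal prediction.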


This review was generated by AI and checked by a human editor.