QUICK REVIEW

[Paper Review] 3D Object Tracking with Transformer

Yubo Cui, Zheng Fang|arXiv (Cornell University)|Oct 28, 2021

Video Surveillance and Tracking Methods|References: 24|Citations: 32

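One-line summary

LTTR introduces a Transformer-based feature fusion framework for LiDAR-based 3D object tracking. By modeling intra- and inter-region relations and exchanging information across the template and search branches, it achieves state-of-the-art results on KITTI.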

ABSTRACT

Feature fusion and similarity computation are two core problems in 3D object tracking, especially for object tracking using sparse and disordered point clouds. Feature fusion could make similarity computing more efficient by including target object information. However, most existing LiDAR-based approaches directly use the extracted point cloud feature to compute similarity while ignoring the attention changes of object regions during tracking. In this paper, we propose a feature fusion network based on transformer architecture. Benefiting from the self-attention mechanism, the transformer encoder captures the inter- and intra- relations among different regions of the point cloud. By using cross-attention, the transformer decoder fuses features and includes more target cues into the current point cloud feature to compute the region attentions, which makes the similarity computing more efficient. Based on this feature fusion network, we propose an end-to-end point cloud object tracking framework, a simple yet effective method for 3D object tracking using point clouds. Comprehensive experimental results on the KITTI dataset show that our method achieves new state-of-the-art performance. Code is available at: https://github.com/3bobo/lttr.

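Motivation and objectives

Motivate better feature fusion for the sparse, unordered LiDAR point clouds used in 3D tracking.
Use self-attention to capture inter- and intra-region relations within the point cloud.
Fuse template and search features through cross-attention to strengthen target cues.
Develop an end-to-end tracking framework with a simple, efficient Transformer-based design.
Demonstrate state-of-the-art performance on KITTI and provide ablations that validate the design choices.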

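Proposed method

Split the point cloud into non-overlapping local regions and apply a Transformer encoder to capture relations within and between those regions.
Use a Transformer decoder to propagate template features into the search features, performing region-level fusion through cross-attention.
Compute region attention weights and recover dense features through a guided combination process, so that the regression head can operate on them.
Adopt a center-based regression head that predicts a heatmap, offset, z-position, and orientation for 3D box localization.
Train end to end with a loss that combines an L1 loss on the regression targets with a focal loss on the heatmap.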
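
The region-level fusion step can be illustrated with a minimal sketch of scaled dot-product cross-attention, in which search-region features act as queries and template-region features act as keys and values. This is not the authors' implementation: the function names `softmax` and `cross_attention` are hypothetical, the sketch uses a single head with no learned projections, and the residual connection is an assumption borrowed from the standard Transformer decoder.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(search_feats, template_feats):
    """Single-head scaled dot-product cross-attention (illustrative only).

    search_feats:   list of N region feature vectors (queries)
    template_feats: list of M region feature vectors (keys and values)
    Returns (fused, weights): fused has N vectors, weights has N rows of M values.
    Assumption: no learned Q/K/V projections; a real model would include them.
    """
    dim = len(template_feats[0])
    scale = 1.0 / math.sqrt(dim)
    fused, weights = [], []
    for query in search_feats:
        # Similarity of this search region to every template region.
        scores = [scale * sum(q * k for q, k in zip(query, key))
                  for key in template_feats]
        attn = softmax(scores)
        # Attention-weighted sum of template features (the target cues).
        cue = [sum(w * value[d] for w, value in zip(attn, template_feats))
               for d in range(dim)]
        # Assumed residual connection: add the cues to the search feature.
        fused.append([q + c for q, c in zip(query, cue)])
        weights.append(attn)
    return fused, weights


if __name__ == "__main__":
    search = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]   # three search regions
    template = [[1.0, 0.0], [0.0, 1.0]]             # two template regions
    fused, weights = cross_attention(search, template)
    for row in weights:
        print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1 and shows how strongly one search region draws on each template region. A multi-head version would run several such attentions on learned projections of the features and concatenate the results.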

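Experimental results

Research questions

RQ1: Can Transformer-based feature fusion improve region attention and similarity computation in LiDAR-based 3D tracking?
RQ2: Does cross-attention between template and search features improve tracking accuracy and robustness on sparse point clouds?
RQ3: Can the end-to-end LTTR framework achieve state-of-the-art performance on KITTI across multiple object categories?

Key findings

LTTR achieves state-of-the-art results on KITTI, notably a Success of 65.0 and a Precision of 77.1 on the Car category.
Both the encoder and the decoder contribute notable performance gains over the non-Transformer baseline.
Increasing the number of Transformer heads up to 8 improves performance, but too many heads can reduce orientation estimation accuracy.
The framework remains real-time, and tracking improves especially for small objects (Pedestrian, Cyclist).
The ablations show that region-level interaction and cross-branch fusion are key to the accuracy gains.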

This review was generated by AI and checked by a human editor.