QUICK REVIEW

[Paper Review] TrTr: Visual Tracking with Transformer

Moju Zhao, Kei Okada|arXiv (Cornell University)|May 9, 2021

Video Surveillance and Tracking Methods|References: 54|Citations: 73

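One-Line Summary

TrTr introduces a Transformer encoder-decoder architecture for visual tracking. It replaces cross-correlation with self-attention and cross-attention to capture global contextual dependencies, and adds an online update module to improve robustness.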

ABSTRACT

Template-based discriminative trackers are currently the dominant tracking methods due to their robustness and accuracy, and the Siamese-network-based methods that depend on cross-correlation operation between features extracted from template and search images show the state-of-the-art tracking performance. However, general cross-correlation operation can only obtain relationship between local patches in two feature maps. In this paper, we propose a novel tracker network based on a powerful attention mechanism called Transformer encoder-decoder architecture to gain global and rich contextual interdependencies. In this new architecture, features of the template image is processed by a self-attention module in the encoder part to learn strong context information, which is then sent to the decoder part to compute cross-attention with the search image features processed by another self-attention module. In addition, we design the classification and regression heads using the output of Transformer to localize target based on shape-agnostic anchor. We extensively evaluate our tracker TrTr, on VOT2018, VOT2019, OTB-100, UAV, NfS, TrackingNet, and LaSOT benchmarks and our method performs favorably against state-of-the-art algorithms. Training code and pretrained models are available at https://github.com/tongtybj/TrTr.
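
The data flow described in the abstract (self-attention over the template, self-attention over the search features, then cross-attention between the two) can be illustrated with a minimal single-head sketch. This is not the authors' implementation: it assumes toy 2-dimensional features and omits the learned projections, multiple heads, positional encodings, residual connections, and feed-forward layers of a real Transformer. The names `attention`, `template`, and `search` are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Each query is scored against every key, so the context is global.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        weights.append(w)
        outputs.append(
            [sum(wi * v[j] for wi, v in zip(w, values)) for j in range(len(values[0]))]
        )
    return outputs, weights

# Toy features (assumption): 3 template tokens and 4 search tokens.
template = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
search = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.0, 0.0]]

# Encoder: self-attention over the template features.
template_ctx, _ = attention(template, template, template)
# Decoder, step 1: self-attention over the search features.
search_ctx, _ = attention(search, search, search)
# Decoder, step 2: cross-attention, search tokens query the template tokens.
fused, cross_w = attention(search_ctx, template_ctx, template_ctx)
```

Each row of `cross_w` holds one search token's weights over all template tokens. Every weight is non-zero, which is the sense in which attention relates all positions rather than only local patches.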

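Motivation and Objectives

Improve tracking robustness and accuracy beyond what local cross-correlation allows by capturing global context.
Propose a Transformer-based architecture that performs both target classification and bounding-box regression for tracking.
Incorporate an online update module that adapts to appearance changes during tracking.
Evaluate on major benchmarks and demonstrate competitive performance at real-time speed.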

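Proposed Method

Use a Transformer encoder that processes the template features with self-attention.
Use a Transformer decoder that processes the search features with self-attention and with cross-attention to the template features.
Replace conventional cross-correlation with multi-head attention to model global relationships.
Apply shape-agnostic anchor-based heads for classification and regression.
Incorporate an online update branch that adapts classification during tracking.
Train end to end on large-scale video datasets, using focal loss for classification and an L1-based loss for regression.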
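
The two loss terms named in the last item can be sketched as follows. This is an illustration under stated assumptions, not the paper's exact formulation: it uses the standard binary focal loss with the commonly used defaults `alpha=0.25` and `gamma=2.0`, and a plain mean absolute error over a hypothetical 4-value box parameterization. The variant, weighting, and box encoding actually used by TrTr may differ.

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Standard binary focal loss for one predicted probability p and label y in {0, 1}."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    p_t = min(max(p_t, 1e-12), 1.0)  # guard against log(0)
    # The (1 - p_t) ** gamma factor down-weights examples that are already easy.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

def l1_loss(pred_box, target_box):
    """Mean absolute error over the box parameters."""
    return sum(abs(p - t) for p, t in zip(pred_box, target_box)) / len(pred_box)

# A positive location predicted with high vs. low confidence.
easy = focal_loss(0.9, 1)
hard = focal_loss(0.1, 1)

# Hypothetical box values; the parameterization is an assumption.
reg = l1_loss([0.5, 0.5, 0.2, 0.3], [0.5, 0.4, 0.2, 0.5])
```

Here `hard` is far larger than `easy`, so training concentrates on misclassified locations, while `reg` penalizes box errors linearly.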

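Experimental Results

Research Questions

RQ1: Does the Transformer-based attention mechanism enable global contextual reasoning beyond local cross-correlation and improve tracking accuracy and robustness?
RQ2: Do the shape-agnostic anchor-based regression heads improve localization under appearance changes and distractors?
RQ3: How does adding the online update module affect tracking performance and robustness?
RQ4: How does a reduced Transformer depth (1 encoder + 1 decoder) affect tracking performance and speed?
RQ5: Can the approach achieve real-time tracking while competing with state-of-the-art Siamese-based trackers across benchmarks?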

Key Findings

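Dataset	TrTr-offline A	TrTr-offline R	TrTr-offline EAO	TrTr-online A	TrTr-online R	TrTr-online EAO
VOT2018	0.612	0.234	0.424	0.606	0.110	0.493
VOT2019	0.608	0.441	0.313	0.601	0.228	0.384
OTB-100	0.691	-	-	0.715	-	-
UAV123	59.4	-	-	65.2	-	-
NfS	55.2	-	-	63.1	-	-
TrackingNet	69.3	-	-	71.0	-	-
LaSOT	46.3	-	-	55.1	-	-
A: accuracy, R: robustness, EAO: expected average overlap (VOT benchmarks only). For the other benchmarks, the single score reported per variant is listed in the A column (AUC for OTB-100).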

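TrTr-offline achieves high accuracy and robustness on VOT2018/2019, surpassing several Siamese-based trackers in accuracy.
Adding the online update module (TrTr-online) markedly improves EAO on the VOT benchmarks compared with the offline-only variant (0.424 to 0.493 on VOT2018, 0.313 to 0.384 on VOT2019).
On OTB-100, TrTr-online achieves the highest AUC reported among the evaluated methods.
On UAV123 and NfS, TrTr-online ranks among the top methods and shows notable gains over several baselines.
On TrackingNet and LaSOT, TrTr is competitive but shows room for improvement on these larger-scale datasets.
The model runs in real time: about 50 FPS offline and about 35 FPS with the online update integrated.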
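
To put the EAO and speed figures in perspective, the short calculation below derives the relative EAO gain from the online update and the per-frame time implied by each frame rate. It is only arithmetic on the numbers reported in this review; the helper names `relative_gain` and `ms_per_frame` are hypothetical.

```python
def relative_gain(offline, online):
    """Relative change from offline to online, in percent."""
    return (online - offline) / offline * 100.0

def ms_per_frame(fps):
    """Per-frame time budget in milliseconds implied by a frame rate."""
    return 1000.0 / fps

# EAO values (offline, online) copied from the results table in this review.
eao = {"VOT2018": (0.424, 0.493), "VOT2019": (0.313, 0.384)}
gains = {name: relative_gain(off, on) for name, (off, on) in eao.items()}

# Approximate frame rates stated in the findings.
latency = {"offline": ms_per_frame(50), "online": ms_per_frame(35)}

for name, gain in gains.items():
    print(f"{name}: EAO gain {gain:+.1f}%")   # +16.3% and +22.7%
for mode, ms in latency.items():
    print(f"{mode}: {ms:.1f} ms per frame")   # 20.0 ms and 28.6 ms
```

On these figures, the online update costs roughly 8.6 ms more per frame and raises EAO by about 16% on VOT2018 and 23% on VOT2019.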

This review was generated by AI and checked by a human editor.