QUICK REVIEW

[Paper Review] TrTr: Visual Tracking with Transformer

Moju Zhao, Kei Okada|arXiv (Cornell University)|May 9, 2021

Video Surveillance and Tracking Methods|References 54|Cited by 73

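One-Sentence Summary

TrTr introduces a Transformer encoder-decoder architecture for visual tracking that replaces cross-correlation with self-attention and cross-attention to capture global contextual dependencies, and adds an online update module to improve robustness.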

ABSTRACT

Template-based discriminative trackers are currently the dominant tracking methods due to their robustness and accuracy, and the Siamese-network-based methods that depend on cross-correlation operation between features extracted from template and search images show the state-of-the-art tracking performance. However, general cross-correlation operation can only obtain relationship between local patches in two feature maps. In this paper, we propose a novel tracker network based on a powerful attention mechanism called Transformer encoder-decoder architecture to gain global and rich contextual interdependencies. In this new architecture, features of the template image is processed by a self-attention module in the encoder part to learn strong context information, which is then sent to the decoder part to compute cross-attention with the search image features processed by another self-attention module. In addition, we design the classification and regression heads using the output of Transformer to localize target based on shape-agnostic anchor. We extensively evaluate our tracker TrTr, on VOT2018, VOT2019, OTB-100, UAV, NfS, TrackingNet, and LaSOT benchmarks and our method performs favorably against state-of-the-art algorithms. Training code and pretrained models are available at https://github.com/tongtybj/TrTr.

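Motivation and Objectives

Improve tracking robustness and accuracy beyond what local cross-correlation allows by capturing global context.
Propose a Transformer-based architecture that performs target classification and bounding-box regression jointly during tracking.
Introduce an online update module to adapt to appearance changes during tracking.
Evaluate on major benchmarks to demonstrate competitive performance and real-time speed.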

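Proposed Method

Apply self-attention to the template features in a Transformer encoder.
Apply self-attention to the search features in a Transformer decoder, followed by cross-attention against the encoded template features.
Replace conventional cross-correlation with multi-head attention to model global relationships.
Use shape-agnostic anchor-based heads for classification and regression.
Add an online update branch that adapts the classification during tracking.
Train end to end on large video datasets, with a focal loss for classification and an L1-based loss for regression.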
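The steps above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: it uses a single attention head and omits the learned projections, positional encodings, residual connections, layer normalization, and feed-forward layers of a real Transformer. The function names, the toy feature values, and the focal-loss defaults (gamma = 2.0, alpha = 0.25, taken from the standard binary focal loss) are assumptions for illustration; the exact loss variant used in TrTr may differ.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention; each argument is a list of feature vectors."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return outputs


def encode_template(template_tokens):
    # Encoder: self-attention among template tokens.
    return attention(template_tokens, template_tokens, template_tokens)


def decode_search(search_tokens, template_memory):
    # Decoder: self-attention among search tokens, then cross-attention
    # in which every search token attends to every template token.
    search_ctx = attention(search_tokens, search_tokens, search_tokens)
    return attention(search_ctx, template_memory, template_memory)


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for one predicted probability p and label y in {0, 1}."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def l1_loss(pred_box, target_box):
    """Mean absolute error between two boxes given as equal-length lists."""
    return sum(abs(a - b) for a, b in zip(pred_box, target_box)) / len(pred_box)


# Toy data (hypothetical): 2 template tokens and 3 search tokens, 2 channels each.
template = [[1.0, 0.0], [0.0, 1.0]]
search = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
fused = decode_search(search, encode_template(template))
```

Here `fused` holds one vector per search token, each a weighted mix of all template tokens. This is the sense in which attention relates every search location to the whole template, whereas cross-correlation compares local patches only. In a full tracker, the classification and regression heads would consume these fused features.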

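Experimental Results

Research Questions

RQ1: Can Transformer-based attention perform global context reasoning and thereby improve tracking accuracy and robustness over local cross-correlation?
RQ2: Do shape-agnostic anchor regression heads improve localization under appearance changes and distractor objects?
RQ3: How does adding the online update module affect tracking performance and robustness?
RQ4: How does a reduced Transformer depth (1 encoder + 1 decoder) affect tracking performance and speed?
RQ5: Can the approach run in real time across benchmarks while competing with state-of-the-art Siamese-based trackers?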

Key Findings

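Dataset	TrTr-offline A	TrTr-offline R	TrTr-offline EAO	TrTr-online A	TrTr-online R	TrTr-online EAO
VOT2018	0.612	0.234	0.424	0.606	0.110	0.493
VOT2019	0.608	0.441	0.313	0.601	0.228	0.384
OTB-100	0.691	-	-	0.715	-	-
UAV123	59.4	-	-	65.2	-	-
NfS	55.2	-	-	63.1	-	-
TrackingNet	69.3	-	-	71.0	-	-
LaSOT	46.3	-	-	55.1	-	-

A = accuracy, R = robustness, EAO = expected average overlap (VOT metrics). For the non-VOT rows, the single score reported per tracker is listed in the A column; for OTB-100 this score is AUC.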

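TrTr-offline achieves high accuracy and robustness on VOT2018/2019, outperforming several Siamese-based trackers in accuracy.
Compared with the offline tracker alone, adding the online update module (TrTr-online) raises EAO on the VOT benchmarks substantially, from 0.424 to 0.493 on VOT2018 and from 0.313 to 0.384 on VOT2019.
On OTB-100, TrTr-online achieves the highest AUC reported among the evaluated methods (0.715).
On UAV123 and NfS, TrTr-online ranks among the top trackers, with clear gains over several baselines.
On TrackingNet and LaSOT, TrTr is competitive but leaves room for improvement on these larger-scale datasets.
The model runs in real time: about 50 FPS offline and about 35 FPS with the online update integrated.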
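To make the size of the online-update effect concrete, the short sketch below recomputes relative changes from the VOT rows of the table and converts the reported frame rates into per-frame time budgets. The helper names are hypothetical. The reading of R assumes the VOT failure-based robustness score, where lower is better.

```python
# Values copied from the VOT rows of the results table above.
VOT = {
    "VOT2018": {"offline": {"R": 0.234, "EAO": 0.424},
                "online": {"R": 0.110, "EAO": 0.493}},
    "VOT2019": {"offline": {"R": 0.441, "EAO": 0.313},
                "online": {"R": 0.228, "EAO": 0.384}},
}


def relative_change(old, new):
    """Fractional change from old to new, e.g. 0.10 means +10%."""
    return (new - old) / old


def frame_time_ms(fps):
    """Average time available per frame, in milliseconds."""
    return 1000.0 / fps


for name, row in VOT.items():
    eao = relative_change(row["offline"]["EAO"], row["online"]["EAO"])
    rob = relative_change(row["offline"]["R"], row["online"]["R"])
    print(f"{name}: EAO {eao:+.1%}, R {rob:+.1%}")

# Reported speeds: about 50 FPS offline, about 35 FPS with online update.
for label, fps in (("offline", 50), ("online", 35)):
    print(f"{label}: {frame_time_ms(fps):.1f} ms per frame")
```

By this arithmetic, EAO rises by roughly 16% on VOT2018 and 23% on VOT2019, while R drops by about half on both. The price is roughly 8.6 ms of extra time per frame (20.0 ms versus about 28.6 ms).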


This review was generated by AI and reviewed by a human editor.