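[Paper Review] Transformer Meets Tracker: Exploiting Temporal Context for Robust Visual Tracking
The paper introduces a transformer-based framework whose encoder and decoder are separated into parallel branches and integrated into Siamese-like tracking. By propagating temporal context across frames, it improves both Siamese and DCF/DiMP pipelines and achieves state-of-the-art results on multiple benchmarks.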
In video object tracking, there exist rich temporal contexts among successive frames, which have been largely overlooked in existing trackers. In this work, we bridge the individual video frames and explore the temporal contexts across them via a transformer architecture for robust object tracking. Different from classic usage of the transformer in natural language processing tasks, we separate its encoder and decoder into two parallel branches and carefully design them within the Siamese-like tracking pipelines. The transformer encoder promotes the target templates via attention-based feature reinforcement, which benefits the high-quality tracking model generation. The transformer decoder propagates the tracking cues from previous templates to the current frame, which facilitates the object searching process. Our transformer-assisted tracking framework is neat and trained in an end-to-end manner. With the proposed transformer, a simple Siamese matching approach is able to outperform the current top-performing trackers. By combining our transformer with the recent discriminative tracking pipeline, our method sets several new state-of-the-art records on prevalent tracking benchmarks.
Research Motivation and Objectives
- Identify and leverage temporal context across video frames to improve visual tracking robustness.
- Design a transformer that suits tracking by separating encoder and decoder branches within Siamese-like pipelines.
- Enable temporal feature reinforcement and cue propagation to handle occlusion, appearance changes, and distractors.
Proposed Method
- Separate encoder and decoder into two parallel branches within a Siamese-like tracking framework.
- Encoder: perform self-attention across multiple templates to produce encoded high-quality template features.
- Decoder: perform cross-attention between encoded templates and the current search patch to propagate temporal cues and masks.
- Mask transformation: propagate template masks to reinforce spatial attentions in the search patch.
- Feature transformation: propagate target representations from templates to the search patch with masking to focus on target regions.
- Train end-to-end with either a Siamese or DiMP-based tracking model; update template ensemble every 5 frames with a maximum of 20 templates.
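The encoder and decoder steps listed above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the scaled dot-product attention, the residual connections, the simplified `instance_norm`, the way the mask and feature branches are summed, and the policy of dropping the oldest template once 20 are stored are all choices made for this sketch that the review does not specify. The names `encode_templates`, `decode_search`, and `TemplateEnsemble` are hypothetical, and the features are random placeholders rather than backbone outputs.

```python
from collections import deque

import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(query, key):
    """Attention map of shape (N_query, N_key); each row sums to 1.
    Assumption: plain scaled dot-product similarity."""
    return softmax(query @ key.T / np.sqrt(query.shape[1]), axis=-1)


def instance_norm(x, eps=1e-5):
    """Assumption: normalize each channel over all positions (rows)."""
    return (x - x.mean(axis=0, keepdims=True)) / (x.std(axis=0, keepdims=True) + eps)


def encode_templates(templates):
    """Encoder: self-attention across all template positions. templates: (N_T, C)."""
    a_tt = attention(templates, templates)
    return instance_norm(a_tt @ templates + templates)


def decode_search(search, encoded_templates, template_masks):
    """Decoder: cross-attention from the search patch to the encoded templates.
    search: (N_S, C), encoded_templates: (N_T, C), template_masks: (N_T, 1)."""
    s = instance_norm(attention(search, search) @ search + search)
    a_ts = attention(s, encoded_templates)  # (N_S, N_T)
    # Mask transformation: propagated masks reweight the search features.
    s_mask = instance_norm((a_ts @ template_masks) * s)
    # Feature transformation: masked template features are carried to the search patch.
    s_feat = instance_norm(a_ts @ (encoded_templates * template_masks) + s)
    return instance_norm(s_mask + s_feat)  # assumption: branches are summed


class TemplateEnsemble:
    """Keeps at most `max_templates` (feature, mask) pairs, adding one every
    `update_interval` frames. Assumption: the oldest template is dropped first."""

    def __init__(self, max_templates=20, update_interval=5):
        self.items = deque(maxlen=max_templates)
        self.update_interval = update_interval

    def maybe_add(self, frame_index, feature, mask):
        if frame_index % self.update_interval == 0:
            self.items.append((feature, mask))

    def stacked(self):
        feats = np.concatenate([f for f, _ in self.items], axis=0)
        masks = np.concatenate([m for _, m in self.items], axis=0)
        return feats, masks


# Placeholder data: 8 channels, 16 spatial positions per patch, 200 frames.
rng = np.random.default_rng(0)
C, HW = 8, 16
ensemble = TemplateEnsemble()
for t in range(200):
    ensemble.maybe_add(t, rng.normal(size=(HW, C)), rng.random(size=(HW, 1)))

templates, masks = ensemble.stacked()
encoded = encode_templates(templates)
search = rng.normal(size=(HW, C))
out = decode_search(search, encoded, masks)
print(len(ensemble.items), encoded.shape, out.shape)  # 20 (320, 8) (16, 8)
```

The `deque` with `maxlen=20` keeps the ensemble bounded without extra bookkeeping. The decoded search feature `out` has the same shape as the input search feature, so in a pipeline of this kind it could be passed on to the Siamese or DiMP-style tracking head in place of the raw search feature.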
Experimental Results
Research Questions
- RQ1: How can temporal context across video frames be effectively modeled to improve robust tracking?
- RQ2: Can a transformer architecture be adapted to a Siamese-like tracking framework to reinforce template features and propagate temporal cues?
- RQ3: What is the impact of encoder-only, decoder-only, and combined encoder-decoder configurations on tracking performance?
- RQ4: How does the transformer-enhanced tracking perform when integrated with Siamese and DiMP/DCF pipelines across standard benchmarks?
Key Findings
| Variation | Siamese (AO) | DiMP (AO) |
|---|---|---|
| Baseline | 62.0 | 66.7 |
| Only Encoder (w/o Any Decoder) | 63.8 | 67.3 |
| Encoder + Decoder (Only Feature Transf.) | 66.3 | 68.1 |
| Encoder + Decoder (Only Mask Transf.) | 67.1 | 67.8 |
| Encoder + Decoder (Feature & Mask Transf.) | 67.3 | 68.8 |
- The encoder-only configuration yields modest gains over the baselines: +1.8 AO for Siamese (62.0 to 63.8) and +0.6 AO for DiMP (66.7 to 67.3).
- Adding the decoder with only the feature transformation lifts AO to 66.3 for Siamese (+4.3 over the baseline) and 68.1 for DiMP (+1.4).
- Adding the decoder with only the mask transformation also improves both trackers, to 67.1 for Siamese (+5.1) and 67.8 for DiMP (+1.1).
- Combining the feature and mask transformations gives the largest gains, significantly reducing training losses and raising AO on GOT-10k to 67.3 for Siamese (+5.3) and 68.8 for DiMP (+2.1).
- With the complete transformer, both TrSiam and TrDiMP achieve notable performance gains, and the AO gap between the two pipelines narrows from 4.7 points at baseline to 1.5 points.
- The transformer-enhanced trackers achieve competitive or state-of-the-art results across the TrackingNet, GOT-10k, LaSOT, VOT2018, NfS, UAV123, and OTB-2015 datasets.
This review was generated by AI and checked by a human editor.