QUICK REVIEW

[Paper Review] Spatiotemporal Transformer for Video-based Person Re-identification

Tianyu Zhang, Longhui Wei | arXiv (Cornell University) | Mar 30, 2021

Video Surveillance and Tracking Methods | References: 42 | Cited by: 30

One-Sentence Summary

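This paper proposes a Spatiotemporal Transformer (STT) for video-based person re-identification. It mitigates over-fitting with constrained attention and a global branch, pre-trains on synthesized data, and achieves state-of-the-art results on MARS, DukeMTMC-VideoReID, and LS-VID.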

ABSTRACT

Recently, the Transformer module has been transplanted from natural language processing to computer vision. This paper applies the Transformer to video-based person re-identification, where the key issue is to extract the discriminative information from a tracklet. We show that, despite the strong learning ability, the vanilla Transformer suffers from an increased risk of over-fitting, arguably due to a large number of attention parameters and insufficient training data. To solve this problem, we propose a novel pipeline where the model is pre-trained on a set of synthesized video data and then transferred to the downstream domains with the perception-constrained Spatiotemporal Transformer (STT) module and Global Transformer (GT) module. The derived algorithm achieves significant accuracy gain on three popular video-based person re-identification benchmarks, MARS, DukeMTMC-VideoReID, and LS-VID, especially when the training and testing data are from different domains. More importantly, our research sheds light on the application of the Transformer on highly-structured visual data.

Research Motivation and Objectives

Motivate the use of Transformer architectures for video-based person re-identification (ReID).
Mitigate over-fitting in Transformers given limited video ReID data via constraints and global attention.
Propose a synthesized-data pre-training pipeline to improve initialization and generalization.
Show empirical gains on standard video-based ReID benchmarks and analyze attention behavior.

Proposed Method

Propose a two-stage Spatiotemporal Transformer (STT) with a Spatial Transformer (ST) operating on image patches and a Temporal Transformer (TT) aggregating frame-wise tokens into a tracklet representation.
Apply constrained attention learning: spatial constraints with part-based and full-image cross-entropy losses to prevent over-focus on limited regions; temporal constraints combining frame-level triplet supervision and a temporal attention entropy loss.
Introduce a Global Transformer (GT) branch that processes all frame patches within a tracklet to model cross-frame patch relationships.
Use synthesized video data (UnrealPerson) for pre-training to alleviate data scarcity and improve initialization before fine-tuning on real datasets.
Extract feature maps with a CNN backbone (the first three blocks of ResNet-50), patchify them into tokens, and add a spatial token and a temporal token to fuse information across space and time.
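
The temporal attention entropy term mentioned above can be illustrated with a small sketch. It computes the entropy of the attention weights that a temporal token assigns to the frames of a tracklet. This summary does not state how the entropy enters the loss, so the `entropy_regularizer` function below is a hypothetical choice: it assumes the goal is to discourage attention from collapsing onto a single frame and therefore uses negative entropy as the penalty. All function names and the 8-frame tracklet are illustrative, not taken from the paper.

```python
import math


def softmax(scores):
    """Turn raw attention scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_entropy(weights, eps=1e-12):
    """Shannon entropy (in nats) of a set of attention weights."""
    return -sum(w * math.log(w + eps) for w in weights)


def entropy_regularizer(scores):
    """Hypothetical penalty (an assumption, not the paper's formula):
    negative entropy, so attention spread over more frames scores lower."""
    return -attention_entropy(softmax(scores))


# Temporal-token attention over an 8-frame tracklet (illustrative values).
uniform = softmax([0.0] * 8)          # attends to every frame equally
peaked = softmax([10.0] + [0.0] * 7)  # attends almost only to frame 0

print(attention_entropy(uniform))  # close to log(8), the maximum for 8 frames
print(attention_entropy(peaked))   # much smaller: attention is concentrated
```

Under this assumption, minimizing the regularizer pushes the temporal token toward using information from more frames. A penalty with the opposite sign would instead favor a few highly informative frames.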

Experimental Results

Research Questions

RQ1: Can Transformer-based architectures be effectively applied to video-based person ReID tasks?
RQ2: How can attention mechanisms be constrained to prevent over-fitting on limited video ReID data?
RQ3: Does a global attention branch complement STT by linking patches across frames?
RQ4: Does synthetic video pre-training improve generalization and performance on real ReID benchmarks?

Key Findings

The proposed constrained STT with a Global Transformer significantly outperforms CNN baselines and vanilla Transformers on MARS, DukeMTMC-VideoReID, and LS-VID, especially under cross-domain evaluation.
Constrained spatial attention reduces over-fitting and improves cross-domain transfer (e.g., rank-1 on DukeMTMC-VideoReID improves from 50.6% to 60.5% when the model is trained on MARS).
The Global Transformer provides notable gains by modeling patch relationships across frames (e.g., rank-1 on DukeMTMC-VideoReID improves by about 3.9% with GT).
Synthesized video pre-training yields substantial improvements in direct transfer across all three datasets, indicating better initialization and convergence.
Across ablations, the strongest configuration combines Spatial+Temporal constraints, Global Attention, and Synthesized Pre-training, achieving the best reported results.
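
The findings above are reported as rank-1 accuracy. As a rough illustration of what that metric measures, the sketch below computes it from a query-to-gallery distance matrix: a query counts as a hit when its nearest gallery tracklet has the same identity. This is a simplified version under stated assumptions. Benchmark protocols typically also filter out gallery tracklets from the same camera as the query, which is omitted here, and the distances and identity labels are made-up toy values.

```python
def rank1_accuracy(dist, query_ids, gallery_ids):
    """Fraction of queries whose nearest gallery item shares their identity.

    dist[q][g] is the distance between query q and gallery item g.
    Simplification: no same-camera filtering is applied.
    """
    hits = 0
    for q, row in enumerate(dist):
        nearest = min(range(len(row)), key=row.__getitem__)
        if gallery_ids[nearest] == query_ids[q]:
            hits += 1
    return hits / len(dist)


# Toy example: 3 query tracklets against 3 gallery tracklets.
dist = [
    [0.2, 0.9, 0.5],  # query 0: nearest is gallery 0 (identity 1), a hit
    [0.7, 0.1, 0.4],  # query 1: nearest is gallery 1 (identity 2), a hit
    [0.3, 0.8, 0.6],  # query 2: nearest is gallery 0 (identity 1), a miss
]
query_ids = [1, 2, 3]
gallery_ids = [1, 2, 3]

score = rank1_accuracy(dist, query_ids, gallery_ids)
print(f"rank-1 = {score:.3f}")
```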

This review was generated by AI and reviewed by human editors.