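[Paper Walkthrough] Spatiotemporal Transformer for Video-based Person Re-identification
This paper proposes a Spatiotemporal Transformer (STT) for video-based person re-identification. It mitigates over-fitting through constrained attention and a global branch, pre-trains on synthesized data, and achieves state-of-the-art results on MARS, DukeMTMC-VideoReID, and LS-VID.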
Recently, the Transformer module has been transplanted from natural language processing to computer vision. This paper applies the Transformer to video-based person re-identification, where the key issue is to extract the discriminative information from a tracklet. We show that, despite the strong learning ability, the vanilla Transformer suffers from an increased risk of over-fitting, arguably due to a large number of attention parameters and insufficient training data. To solve this problem, we propose a novel pipeline where the model is pre-trained on a set of synthesized video data and then transferred to the downstream domains with the perception-constrained Spatiotemporal Transformer (STT) module and Global Transformer (GT) module. The derived algorithm achieves significant accuracy gain on three popular video-based person re-identification benchmarks, MARS, DukeMTMC-VideoReID, and LS-VID, especially when the training and testing data are from different domains. More importantly, our research sheds light on the application of the Transformer on highly-structured visual data.
Research Motivation and Objectives
- Motivate the use of Transformer architectures for video-based person re-identification (ReID).
- Mitigate over-fitting in Transformers given limited video ReID data via constraints and global attention.
- Propose a synthesized-data pre-training pipeline to improve initialization and generalization.
- Show empirical gains on standard video-based ReID benchmarks and analyze attention behavior.
Proposed Method
- Propose a two-stage Spatiotemporal Transformer (STT) with a Spatial Transformer (ST) operating on image patches and a Temporal Transformer (TT) aggregating frame-wise tokens into a tracklet representation.
- Apply constrained attention learning: spatial constraints with part-based and full-image cross-entropy losses to prevent over-focus on limited regions; temporal constraints combining frame-level triplet supervision and a temporal attention entropy loss.
- Introduce a Global Transformer (GT) branch that processes all frame patches within a tracklet to model cross-frame patch relationships.
- Use synthesized video data (UnrealPerson) for pre-training to alleviate data scarcity and improve initialization before fine-tuning on real datasets.
- Train with a CNN backbone (ResNet-50 first three blocks), patchify feature maps into tokens, and use an additional spatial token and a temporal token to fuse information across space and time.
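The temporal attention entropy term in the list above can be illustrated with a small sketch. It assumes that the temporal token produces one scalar score per frame, that the scores are normalized with a softmax, and that the entropy in question is the Shannon entropy of the resulting weights. The summary does not state the exact formulation, nor how the term is signed or weighted in the total loss, so the function names and example scores below are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of per-frame scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_entropy(weights, eps=1e-12):
    """Shannon entropy (in nats) of a distribution of attention weights."""
    return -sum(w * math.log(w + eps) for w in weights)


def aggregate_frames(frame_features, weights):
    """Weighted sum of per-frame feature vectors into one tracklet vector."""
    dim = len(frame_features[0])
    return [
        sum(w * feat[d] for w, feat in zip(weights, frame_features))
        for d in range(dim)
    ]


# Hypothetical scores for a 4-frame tracklet (assumption, not from the paper).
uniform = softmax([0.0, 0.0, 0.0, 0.0])  # attention spread over all frames
peaked = softmax([8.0, 0.0, 0.0, 0.0])   # attention collapsed onto one frame
```

In this sketch, attention spread evenly over the frames gives the maximum entropy, while attention collapsed onto a single frame gives a value near zero, so an entropy term gives the training objective a handle on how concentrated the temporal attention is.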
Experimental Results
Research Questions
- RQ1: Can Transformer-based architectures be effectively applied to video-based person ReID tasks?
- RQ2: How can attention mechanisms be constrained to prevent over-fitting on limited video ReID data?
- RQ3: Does a global attention branch complement STT by linking patches across frames?
- RQ4: Does synthetic video pre-training improve generalization and performance on real ReID benchmarks?
Key Findings
- The proposed constrained STT with a Global Transformer significantly outperforms CNN baselines and vanilla Transformers on MARS, Duke, and LS-VID, especially under cross-domain evaluation.
- Constrained spatial attention reduces over-fitting and improves cross-domain transfer (e.g., Duke performance improves from 50.6% to 60.5% rank-1 when trained on MARS).
- Global attention learning provides notable gains by enabling cross-frame patch relationships (e.g., Duke: rank-1 improves by ~3.9% with GT).
- Synthesized video pre-training yields substantial improvements in direct transfer across all three datasets, indicating better initialization and convergence.
- Across ablations, the strongest configuration combines Spatial+Temporal constraints, Global Attention, and Synthesized Pre-training, achieving the best reported results.
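The rank-1 figures quoted above measure the fraction of query tracklets whose nearest gallery tracklet carries the same identity. A minimal sketch of that computation follows, assuming each tracklet has already been reduced to a single feature vector and that squared Euclidean distance is used for matching. The function name is hypothetical, and the standard benchmark protocols also filter gallery entries by camera, which is omitted here for brevity.

```python
def rank1_accuracy(query_feats, query_ids, gallery_feats, gallery_ids):
    """Fraction of queries whose closest gallery feature has the same identity."""

    def sq_dist(a, b):
        # Squared Euclidean distance between two feature vectors.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    hits = 0
    for feat, pid in zip(query_feats, query_ids):
        # Index of the gallery entry nearest to this query.
        nearest = min(
            range(len(gallery_feats)),
            key=lambda i: sq_dist(feat, gallery_feats[i]),
        )
        hits += int(gallery_ids[nearest] == pid)
    return hits / len(query_feats)


# Toy data (assumption): two queries, three gallery tracklets.
queries = [[0.0, 0.0], [1.0, 1.0]]
query_ids = [1, 2]
gallery = [[0.1, 0.0], [0.9, 1.0], [5.0, 5.0]]
gallery_ids = [1, 1, 3]
score = rank1_accuracy(queries, query_ids, gallery, gallery_ids)
```

In the toy data the first query matches a gallery tracklet of the same identity and the second does not, so `score` is 0.5.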
This walkthrough was generated by AI and reviewed by a human editor.