QUICK REVIEW

[Paper Review] Self-Supervised Pre-Training for Transformer-Based Person Re-Identification

Hao Luo, Pichao Wang | arXiv (Cornell University) | Nov 23, 2021

Video Surveillance and Tracking Methods | References: 43 | Cited by: 40

One-Sentence Summary

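The paper studies self-supervised pre-training for Transformer-based person ReID. It proposes the Catastrophic Forgetting Score (CFS) for conditional pre-training and an IBN-based convolution stem (ICS) for bridging the domain gap, and reports state-of-the-art results on Market-1501 and MSMT17.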

ABSTRACT

Transformer-based supervised pre-training achieves great performance in person re-identification (ReID). However, due to the domain gap between ImageNet and ReID datasets, it usually needs a larger pre-training dataset (e.g. ImageNet-21K) to boost the performance because of the strong data fitting ability of the transformer. To address this challenge, this work targets to mitigate the gap between the pre-training and ReID datasets from the perspective of data and model structure, respectively. We first investigate self-supervised learning (SSL) methods with Vision Transformer (ViT) pretrained on unlabelled person images (the LUPerson dataset), and empirically find it significantly surpasses ImageNet supervised pre-training models on ReID tasks. To further reduce the domain gap and accelerate the pre-training, the Catastrophic Forgetting Score (CFS) is proposed to evaluate the gap between pre-training and fine-tuning data. Based on CFS, a subset is selected via sampling relevant data close to the down-stream ReID data and filtering irrelevant data from the pre-training dataset. For the model structure, a ReID-specific module named IBN-based convolution stem (ICS) is proposed to bridge the domain gap by learning more invariant features. Extensive experiments have been conducted to fine-tune the pre-training models under supervised learning, unsupervised domain adaptation (UDA), and unsupervised learning (USL) settings. We successfully downscale the LUPerson dataset to 50% with no performance degradation. Finally, we achieve state-of-the-art performance on Market-1501 and MSMT17. For example, our ViT-S/16 achieves 91.3%/89.9%/89.6% mAP accuracy on Market1501 for supervised/UDA/USL ReID. Codes and models will be released to https://github.com/michuanhaohao/TransReID-SSL.
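
The headline numbers in the abstract are mean average precision (mAP) values. As a rough illustration of how such a number is formed, the sketch below computes average precision for each query from a ranked list of gallery matches and then averages over queries. The function names and the toy rankings are hypothetical; standard ReID evaluation also discards gallery images that share the query's identity and camera, a step this sketch omits.

```python
def average_precision(ranked_matches):
    """AP for one query.

    ranked_matches: list of booleans, ordered from most to least similar
    gallery image; True means the gallery image has the query's identity.
    """
    hits = 0
    precisions = []
    for rank, is_match in enumerate(ranked_matches, start=1):
        if is_match:
            hits += 1
            # Precision measured at the rank of each correct match.
            precisions.append(hits / rank)
    # A query with no correct match in the gallery contributes 0 here.
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(all_ranked_matches):
    """mAP: the mean of per-query AP values."""
    if not all_ranked_matches:
        return 0.0
    aps = [average_precision(r) for r in all_ranked_matches]
    return sum(aps) / len(aps)


# Toy example with two queries (made-up rankings, not paper data).
toy_rankings = [
    [True, False, True],  # AP = (1/1 + 2/3) / 2
    [False, True],        # AP = 1/2
]
toy_map = mean_average_precision(toy_rankings)
```

Because precision is sampled at every correct match, mAP rewards rankings that place all images of the same person near the top, not only the first one.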

Research Motivation and Objectives

Bridge the gap between pre-training and ReID target domains by addressing data and model structure differences.
Demonstrate that SSL pre-training on unlabelled person images can outperform ImageNet supervision for ViT-based ReID.
Propose a data-efficient conditional pre-training method (CFS) to downscale pre-training data while maintaining or improving performance.
Develop an IBN-based convolution stem (ICS) to improve invariance and stability of ViT-based ReID models.
Evaluate under supervised, unsupervised domain adaptation (UDA), and unsupervised learning (USL) settings and compare to state-of-the-art.

Proposed Method

Empirical study comparing SSL methods (MoCoV2, MoCoV3, MoBY, DINO) with ViT on LUPerson versus ImageNet-pretrained baselines.
Adopt DINO as the preferred SSL method for transformer-based ReID pre-training.
Introduce Catastrophic Forgetting Score (CFS) to measure domain gap between pre-training and fine-tuning data and perform conditional data filtering from LUPerson to create a smaller, more relevant pre-training subset.
Propose IBN-based convolution stem (ICS) to improve ViT optimization stability and learning of appearance-invariant features.
Evaluate three fine-tuning settings (Supervised, USL, UDA) on Market-1501 and MSMT17 and compare with ImageNet-pretrained baselines.
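
The conditional filtering step can be sketched as a simple rank-and-keep operation over per-image scores. This is a minimal sketch, not the authors' implementation: it assumes a score has already been computed for every pre-training image, and it assumes that a lower score means the image is closer to the downstream ReID data. This review states neither how CFS is computed nor which direction is "closer", so both are assumptions, and `select_subset` and its parameters are hypothetical names.

```python
def select_subset(scores, keep_ratio=0.5, lower_is_closer=True):
    """Return the indices of the pre-training images to keep.

    scores: one precomputed score per image (stand-in for CFS).
    keep_ratio: fraction of the dataset to retain, e.g. 0.5 for 50%.
    lower_is_closer: ASSUMPTION; flip it if a higher score means the
    image is closer to the downstream data.
    """
    if not 0 < keep_ratio <= 1:
        raise ValueError("keep_ratio must be in (0, 1]")
    # Floor of the target size, but keep at least one image.
    keep = max(1, int(len(scores) * keep_ratio))
    # Rank image indices from "closest" to "farthest".
    order = sorted(
        range(len(scores)),
        key=scores.__getitem__,
        reverse=not lower_is_closer,
    )
    # Return the kept indices in their original dataset order.
    return sorted(order[:keep])


# Toy scores for four images (made-up values, not paper data).
toy_scores = [0.9, 0.1, 0.5, 0.3]
kept = select_subset(toy_scores, keep_ratio=0.5)
```

With `keep_ratio=0.5` the toy call keeps images 1 and 3, the two with the lowest scores. Pre-training would then run only on the retained subset.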

Experimental Results

Research Questions

RQ1: Does SSL pre-training on unlabelled person images (LUPerson) outperform ImageNet supervised pre-training for ViT-based ReID?
RQ2: Can a data-driven, conditional pre-training strategy (CFS) reduce pre-training data size and time without harming downstream performance?
RQ3: Does a ReID-specific convolution stem (ICS) improve ViT performance and stability in ReID tasks?
RQ4: What are the gains of SSL pre-training for supervised, USL, and UDA ReID settings when using ViT backbones?
RQ5: How do the proposed methods compare to state-of-the-art supervised and USL/UDA ReID methods on Market-1501 and MSMT17?

Key Findings

DINO-based SSL pre-training on LUPerson with ViT-S/16 yields strong ReID performance, often surpassing ImageNet-pre-trained baselines.
Conditional pre-training (CondP), which filters the pre-training data using the Catastrophic Forgetting Score (CFS), can reduce the pre-training set to 50% of LUPerson (subsets in the 30-60% range are also reported) with equal or improved downstream performance, saving about 30% of pre-training time.
ICS (IBN-based convolution stem) consistently improves ViT-based ReID performance across supervised, USL, and UDA settings, and its benefits persist even in conditional pre-training.
Across evaluations, self-supervised pre-training on LUPerson generally outperforms ImageNet supervision for transformer-based ReID, with notable gains in MSMT17 under USL and UDA settings.
The proposed approach achieves state-of-the-art results on Market-1501 and MSMT17 under supervised, UDA, and USL ReID scenarios.
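
The normalization idea behind ICS can be illustrated with a small sketch of instance-batch normalization (IBN). It assumes the common IBN layout: instance normalization on the first group of channels and batch normalization on the rest. This is an illustration only, not the paper's stem. It works on nested Python lists rather than tensors, omits the convolutions and the learnable scale and shift, and uses batch statistics only. The `ibn` function and the toy batch are hypothetical.

```python
from statistics import fmean, pvariance

EPS = 1e-5  # small constant for numerical stability


def _standardize(values, mean, var):
    scale = (var + EPS) ** 0.5
    return [(v - mean) / scale for v in values]


def ibn(batch, num_instance_channels):
    """Apply IBN to batch[n][c], a flat list of spatial activations
    for sample n and channel c.

    Channels below num_instance_channels use instance normalization
    (statistics per sample and channel); the remaining channels use
    batch normalization (statistics per channel over the whole batch).
    """
    num_channels = len(batch[0])
    out = [[None] * num_channels for _ in batch]
    for c in range(num_channels):
        if c < num_instance_channels:
            # Instance norm: each image is normalized on its own,
            # which removes per-image brightness and contrast shifts.
            for n, sample in enumerate(batch):
                x = sample[c]
                out[n][c] = _standardize(x, fmean(x), pvariance(x))
        else:
            # Batch norm: statistics are shared across the batch, so
            # differences between images are preserved.
            pooled = [v for sample in batch for v in sample[c]]
            mean, var = fmean(pooled), pvariance(pooled)
            for n, sample in enumerate(batch):
                out[n][c] = _standardize(sample[c], mean, var)
    return out


# Two samples, two channels, two spatial positions (made-up values).
toy_batch = [
    [[1.0, 3.0], [0.0, 0.0]],
    [[10.0, 30.0], [2.0, 2.0]],
]
normalized = ibn(toy_batch, num_instance_channels=1)
```

In the toy batch, channel 0 of the second sample is a ten-times-brighter copy of the first, and instance normalization maps both to nearly the same values. Channel 1 goes through batch normalization, so the difference between the two samples survives. This mix is one plausible reading of how an IBN-based stem could learn more invariant features while keeping discriminative ones.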

This review was generated by AI and checked by a human editor.