QUICK REVIEW

[Paper Review] Self-Supervised Pre-Training for Transformer-Based Person Re-Identification

Hao Luo, Pichao Wang | arXiv (Cornell University) | 2021-11-23

Video Surveillance and Tracking Methods | References: 43 | Citations: 40

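One-Line Summary

This paper studies transformer-based self-supervised pre-training for person re-identification (ReID), introducing the Catastrophic Forgetting Score (CFS) for conditional pre-training and an IBN-based convolution stem (ICS) that bridges the domain gap, and achieves state-of-the-art performance on Market-1501 and MSMT17.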

ABSTRACT

Transformer-based supervised pre-training achieves great performance in person re-identification (ReID). However, due to the domain gap between ImageNet and ReID datasets, it usually needs a larger pre-training dataset (e.g. ImageNet-21K) to boost the performance because of the strong data fitting ability of the transformer. To address this challenge, this work targets to mitigate the gap between the pre-training and ReID datasets from the perspective of data and model structure, respectively. We first investigate self-supervised learning (SSL) methods with Vision Transformer (ViT) pretrained on unlabelled person images (the LUPerson dataset), and empirically find it significantly surpasses ImageNet supervised pre-training models on ReID tasks. To further reduce the domain gap and accelerate the pre-training, the Catastrophic Forgetting Score (CFS) is proposed to evaluate the gap between pre-training and fine-tuning data. Based on CFS, a subset is selected via sampling relevant data close to the down-stream ReID data and filtering irrelevant data from the pre-training dataset. For the model structure, a ReID-specific module named IBN-based convolution stem (ICS) is proposed to bridge the domain gap by learning more invariant features. Extensive experiments have been conducted to fine-tune the pre-training models under supervised learning, unsupervised domain adaptation (UDA), and unsupervised learning (USL) settings. We successfully downscale the LUPerson dataset to 50% with no performance degradation. Finally, we achieve state-of-the-art performance on Market-1501 and MSMT17. For example, our ViT-S/16 achieves 91.3%/89.9%/89.6% mAP accuracy on Market1501 for supervised/UDA/USL ReID. Codes and models will be released to https://github.com/michuanhaohao/TransReID-SSL.

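Motivation and Objectives

Close the gap between the pre-training domain and the target ReID domain by addressing differences in both data and model structure.
Demonstrate that SSL pre-training on unlabeled person images can outperform ImageNet supervised pre-training for ViT-based ReID.
Propose a data-efficient conditional pre-training method based on CFS that shrinks the pre-training data while maintaining or improving performance.
Develop the IBN-based convolution stem (ICS) to improve the invariance and stability of ViT-based ReID models.
Evaluate under supervised learning, unsupervised domain adaptation (UDA), and unsupervised learning (USL) settings, and compare against state-of-the-art methods.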

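Proposed Method

Empirically compare SSL methods (MoCoV2, MoCoV3, MoBY, DINO) with ViT, pre-trained on LUPerson, against ImageNet-pretrained baselines.
Adopt DINO as the preferred SSL method for transformer-based ReID pre-training.
Use the Catastrophic Forgetting Score (CFS) to measure the domain gap between pre-training and fine-tuning data, and apply conditional data filtering to LUPerson to build a smaller, more relevant pre-training subset.
Propose ICS (IBN-based convolution stem) to improve ViT optimization stability and the learning of appearance-invariant features.
Evaluate three fine-tuning settings (supervised, USL, UDA) on Market-1501 and MSMT17 and compare against ImageNet-pretrained baselines.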
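This review does not spell out the layers of ICS, so the following is only a minimal sketch of the IBN idea the stem is named after: instance normalization (IN) on part of the channels and batch normalization (BN) on the rest. The function name `ibn`, the `in_ratio=0.5` split, the nested-list layout `[batch][channel][position]`, and the omission of learnable scale and shift parameters are all assumptions made for illustration, not the authors' implementation.

```python
from statistics import fmean, pvariance

EPS = 1e-5  # small constant for numerical stability (assumed value)


def _normalize(values, mean, var):
    """Shift and scale a list of floats to roughly zero mean, unit variance."""
    std = (var + EPS) ** 0.5
    return [(v - mean) / std for v in values]


def ibn(x, in_ratio=0.5):
    """Apply IN to the first `in_ratio` share of channels and BN to the rest.

    x is a nested list shaped [batch][channel][position].
    Learnable affine parameters are deliberately left out of this sketch.
    """
    n_batch, n_ch = len(x), len(x[0])
    n_in = int(n_ch * in_ratio)
    out = [[None] * n_ch for _ in range(n_batch)]
    for c in range(n_ch):
        if c < n_in:
            # Instance norm: statistics per sample and per channel, which
            # removes image-specific brightness and contrast.
            for b in range(n_batch):
                vals = x[b][c]
                out[b][c] = _normalize(vals, fmean(vals), pvariance(vals))
        else:
            # Batch norm: statistics pooled over every sample in the batch,
            # which keeps differences between samples.
            pooled = [v for b in range(n_batch) for v in x[b][c]]
            mean, var = fmean(pooled), pvariance(pooled)
            for b in range(n_batch):
                out[b][c] = _normalize(x[b][c], mean, var)
    return out
```

With two samples that differ only by a global intensity scale, the IN channels come out nearly identical while the BN channels still separate the two samples. That is the intuition behind using IBN to learn appearance-invariant features while retaining discriminative ones.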

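Experimental Results

Research Questions

RQ1: Does SSL pre-training on unlabeled person images (LUPerson) outperform ImageNet supervised pre-training for ViT-based ReID?
RQ2: Can the data-driven conditional pre-training strategy (CFS) reduce pre-training data size and time without hurting downstream performance?
RQ3: Does the ReID-specific convolution stem (ICS) improve ViT performance and stability?
RQ4: How large is the gain from SSL pre-training across supervised, USL, and UDA settings when using a ViT backbone?
RQ5: How does the proposed method compare with existing supervised and USL/UDA ReID methods on Market-1501 and MSMT17?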

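Key Results

DINO-based SSL pre-training on LUPerson yields strong ReID performance with ViT-S/16, often surpassing ImageNet-pretrained baselines.
Conditional pre-training (CondP) using the Catastrophic Forgetting Score (CFS) and pre-training data filtering matches or improves downstream performance even when the pre-training data is reduced to 50% (or 30-60%), while saving about 30% of pre-training time.
ICS (IBN-based convolution stem) consistently improves ViT-based ReID performance across supervised, USL, and UDA settings, and the benefit persists under conditional pre-training.
Across the evaluations, self-supervised pre-training on LUPerson generally outperforms ImageNet supervised pre-training for transformer-based ReID, with especially large gains on MSMT17 in the USL and UDA settings.
The proposed method achieves state-of-the-art results on Market-1501 and MSMT17 in the supervised, UDA, and USL ReID scenarios.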
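The data-filtering step behind the 50% result can be pictured as ranking pre-training images by a relevance score and keeping only the closest share. The sketch below assumes the per-image CFS values have already been computed; this review does not give the CFS formula or state whether a lower or higher score means "closer to the downstream data", so the `scores` list, the `select_subset` name, and the `lower_is_closer` flag are hypothetical.

```python
def select_subset(scores, keep_ratio=0.5, lower_is_closer=True):
    """Return indices of the images to keep, most relevant first.

    scores: one precomputed relevance score per pre-training image
            (a stand-in for CFS; how it is computed is outside this sketch).
    keep_ratio: share of the dataset to keep, in (0, 1].
    lower_is_closer: assumed score direction; flip it if higher means closer.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    if not scores:
        return []
    n_keep = max(1, int(len(scores) * keep_ratio))
    # Rank image indices by score, best (closest) first.
    order = sorted(
        range(len(scores)),
        key=lambda i: scores[i],
        reverse=not lower_is_closer,
    )
    return order[:n_keep]


if __name__ == "__main__":
    toy_scores = [0.9, 0.1, 0.5, 0.3, 0.7, 0.2]  # made-up values
    print(select_subset(toy_scores, keep_ratio=0.5))
```

Keeping the selection as a plain ranking makes the keep ratio easy to sweep, for example over the 30-60% range mentioned above.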

This review was generated by AI and checked by a human editor.