QUICK REVIEW

[Paper Review] Exploiting the Textual Potential from Vision-Language Pre-training for Text-based Person Search

Guanshuo Wang, Fufu Yu | arXiv (Cornell University) | 2023-03-08

Video Surveillance and Tracking Methods | Citations: 13

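One-Line Summary

The paper proposes TP-TPS, a VLP-based text-based person search framework that fully exploits both pre-trained encoders and introduces MIDC and DAP to tap the textual potential for robust cross-modal alignment and fine-grained representations.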

ABSTRACT

Text-based Person Search (TPS), is targeted on retrieving pedestrians to match text descriptions instead of query images. Recent Vision-Language Pre-training (VLP) models can bring transferable knowledge to downstream TPS tasks, resulting in more efficient performance gains. However, existing TPS methods improved by VLP only utilize pre-trained visual encoders, neglecting the corresponding textual representation and breaking the significant modality alignment learned from large-scale pre-training. In this paper, we explore the full utilization of textual potential from VLP in TPS tasks. We build on the proposed VLP-TPS baseline model, which is the first TPS model with both pre-trained modalities. We propose the Multi-Integrity Description Constraints (MIDC) to enhance the robustness of the textual modality by incorporating different components of fine-grained corpus during training. Inspired by the prompt approach for zero-shot classification with VLP models, we propose the Dynamic Attribute Prompt (DAP) to provide a unified corpus of fine-grained attributes as language hints for the image modality. Extensive experiments show that our proposed TPS framework achieves state-of-the-art performance, exceeding the previous best method by a margin.

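Motivation and Objectives

Motivate the use of vision-language pre-training for text-based person search (TPS).
Develop a baseline VLP-TPS model that uses dual pre-trained encoders with minimal fine-tuning.
Introduce MIDC to ensure textual integrity and cross-modal alignment.
Introduce the Dynamic Attribute Prompt (DAP), which guides visual representations with text-derived hints about fine-grained attribute details.
Demonstrate state-of-the-art TPS results across benchmarks and analyze the contribution of each component.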

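Proposed Method

A baseline TPS model that uses dual pre-trained encoders with CLIP-based visual and textual backbones.
Token pooling to obtain fine-grained patch-level and word-level features from the two modalities.
MIDC (Multi-Integrity Description Constraints) generates incomplete attribute descriptions and enforces cross-modal and integrity losses.
DAP (Dynamic Attribute Prompt) generates attribute-based prompts that guide visual features through text-derived hints; the prompts are used to supervise the visual encoder.
Combined objective: L = L_cls + L_align + lambda0*L_int + lambda1*L_pmt, with additional prompt-side and integrity constraints.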
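
The combined objective above can be sketched as a weighted sum of four scalar loss terms. This is a minimal illustration, not the authors' implementation: the names `LossWeights` and `total_loss` are hypothetical, and the default weight values are placeholders, since this review does not state the values of lambda0 and lambda1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LossWeights:
    # Placeholder values (assumption): the review does not report lambda0 or lambda1.
    lambda0: float = 1.0  # weight of the integrity loss L_int (MIDC)
    lambda1: float = 1.0  # weight of the prompt loss L_pmt (DAP)


def total_loss(l_cls, l_align, l_int, l_pmt, weights=LossWeights()):
    """Return L = L_cls + L_align + lambda0 * L_int + lambda1 * L_pmt."""
    return l_cls + l_align + weights.lambda0 * l_int + weights.lambda1 * l_pmt


if __name__ == "__main__":
    # Toy scalar losses, chosen only to show how the weights act.
    print(total_loss(0.5, 0.25, 0.5, 0.25, LossWeights(lambda0=0.5, lambda1=2.0)))
```

In a real training loop each term would be a tensor produced by the corresponding head; the weighting logic would stay the same.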

Figure 1 : A typical Text-based Person Search model initialized with Vision-Language Pre-training models. The pre-trained vision encoders are used to initialize the TPS image representation, but textual encoders are altered by external language models such as LSTM or BERT, which is a asymmetry setti

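Experimental Results

Research Questions

RQ1: How can the text encoder be fully exploited in VLP-TPS to improve cross-modal alignment in TPS?
RQ2: Do MIDC and DAP provide measurable gains over the baseline by leveraging textual integrity and attribute prompts?
RQ3: How does integrating the CLIP-based text encoder affect TPS performance across benchmarks?
RQ4: How do MIDC and DAP interact to improve fine-grained attribute representations?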

Key Results

Method (Dataset)	Rank-1 (%)	Rank-5 (%)	Rank-10 (%)	mAP (%)
TP-TPS (CUHK-PEDES)	70.16	86.10	90.98	66.32
VLP-TPS (CUHK-PEDES)	65.38	82.74	88.98	62.47
TP-TPS (ICFG-PEDES)	60.64	75.97	81.76	42.78
VLP-TPS (ICFG-PEDES)	56.79	72.70	78.98	40.59
TP-TPS (RSTPReid)	50.65	72.45	81.20	43.11
VLP-TPS (RSTPReid)	45.55	68.85	77.60	40.99
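
To make the comparison in the table easier to read, the gain of TP-TPS over the VLP-TPS baseline can be computed per dataset. The sketch below only restates the numbers from the table; the names `RESULTS`, `METRICS`, and `gains` are illustrative.

```python
# Scores copied from the results table: (Rank-1, Rank-5, Rank-10, mAP), in percent.
RESULTS = {
    "CUHK-PEDES": {"TP-TPS": (70.16, 86.10, 90.98, 66.32),
                   "VLP-TPS": (65.38, 82.74, 88.98, 62.47)},
    "ICFG-PEDES": {"TP-TPS": (60.64, 75.97, 81.76, 42.78),
                   "VLP-TPS": (56.79, 72.70, 78.98, 40.59)},
    "RSTPReid": {"TP-TPS": (50.65, 72.45, 81.20, 43.11),
                 "VLP-TPS": (45.55, 68.85, 77.60, 40.99)},
}
METRICS = ("Rank-1", "Rank-5", "Rank-10", "mAP")


def gains(results):
    """Return the absolute improvement of TP-TPS over VLP-TPS, in percentage points."""
    out = {}
    for dataset, rows in results.items():
        out[dataset] = {
            metric: round(full - base, 2)
            for metric, full, base in zip(METRICS, rows["TP-TPS"], rows["VLP-TPS"])
        }
    return out


if __name__ == "__main__":
    for dataset, delta in gains(RESULTS).items():
        print(dataset, delta)
```

From these figures, the Rank-1 gain is 4.78 points on CUHK-PEDES, 3.85 on ICFG-PEDES, and 5.10 on RSTPReid.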

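TP-TPS achieves state-of-the-art results on CUHK-PEDES, with Rank-1 of 70.16% and mAP of 66.32%.
On ICFG-PEDES, TP-TPS achieves Rank-1 of 60.64% and mAP of 42.78%.
On RSTPReid, TP-TPS reaches Rank-1 of 50.65% and mAP of 43.11%.
Using CLIP-TE improves baseline performance, and adding MIDC and DAP yields further gains beyond the baseline VLP-TPS.
MIDC provides consistent improvements by enforcing textual integrity through partial descriptions, while DAP supplies fine-grained attribute guidance that strengthens the visual representation.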
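
For readers unfamiliar with the reported metrics, the sketch below shows the conventional retrieval definitions of Rank-k and mAP. It is an assumption that the paper follows this standard protocol, since the review does not describe the evaluation procedure; the function names and toy data are illustrative.

```python
def rank_k_hit(ranked_ids, target_id, k):
    """True if the query identity appears among the top-k retrieved gallery items."""
    return target_id in ranked_ids[:k]


def average_precision(ranked_ids, target_id):
    """Average of the precision values at each position where a correct match occurs."""
    hits = 0
    precision_sum = 0.0
    for position, gallery_id in enumerate(ranked_ids, start=1):
        if gallery_id == target_id:
            hits += 1
            precision_sum += hits / position
    return precision_sum / hits if hits else 0.0


def evaluate(rankings, ks=(1, 5, 10)):
    """rankings: list of (gallery identities sorted by similarity, query identity)."""
    n = len(rankings)
    scores = {
        f"Rank-{k}": 100.0 * sum(rank_k_hit(r, t, k) for r, t in rankings) / n
        for k in ks
    }
    scores["mAP"] = 100.0 * sum(average_precision(r, t) for r, t in rankings) / n
    return scores


if __name__ == "__main__":
    # Two toy text queries over a three-image gallery (hypothetical identities).
    toy = [(["a", "b", "a"], "a"), (["b", "c", "c"], "c")]
    print(evaluate(toy))
```

In practice the ranked lists would come from sorting gallery images by the similarity between their image embeddings and the text-query embedding.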

Figure 2 : Overview pipeline of the proposed TP-TPS. A simple baseline framework is developed based on CLIP pre-trained model. Visual and textual token pooling operations are employed to represent token-level fine-grained features for both modalities. We further introduce the Multi-Integrity Descrip

This review was generated by AI and reviewed by a human editor.