QUICK REVIEW

[Paper Review] Unified Multi-Dataset Training for TBPS

Nilanjana Chatterjee, Sidharatha Garg | arXiv (Cornell University) | 2026. 01. 21.

Video Surveillance and Tracking Methods | Citations: 0

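One-Line Summary

Scale-TBPS trains a single unified text-based person search model across multiple TBPS datasets, combining noise-aware data curation with a scalable discriminative identity learning objective, and outperforms both dataset-specific models and naive joint training.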

ABSTRACT

Text-Based Person Search (TBPS) has seen significant progress with vision-language models (VLMs), yet it remains constrained by limited training data and the fact that VLMs are not inherently pre-trained for pedestrian-centric recognition. Existing TBPS methods therefore rely on dataset-centric fine-tuning to handle distribution shift, resulting in multiple independently trained models for different datasets. While synthetic data can increase the scale needed to fine-tune VLMs, it does not eliminate dataset-specific adaptation. This motivates a fundamental question: can we train a single unified TBPS model across multiple datasets? We show that naive joint training over all datasets remains sub-optimal because current training paradigms do not scale to a large number of unique person identities and are vulnerable to noisy image-text pairs. To address these challenges, we propose Scale-TBPS with two contributions: (i) a noise-aware unified dataset curation strategy that cohesively merges diverse TBPS datasets; and (ii) a scalable discriminative identity learning framework that remains effective under a large number of unique identities. Extensive experiments on CUHK-PEDES, ICFG-PEDES, RSTPReid, IIITD-20K, and UFine6926 demonstrate that a single Scale-TBPS model outperforms dataset-centric optimized models and naive joint training.

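Motivation and Objectives

Move beyond dataset-centric TBPS toward a single unified model that handles multiple data distributions.
Mitigate cross-dataset noise and distribution shift when merging TBPS datasets.
Develop scalable identity learning that stays discriminative as the number of identities grows.
Show that unified training outperforms independently trained dataset-specific models.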

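Proposed Method

Noise-Aware Unified Data Curation (NDC) uses an ensemble of pretrained TBPS models to filter out unreliable text-image pairs without a similarity threshold.
Discriminative Identity Learning (DIL) introduces a multi-modal angular identity loss that enforces an angular margin on both the image and text modalities.
A shared multi-modal classifier weight vector w is used to compute the angular-margin logits for every identity.
Training combines the multi-modal angular identity loss with a ranking loss to optimize cross-modal alignment and identity discrimination.
The approach builds on CLIP-based encoders and extends them with a scalable angular-margin objective.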
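To make the identity objective above concrete, here is a minimal sketch of an angular-margin identity loss applied to both modalities with one shared set of classifier weights. The review does not give the exact formulation, so everything specific here is an assumption: the additive (ArcFace-style) margin, the `scale` and `margin` values, and all function names are hypothetical, and the ranking loss is omitted.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosines."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def angular_margin_logits(feature, class_weights, label, scale=30.0, margin=0.3):
    """Cosine logits with an additive angular margin on the true class.

    ASSUMPTION: additive margin (ArcFace-style); the source only says
    "angular margin" and does not specify the form.
    """
    f = l2_normalize(feature)
    logits = []
    for j, w in enumerate(class_weights):
        cos = sum(a * b for a, b in zip(f, l2_normalize(w)))
        cos = max(-1.0, min(1.0, cos))  # guard acos against rounding error
        if j == label:
            theta = math.acos(cos)
            # Widen the angle to the true class, capped at pi.
            cos = math.cos(min(theta + margin, math.pi))
        logits.append(scale * cos)
    return logits

def cross_entropy(logits, label):
    """Numerically stable softmax cross-entropy for one sample."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]

def multimodal_identity_loss(img_feat, txt_feat, class_weights, label,
                             scale=30.0, margin=0.3):
    """Sum of image and text identity losses using the SAME class weights."""
    img_logits = angular_margin_logits(img_feat, class_weights, label, scale, margin)
    txt_logits = angular_margin_logits(txt_feat, class_weights, label, scale, margin)
    return cross_entropy(img_logits, label) + cross_entropy(txt_logits, label)

if __name__ == "__main__":
    weights = [[1.0, 0.0], [0.0, 1.0]]  # one weight vector per identity
    print(multimodal_identity_loss([0.9, 0.1], [0.8, 0.2], weights, label=0))
```

Because both modalities are scored against the same weight vectors, an image and a caption of the same person are pulled toward the same class direction, which is one plausible reading of the "shared multi-modal classifier" described above.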

Figure 1: Illustration of Scale-TBPS. (a) illustrates the conventional dataset-centric training paradigm, where separate models are independently trained for different distributions, resulting in isolated models. (b) depicts naive joint training, where a single model is trained on merged datasets; h

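Experimental Results

Research Questions

RQ1: Can a single model be trained effectively on multiple TBPS datasets with differing distributions?
RQ2: How can noisy cross-dataset text-image pairs be curated for large-scale TBPS without discarding useful data?
RQ3: Does a discriminative, angular-margin identity learning objective scale to the large number of identities pooled across datasets?
RQ4: How does test-time similarity normalization affect retrieval performance in a unified TBPS model?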

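Key Results

A single Scale-TBPS model with NDC and DIL matches or outperforms dataset-specific and naive joint training methods on multiple TBPS benchmarks.
Scale-TBPS achieves higher mean average precision (mAP) and rank metrics than several CLIP-based and non-CLIP baselines.
Test-time similarity normalization (NNN) yields a notable gain in retrieval performance, particularly on certain datasets.
The NDC module effectively filters noisy pairs in a single preprocessing step, enabling scalable merging of multiple TBPS datasets.
DIL visualizations show tighter intra-class clusters and clearer inter-class separation than naive joint training.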
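The single-pass, rank-based filtering attributed to NDC can be illustrated with a small sketch. The review does not state the exact rule for keeping a pair, so the rule used here is purely an assumption: a pair is kept if at least one frozen model ranks the paired image within the top-K results for its text. The function names, the similarity layout, and the value of `k` are hypothetical.

```python
def top_k_indices(scores, k):
    """Indices of the k highest scores, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]

def curate_pairs(similarities_per_model, k=2):
    """Return indices of text-image pairs kept after ensemble filtering.

    similarities_per_model[m][t][i] is the similarity of text t to image i
    under frozen model m; text t is assumed to be paired with image t.

    ASSUMPTION: a pair is kept if ANY model retrieves its paired image in
    the top-k. The actual retention rule is not given in the source.
    """
    num_pairs = len(similarities_per_model[0])
    kept = []
    for t in range(num_pairs):
        votes = [t in top_k_indices(sim[t], k) for sim in similarities_per_model]
        if any(votes):
            kept.append(t)
    return kept

if __name__ == "__main__":
    model_a = [[0.9, 0.1, 0.2], [0.8, 0.1, 0.7], [0.1, 0.2, 0.9]]
    model_b = [[0.7, 0.3, 0.1], [0.5, 0.2, 0.6], [0.2, 0.1, 0.8]]
    print(curate_pairs([model_a, model_b], k=1))
```

Ranking within top-K replaces a fixed similarity cutoff, which is one way to read the "threshold-free" description. Since the similarities come from frozen models, the whole filter can run once before training.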

Figure 2: Overview of the proposed Scale-TBPS. (a) Noise-Aware Data Curation (NDC): Text–image pairs from the joint dataset ( $\mathcal{D}$ ) are encoded using a set of pretrained and frozen models $\Phi$ . top- $K$ retrieved samples are computed independently for each model. A pair is retained as a


This review was generated by AI and reviewed by a human editor.