QUICK REVIEW

[Paper Review] Unified Multi-Dataset Training for TBPS

Nilanjana Chatterjee, Sidharatha Garg|arXiv (Cornell University)|Jan 21, 2026

Video Surveillance and Tracking Methods | Cited by 0

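One-Sentence Summary

Scale-TBPS trains a single unified text-based person search model across multiple TBPS datasets, using noise-aware data curation and a scalable discriminative identity learning objective, and outperforms both dataset-specific models and naive joint training.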

ABSTRACT

Text-Based Person Search (TBPS) has seen significant progress with vision-language models (VLMs), yet it remains constrained by limited training data and the fact that VLMs are not inherently pre-trained for pedestrian-centric recognition. Existing TBPS methods therefore rely on dataset-centric fine-tuning to handle distribution shift, resulting in multiple independently trained models for different datasets. While synthetic data can increase the scale needed to fine-tune VLMs, it does not eliminate dataset-specific adaptation. This motivates a fundamental question: can we train a single unified TBPS model across multiple datasets? We show that naive joint training over all datasets remains sub-optimal because current training paradigms do not scale to a large number of unique person identities and are vulnerable to noisy image-text pairs. To address these challenges, we propose Scale-TBPS with two contributions: (i) a noise-aware unified dataset curation strategy that cohesively merges diverse TBPS datasets; and (ii) a scalable discriminative identity learning framework that remains effective under a large number of unique identities. Extensive experiments on CUHK-PEDES, ICFG-PEDES, RSTPReid, IIITD-20K, and UFine6926 demonstrate that a single Scale-TBPS model outperforms dataset-centric optimized models and naive joint training.

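Motivation and Objectives

Move beyond dataset-centric TBPS by training a single unified model that handles multiple data distributions.
Mitigate cross-dataset noise and distribution shift when merging TBPS datasets.
Develop scalable identity learning that stays discriminative as the number of identities grows.
Show that unified training can outperform independently trained, dataset-centric models.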

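Proposed Method

Noise-Aware Unified Dataset Curation (NDC) uses an ensemble of pretrained TBPS models to filter out unreliable text–image pairs without relying on hard thresholds.
Discriminative Identity Learning (DIL) introduces a Multimodal Angular Identity loss that enforces an angular margin on both the image and text modalities.
Shared multimodal classifier weight vectors w are used to compute the angular-margin-based logits for all identities.
Training combines the Multimodal Angular Identity loss with a ranking loss to optimize both cross-modal alignment and discriminability.
The method builds on CLIP-based encoders and extends them with a scalable angular-margin objective.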
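
The angular-margin idea behind DIL can be illustrated with a small sketch. This summary does not give the exact formulation of the Multimodal Angular Identity loss, so the code below assumes an ArcFace-style additive angular margin, applied to image and text embeddings against the same set of identity weight vectors. The function names, the `scale` and `margin` defaults, and the plain sum of the two modality losses are all illustrative assumptions, not the authors' implementation.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosines."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def angular_margin_logits(embedding, class_weights, label, scale=30.0, margin=0.5):
    """ArcFace-style logits (assumed form): penalize only the true class angle."""
    emb = l2_normalize(embedding)
    logits = []
    for idx, weight in enumerate(class_weights):
        cos = sum(a * b for a, b in zip(emb, l2_normalize(weight)))
        cos = max(-1.0, min(1.0, cos))  # guard acos against rounding error
        theta = math.acos(cos)
        if idx == label:
            # Simplification: clamp at pi so the penalized logit stays monotonic.
            theta = min(math.pi, theta + margin)
        logits.append(scale * math.cos(theta))
    return logits


def cross_entropy(logits, label):
    """Numerically stable softmax cross-entropy for a single sample."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[label]


def multimodal_angular_identity_loss(img_emb, txt_emb, class_weights, label,
                                     scale=30.0, margin=0.5):
    """Sum of image and text losses against the SAME identity weights (assumption)."""
    img = cross_entropy(angular_margin_logits(img_emb, class_weights, label, scale, margin), label)
    txt = cross_entropy(angular_margin_logits(txt_emb, class_weights, label, scale, margin), label)
    return img + txt
```

Because both modalities are scored against the same weight vectors, an image and a caption of the same person are pulled toward one shared identity direction, which is one plausible reading of the "shared multimodal classifier" described above.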

Figure 1: Illustration of Scale-TBPS. (a) illustrates the conventional dataset-centric training paradigm, where separate models are independently trained for different distributions, resulting in isolated models. (b) depicts naive joint training, where a single model is trained on merged datasets; h

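Experimental Results

Research Questions

RQ1: Can a single model be trained effectively on multiple TBPS datasets with different distributions?
RQ2: In large-scale TBPS, how can noisy cross-dataset text–image pairs be filtered without discarding useful data?
RQ3: Does an angular-margin-based discriminative identity learning objective scale to a large number of cross-dataset identities?
RQ4: How does test-time similarity normalization affect the retrieval performance of a unified TBPS model?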

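Key Findings

A single Scale-TBPS model with NDC and DIL matches or outperforms dataset-specific and naive joint training methods on multiple TBPS benchmarks.
Scale-TBPS achieves higher mean average precision (mAP) and rank-based metrics than multiple CLIP-based and non-CLIP baselines.
Test-time similarity normalization (NNN) yields significant gains in retrieval performance, especially on certain datasets.
The NDC module filters noisy pairs effectively in a one-time preprocessing step, enabling scalable merging of multiple TBPS datasets.
DIL visualizations show tighter intra-class clusters and clearer inter-class separation than naive joint training.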
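
The summary does not expand the acronym NNN or describe how the normalization works. Assuming it refers to nearest-neighbor normalization, in which each gallery item's score is reduced by a bias estimated from its highest similarities to a reference set of queries, a minimal sketch could look like the following. The function names and the `k` and `alpha` values are hypothetical and chosen only for illustration.

```python
def nnn_bias(reference_scores, k=2, alpha=0.5):
    """Per-gallery-item bias: alpha times the mean of its top-k reference scores.

    reference_scores[q][j] is the similarity between reference query q
    and gallery item j (assumed layout).
    """
    num_items = len(reference_scores[0])
    biases = []
    for j in range(num_items):
        top = sorted((row[j] for row in reference_scores), reverse=True)[:k]
        biases.append(alpha * sum(top) / len(top))
    return biases


def normalize_scores(query_scores, biases):
    """Subtract each gallery item's bias from a test query's raw scores."""
    return [score - bias for score, bias in zip(query_scores, biases)]
```

Under this assumption, gallery items that score highly against almost every query receive a larger penalty, so they are less likely to crowd out the correct match at test time. No retraining is involved, since the correction is applied only to the similarity scores.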

Figure 2: Overview of the proposed Scale-TBPS. (a) Noise-Aware Data Curation (NDC): Text–image pairs from the joint dataset ($\mathcal{D}$) are encoded using a set of pretrained and frozen models $\Phi$. Top-$K$ retrieved samples are computed independently for each model. A pair is retained as a
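
The ensemble-based filtering in NDC can be sketched as a rank-based vote. The retention rule is cut off in the caption above, so the code assumes one plausible rule: a pair is kept when the paired image appears in the text's top-K retrievals for at least `min_votes` of the frozen models. The function names, the data layout, and the `k` and `min_votes` parameters are hypothetical.

```python
def top_k_indices(scores, k):
    """Indices of the k highest scores, best first."""
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]


def curate_pairs(similarities, pairs, k=2, min_votes=1):
    """Keep text-image pairs that enough frozen models retrieve within top-k.

    similarities[m][t][i] is model m's score between text t and image i.
    pairs is a list of (text_index, image_index) tuples from the merged dataset.
    The voting rule is an assumption; the summary does not state the actual one.
    """
    kept = []
    for text_idx, image_idx in pairs:
        votes = sum(
            1 for sim in similarities
            if image_idx in top_k_indices(sim[text_idx], k)
        )
        if votes >= min_votes:
            kept.append((text_idx, image_idx))
    return kept
```

The decision depends on ranks rather than on raw similarity values, which fits the description of filtering without hard thresholds: models with different score scales can vote on equal terms. The filter runs once over the merged data before training.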


This review was generated by AI and reviewed by human editors.