QUICK REVIEW

[Paper Review] Exploiting the Textual Potential from Vision-Language Pre-training for Text-based Person Search

Guanshuo Wang, Fufu Yu|arXiv (Cornell University)|Mar 8, 2023

Video Surveillance and Tracking Methods|Cited by 13

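One-Sentence Summary

This paper proposes TP-TPS, a VLP-based framework for text-based person search that uses both pre-trained encoders and introduces MIDC and DAP, exploiting the textual potential of pre-training to achieve robust cross-modal alignment and fine-grained representations.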

ABSTRACT

Text-based Person Search (TPS), is targeted on retrieving pedestrians to match text descriptions instead of query images. Recent Vision-Language Pre-training (VLP) models can bring transferable knowledge to downstream TPS tasks, resulting in more efficient performance gains. However, existing TPS methods improved by VLP only utilize pre-trained visual encoders, neglecting the corresponding textual representation and breaking the significant modality alignment learned from large-scale pre-training. In this paper, we explore the full utilization of textual potential from VLP in TPS tasks. We build on the proposed VLP-TPS baseline model, which is the first TPS model with both pre-trained modalities. We propose the Multi-Integrity Description Constraints (MIDC) to enhance the robustness of the textual modality by incorporating different components of fine-grained corpus during training. Inspired by the prompt approach for zero-shot classification with VLP models, we propose the Dynamic Attribute Prompt (DAP) to provide a unified corpus of fine-grained attributes as language hints for the image modality. Extensive experiments show that our proposed TPS framework achieves state-of-the-art performance, exceeding the previous best method by a margin.

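Motivation and Objectives

Make the case for fully exploiting vision-language pre-training in Text-based Person Search (TPS).
Develop a baseline model, VLP-TPS, that uses both pre-trained encoders with only minimal fine-tuning.
Introduce MIDC to strengthen textual integrity and cross-modal alignment.
Introduce DAP to supply attribute-based prompts that guide visual representations toward fine-grained details.
Demonstrate state-of-the-art TPS results across benchmarks and analyze each component's contribution.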

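Proposed Method

A baseline TPS model with dual pre-trained encoders (CLIP-based visual and textual backbones).
Token pooling to obtain fine-grained patch-level and word-level features for the two modalities.
Multi-Integrity Description Constraints (MIDC): generate incomplete attribute descriptions and enforce cross-modal and integrity losses.
Dynamic Attribute Prompt (DAP): generate attribute-based prompts that guide visual features with text-derived cues; the prompts are used to supervise the visual encoder.
Joint objective: L = L_cls + L_align + lambda0*L_int + lambda1*L_pmt, with additional prompt-side and integrity constraints.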
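
The joint objective above is a plain weighted sum, which can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `joint_objective`, the default weights, and the example loss values are all assumptions, since this summary does not give the values of lambda0 and lambda1.

```python
def joint_objective(l_cls, l_align, l_int, l_pmt, lambda0=1.0, lambda1=1.0):
    """Combine the four loss terms as L = L_cls + L_align + lambda0*L_int + lambda1*L_pmt.

    lambda0 weights the integrity loss (MIDC) and lambda1 weights the prompt
    loss (DAP). The defaults of 1.0 are placeholders, not values from the paper.
    """
    return l_cls + l_align + lambda0 * l_int + lambda1 * l_pmt


# Hypothetical per-batch loss values, chosen only to show the arithmetic.
total = joint_objective(0.9, 0.6, 0.4, 0.2, lambda0=0.5, lambda1=0.5)
print(f"total loss: {total:.2f}")
```

Because the function only uses addition and multiplication, the same code should work unchanged on tensor-valued losses from an autograd framework.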

Figure 1 : A typical Text-based Person Search model initialized with Vision-Language Pre-training models. The pre-trained vision encoders are used to initialize the TPS image representation, but textual encoders are altered by external language models such as LSTM or BERT, which is a asymmetry setti

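Experimental Results

Research Questions

RQ1: How can the text encoder be fully exploited in VLP-TPS to improve cross-modal alignment for TPS?
RQ2: Do MIDC and DAP, by using textual integrity and attribute prompts, bring measurable gains over the baseline?
RQ3: How much does integrating a CLIP-based text encoder affect TPS performance across benchmarks?
RQ4: How do MIDC and DAP interact to improve fine-grained attribute representations?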

Key Findings

Method (Dataset)	Rank-1 (%)	Rank-5 (%)	Rank-10 (%)	mAP (%)
TP-TPS (CUHK-PEDES)	70.16	86.10	90.98	66.32
VLP-TPS (CUHK-PEDES)	65.38	82.74	88.98	62.47
TP-TPS (ICFG-PEDES)	60.64	75.97	81.76	42.78
VLP-TPS (ICFG-PEDES)	56.79	72.70	78.98	40.59
TP-TPS (RSTPReid)	50.65	72.45	81.20	43.11
VLP-TPS (RSTPReid)	45.55	68.85	77.60	40.99
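
For readers unfamiliar with the column headings, the sketch below shows how Rank-k and mAP are commonly computed for retrieval, assuming the usual definitions: Rank-k is the share of text queries with at least one correct image among the top k results, and mAP is the mean of per-query average precision. The function name `rank_k_and_map` and the toy data are hypothetical, and the paper's exact evaluation protocol may differ.

```python
def rank_k_and_map(similarity, query_ids, gallery_ids, ks=(1, 5, 10)):
    """Return ({k: Rank-k in %}, mAP in %) for a text-to-image retrieval run.

    similarity[i][j] is the score between text query i and gallery image j;
    query_ids and gallery_ids hold the person identity of each item.
    """
    hits = {k: 0 for k in ks}
    average_precisions = []
    for scores, qid in zip(similarity, query_ids):
        # Gallery indices ordered from most to least similar.
        order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        matches = [gallery_ids[j] == qid for j in order]
        for k in ks:
            hits[k] += any(matches[:k])
        # Average precision: mean of the precision at each correct match.
        found, precisions = 0, []
        for rank, is_match in enumerate(matches, start=1):
            if is_match:
                found += 1
                precisions.append(found / rank)
        average_precisions.append(sum(precisions) / len(precisions) if precisions else 0.0)
    n = len(query_ids)
    rank_at = {k: 100.0 * hits[k] / n for k in ks}
    mean_ap = 100.0 * sum(average_precisions) / n
    return rank_at, mean_ap


# Toy example: 2 text queries, 3 gallery images (identities 0, 1, 0).
toy_similarity = [[0.9, 0.1, 0.5],
                  [0.8, 0.2, 0.7]]
rank_at, mean_ap = rank_k_and_map(toy_similarity, query_ids=[0, 1], gallery_ids=[0, 1, 0])
print(rank_at, round(mean_ap, 2))
```

In the toy run the first query ranks a correct image first, while the second query's only correct image comes last, so Rank-1 is 50% and Rank-5 is 100%.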

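TP-TPS achieves state-of-the-art results on CUHK-PEDES, with 70.16% Rank-1 and 66.32% mAP.
On ICFG-PEDES, TP-TPS reaches 60.64% Rank-1 and 42.78% mAP.
On RSTPReid, TP-TPS reaches 50.65% Rank-1 and 43.11% mAP.
Using the CLIP text encoder (CLIP-TE) improves baseline performance, and adding MIDC and DAP yields further gains over the VLP-TPS baseline.
MIDC delivers consistent improvements by enforcing textual integrity on partial descriptions, while DAP provides fine-grained attribute guidance that strengthens visual representations.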
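
The size of the gain from MIDC and DAP can be read directly off the results table by subtracting the VLP-TPS row from the TP-TPS row for each dataset. The short script below does that arithmetic; the numbers are copied from the table, while the dictionary layout and the function name `gains_over_baseline` are illustrative choices.

```python
METRICS = ("Rank-1", "Rank-5", "Rank-10", "mAP")

# Values copied from the results table (all in %).
RESULTS = {
    "CUHK-PEDES": {"TP-TPS": (70.16, 86.10, 90.98, 66.32),
                   "VLP-TPS": (65.38, 82.74, 88.98, 62.47)},
    "ICFG-PEDES": {"TP-TPS": (60.64, 75.97, 81.76, 42.78),
                   "VLP-TPS": (56.79, 72.70, 78.98, 40.59)},
    "RSTPReid": {"TP-TPS": (50.65, 72.45, 81.20, 43.11),
                 "VLP-TPS": (45.55, 68.85, 77.60, 40.99)},
}


def gains_over_baseline(results):
    """Return the TP-TPS minus VLP-TPS difference per dataset and metric, in points."""
    return {
        dataset: {
            metric: round(full - base, 2)
            for metric, full, base in zip(METRICS, rows["TP-TPS"], rows["VLP-TPS"])
        }
        for dataset, rows in results.items()
    }


for dataset, deltas in gains_over_baseline(RESULTS).items():
    print(dataset, deltas)
```

The Rank-1 gains come out to 4.78 points on CUHK-PEDES, 3.85 on ICFG-PEDES, and 5.10 on RSTPReid.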

Figure 2 : Overview pipeline of the proposed TP-TPS. A simple baseline framework is developed based on CLIP pre-trained model. Visual and textual token pooling operations are employed to represent token-level fine-grained features for both modalities. We further introduce the Multi-Integrity Descrip

This review was generated by AI and reviewed by a human editor.