QUICK REVIEW

[Paper Review] Exploiting the Textual Potential from Vision-Language Pre-training for Text-based Person Search

Guanshuo Wang, Fufu Yu|arXiv (Cornell University)|Mar 8, 2023
Video Surveillance and Tracking Methods | Cited by 13
One-Sentence Summary

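This paper proposes TP-TPS, a VLP-based framework for text-based person search. It uses both pre-trained encoders and introduces MIDC and DAP to exploit the textual potential of VLP for robust cross-modal alignment and fine-grained representations.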

ABSTRACT

Text-based Person Search (TPS), is targeted on retrieving pedestrians to match text descriptions instead of query images. Recent Vision-Language Pre-training (VLP) models can bring transferable knowledge to downstream TPS tasks, resulting in more efficient performance gains. However, existing TPS methods improved by VLP only utilize pre-trained visual encoders, neglecting the corresponding textual representation and breaking the significant modality alignment learned from large-scale pre-training. In this paper, we explore the full utilization of textual potential from VLP in TPS tasks. We build on the proposed VLP-TPS baseline model, which is the first TPS model with both pre-trained modalities. We propose the Multi-Integrity Description Constraints (MIDC) to enhance the robustness of the textual modality by incorporating different components of fine-grained corpus during training. Inspired by the prompt approach for zero-shot classification with VLP models, we propose the Dynamic Attribute Prompt (DAP) to provide a unified corpus of fine-grained attributes as language hints for the image modality. Extensive experiments show that our proposed TPS framework achieves state-of-the-art performance, exceeding the previous best method by a margin.

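Research Motivation and Objectives

  • Motivate the full use of vision-language pre-training (VLP) for text-based person search (TPS).
  • Develop a baseline VLP-TPS model that uses dual pre-trained encoders with only minimal fine-tuning.
  • Introduce MIDC to strengthen textual integrity and cross-modal alignment.
  • Introduce DAP to provide attribute-based prompts that guide the visual representation toward fine-grained details.
  • Demonstrate state-of-the-art TPS results across benchmarks and analyze the contribution of each component.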

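Proposed Method

  • Baseline TPS model with dual pre-trained encoders (CLIP-based visual and textual backbones).
  • Token pooling yields fine-grained patch-level and word-level features for the two modalities.
  • Multi-Integrity Description Constraints (MIDC): generate incomplete attribute descriptions and enforce cross-modal and integrity losses on them.
  • Dynamic Attribute Prompt (DAP): generates attribute-based prompts that guide the visual features through text-derived cues; the prompts also supervise the visual encoder.
  • Joint objective: L = L_cls + L_align + lambda0*L_int + lambda1*L_pmt, with additional prompt-side and integrity constraints.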
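The method list above can be made concrete with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the weighted sum follows the joint objective as written in this review, but the default weights, the leave-some-out way of building incomplete descriptions, the prompt template, and all function names are hypothetical. Reading L_cls as a classification term is also an assumption based on its subscript.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass
class LossTerms:
    """Scalar loss values for one training step (names mirror the review)."""
    cls: float     # L_cls (read here as a classification term; an assumption)
    align: float   # L_align: cross-modal alignment loss
    integ: float   # L_int: integrity loss associated with MIDC
    prompt: float  # L_pmt: prompt loss associated with DAP


def joint_loss(terms, lambda0=1.0, lambda1=1.0):
    """L = L_cls + L_align + lambda0 * L_int + lambda1 * L_pmt.

    The default weights are placeholders, not values from the paper.
    """
    return terms.cls + terms.align + lambda0 * terms.integ + lambda1 * terms.prompt


def incomplete_descriptions(phrases, keep):
    """Hypothetical MIDC-style helper: every description that keeps `keep`
    of the attribute phrases, preserving their original order."""
    return [", ".join(subset) for subset in combinations(phrases, keep)]


def attribute_prompt(attribute, template="a photo of a person wearing {}"):
    """Hypothetical DAP-style helper: wrap one attribute in a prompt template."""
    return template.format(attribute)


if __name__ == "__main__":
    phrases = ["a red jacket", "black jeans", "white sneakers"]
    print(incomplete_descriptions(phrases, keep=2))
    print(attribute_prompt(phrases[0]))
    print(joint_loss(LossTerms(cls=1.0, align=0.5, integ=0.2, prompt=0.4),
                     lambda0=0.5, lambda1=0.25))
```

In a real training loop the four terms would be tensors produced by the two encoders; plain floats are used here only to keep the sketch dependency-free.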
Figure 1: A typical Text-based Person Search model initialized with Vision-Language Pre-training models. The pre-trained vision encoders are used to initialize the TPS image representation, but textual encoders are altered by external language models such as LSTM or BERT, which is an asymmetry setti

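Experimental Results

Research Questions

  • RQ1: How can the text encoder be fully exploited in VLP-TPS to improve cross-modal alignment for TPS?
  • RQ2: Do MIDC and DAP, by exploiting textual integrity and attribute prompts, bring measurable gains over the baseline?
  • RQ3: How much does integrating a CLIP-based text encoder affect TPS performance across benchmarks?
  • RQ4: How do MIDC and DAP interact and contribute to improved fine-grained attribute representations?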

Key Findings

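| Method (Dataset) | Rank-1 | Rank-5 | Rank-10 | mAP |
| --- | --- | --- | --- | --- |
| TP-TPS (CUHK-PEDES) | 70.16 | 86.10 | 90.98 | 66.32 |
| VLP-TPS (CUHK-PEDES) | 65.38 | 82.74 | 88.98 | 62.47 |
| TP-TPS (ICFG-PEDES) | 60.64 | 75.97 | 81.76 | 42.78 |
| VLP-TPS (ICFG-PEDES) | 56.79 | 72.70 | 78.98 | 40.59 |
| TP-TPS (RSTPReid) | 50.65 | 72.45 | 81.20 | 43.11 |
| VLP-TPS (RSTPReid) | 45.55 | 68.85 | 77.60 | 40.99 |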
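
The Rank-1, Rank-5, Rank-10, and mAP columns are retrieval metrics. The sketch below shows how they are commonly computed from a text-to-image similarity matrix. This review does not spell out the paper's exact evaluation protocol, so the function name, the input layout, and the handling of ties are assumptions made for illustration.

```python
def rank_k_and_map(sims, query_ids, gallery_ids, ks=(1, 5, 10)):
    """Return ({'Rank-k': percent}, mAP percent) for a retrieval run.

    sims[q][g] is the similarity between text query q and gallery image g.
    Assumes every query has at least one gallery image with the same identity.
    """
    hits = {k: 0 for k in ks}
    ap_sum = 0.0
    for q, row in enumerate(sims):
        # Gallery indices sorted from most to least similar.
        order = sorted(range(len(row)), key=lambda g: row[g], reverse=True)
        matches = [gallery_ids[g] == query_ids[q] for g in order]
        if not any(matches):
            raise ValueError(f"query {q} has no matching gallery identity")
        # Rank-k counts a hit when any of the top-k results is correct.
        for k in ks:
            hits[k] += any(matches[:k])
        # Average precision: mean of precision values at each correct result.
        found, precision_sum = 0, 0.0
        for rank, is_match in enumerate(matches, start=1):
            if is_match:
                found += 1
                precision_sum += found / rank
        ap_sum += precision_sum / found
    n = len(sims)
    ranks = {f"Rank-{k}": 100.0 * hits[k] / n for k in ks}
    return ranks, 100.0 * ap_sum / n


# Toy example: 2 text queries, 3 gallery images.
sims = [[0.9, 0.1, 0.5],
        [0.9, 0.8, 0.7]]
ranks, mean_ap = rank_k_and_map(sims, query_ids=[1, 2], gallery_ids=[1, 2, 2])
print(ranks, round(mean_ap, 2))
```

In the toy example the second query ranks a wrong identity first, so it misses at Rank-1 but still contributes a non-zero average precision.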
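  • TP-TPS reaches state-of-the-art results on CUHK-PEDES, with Rank-1 of 70.16% and mAP of 66.32%.
  • On ICFG-PEDES, TP-TPS obtains Rank-1 of 60.64% and mAP of 42.78%.
  • On RSTPReid, TP-TPS reaches Rank-1 of 50.65% and mAP of 43.11%.
  • Using CLIP-TE (the CLIP text encoder) improves baseline performance, and adding MIDC and DAP yields further gains on top of the VLP-TPS baseline.
  • MIDC delivers consistent improvements by enforcing textual integrity on partial descriptions, while DAP provides fine-grained attribute guidance that strengthens the visual representation.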
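
To quantify the gains that the findings above describe, the short script below subtracts the VLP-TPS baseline from TP-TPS on each benchmark. The numbers are copied from the results table in this review; the variable and function names are illustrative only.

```python
METRICS = ("Rank-1", "Rank-5", "Rank-10", "mAP")

# Values copied from the results table in this review.
RESULTS = {
    "CUHK-PEDES": {"TP-TPS": (70.16, 86.10, 90.98, 66.32),
                   "VLP-TPS": (65.38, 82.74, 88.98, 62.47)},
    "ICFG-PEDES": {"TP-TPS": (60.64, 75.97, 81.76, 42.78),
                   "VLP-TPS": (56.79, 72.70, 78.98, 40.59)},
    "RSTPReid": {"TP-TPS": (50.65, 72.45, 81.20, 43.11),
                 "VLP-TPS": (45.55, 68.85, 77.60, 40.99)},
}


def absolute_gains(results):
    """Per-dataset gain of TP-TPS over VLP-TPS, in percentage points."""
    gains = {}
    for dataset, rows in results.items():
        full, base = rows["TP-TPS"], rows["VLP-TPS"]
        gains[dataset] = {m: round(f - b, 2)
                          for m, f, b in zip(METRICS, full, base)}
    return gains


GAINS = absolute_gains(RESULTS)
for dataset, row in GAINS.items():
    print(dataset, row)
```

With these numbers, the Rank-1 gain over the baseline is 4.78 points on CUHK-PEDES, 3.85 on ICFG-PEDES, and 5.10 on RSTPReid, and the mAP gain is 3.85, 2.19, and 2.12 points respectively.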
Figure 2: Overview pipeline of the proposed TP-TPS. A simple baseline framework is developed based on CLIP pre-trained model. Visual and textual token pooling operations are employed to represent token-level fine-grained features for both modalities. We further introduce the Multi-Integrity Descrip


This review was generated by AI and checked by a human editor.