QUICK REVIEW

[Paper Review] Text-Based Person Search with Limited Data

Han Xiao, Sen He | arXiv (Cornell University) | Oct 20, 2021

Multimodal Machine Learning Applications | References: 42 | Cited by: 38

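One-Sentence Summary

This paper proposes CM-MoCo, a cross-modal momentum contrastive learning framework, together with a transfer learning strategy from large-scale image-text data, to improve text-based person search when data is limited, achieving state-of-the-art results on CUHK-PEDES.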

ABSTRACT

Text-based person search (TBPS) aims at retrieving a target person from an image gallery with a descriptive text query. Solving such a fine-grained cross-modal retrieval task is challenging, which is further hampered by the lack of large-scale datasets. In this paper, we present a framework with two novel components to handle the problems brought by limited data. Firstly, to fully utilize the existing small-scale benchmarking datasets for more discriminative feature learning, we introduce a cross-modal momentum contrastive learning framework to enrich the training data for a given mini-batch. Secondly, we propose to transfer knowledge learned from existing coarse-grained large-scale datasets containing image-text pairs from drastically different problem domains to compensate for the lack of TBPS training data. A transfer learning method is designed so that useful information can be transferred despite the large domain gap. Armed with these components, our method achieves new state of the art on the CUHK-PEDES dataset with significant improvements over the prior art in terms of Rank-1 and mAP. Our code is available at https://github.com/BrandonHanx/TextReID.

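Motivation and Objectives

Address the scarcity of annotated data for text-based person search (TBPS) by making more efficient use of the limited benchmark datasets.
Improve discriminative power by enriching cross-modal negative samples through momentum-based contrastive learning.
Leverage knowledge from large-scale image-text pairs through a carefully designed cross-modal transfer learning strategy that mitigates the domain gap.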

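Proposed Method

Introduce cross-modal momentum contrastive learning (CM-MoCo), with separate visual and textual query encoders, momentum key encoders, and dedicated queues for visual features, textual features, and identities.
Construct a cross-modal contrastive loss in which query-encoder outputs serve as anchors, key-encoder outputs as positives, and queue entries as negatives.
Combine CM-MoCo with an alignment loss and an identity loss in an end-to-end training framework.
Propose a cross-modal transfer learning strategy that freezes the text encoder from a large pre-trained model and contextualizes the word embeddings with a Bi-GRU to narrow the domain gap.
Apply cross-modal k-reciprocal re-ranking as post-processing to further improve retrieval performance.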
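
The contrastive part of the method described above can be sketched in a few lines. This is a minimal pure-Python illustration under stated assumptions, not the authors' implementation. The function names (`momentum_update`, `cross_modal_info_nce`), the temperature of 0.07, and the momentum coefficient of 0.999 are hypothetical choices borrowed from common MoCo-style setups. Features are plain lists of floats assumed to be L2-normalized, and the identity queue is left out for brevity.

```python
import math


def momentum_update(query_params, key_params, m=0.999):
    """Move key-encoder parameters slowly toward the query encoder:
    key <- m * key + (1 - m) * query."""
    return [m * k + (1.0 - m) * q for q, k in zip(query_params, key_params)]


def dot(a, b):
    """Inner product of two equal-length feature vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_modal_info_nce(anchor, positive_key, negative_queue, temperature=0.07):
    """InfoNCE loss for a single anchor.

    anchor:         feature from one modality's query encoder (e.g. text)
    positive_key:   matching feature from the other modality's key encoder
    negative_queue: stored key features from the other modality
    """
    logits = [dot(anchor, positive_key) / temperature]
    logits += [dot(anchor, neg) / temperature for neg in negative_queue]
    # Numerically stable log-sum-exp over the positive and all negatives.
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    # Cross-entropy with the positive at index 0.
    return log_sum_exp - logits[0]


# Toy usage: a text anchor scored against image keys.
text_query = [1.0, 0.0]
image_queue = [[0.0, 1.0], [-1.0, 0.0]]
loss_aligned = cross_modal_info_nce(text_query, [1.0, 0.0], image_queue)
loss_misaligned = cross_modal_info_nce(text_query, [0.0, 1.0], image_queue)
```

In this toy setup the loss is close to zero when the positive key matches the anchor and grows when it does not. In a full training loop the same loss would presumably be computed in both directions (text to image and image to text) and added to the alignment and identity losses.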

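Experimental Results

Research Questions

RQ1: Can cross-modal momentum contrastive learning (CM-MoCo) make effective use of limited TBPS data by decoupling the number of negative samples from the batch size?
RQ2: Does transferring knowledge from large-scale image-text pre-training help TBPS when the domain gap is large, and how should the transfer be carried out to avoid negative transfer?
RQ3: Which combination of CM-MoCo, alignment loss, and identity loss yields the best TBPS performance on CUHK-PEDES?
RQ4: Which transfer learning design for the text modality best mitigates the domain gap between TBPS data and generic image-text data?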

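Key Findings

On CUHK-PEDES, CM-MoCo significantly outperforms the baseline on both text-to-image and image-to-text retrieval.
Transferring only the word embeddings from a large image-text dataset (via a frozen CLIP text encoder with Bi-GRU contextualization) yields significant gains and avoids negative transfer.
Using a larger cross-modal queue in CM-MoCo (e.g., 1024 or 2048) generally improves performance, but an overly large queue can hurt when data is scarce.
The proposed text-stream transfer strategy (word embeddings plus contextualization) outperforms naive whole-model transfer.
Integrating CM-MoCo brings consistent gains across models, with an average improvement of about 1.5% on Rank metrics.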
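
The queue-size finding is easier to picture with a small sketch of a fixed-size feature queue. The code below is illustrative only: `FeatureQueue` and its methods are hypothetical names, and filtering out queued samples that share the anchor's identity is an assumption about how an identity queue might be used, not a detail confirmed by this summary.

```python
from collections import deque


class FeatureQueue:
    """Fixed-size FIFO store of key features and their identity labels."""

    def __init__(self, max_size):
        # deque(maxlen=...) drops the oldest entries automatically once full.
        self.features = deque(maxlen=max_size)
        self.identities = deque(maxlen=max_size)

    def enqueue(self, batch_features, batch_identities):
        """Append the newest mini-batch of key features and identities."""
        self.features.extend(batch_features)
        self.identities.extend(batch_identities)

    def negatives_for(self, identity):
        """Return queued features usable as negatives for this identity.
        Assumption: entries with the same identity are skipped."""
        return [f for f, pid in zip(self.features, self.identities) if pid != identity]


# Toy usage: three mini-batches of size 2 pushed into a queue of size 4.
queue = FeatureQueue(max_size=4)
for step in range(3):
    batch_features = [[float(step), 0.0], [float(step), 1.0]]
    batch_identities = [2 * step, 2 * step + 1]
    queue.enqueue(batch_features, batch_identities)

negatives = queue.negatives_for(identity=3)
```

The number of available negatives is bounded by `max_size` rather than by the mini-batch size, which is the decoupling asked about in RQ1. A very large queue also keeps features produced by older encoder states, which is one plausible reason why oversized queues may hurt on a small dataset.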

This review was generated by AI and checked by human editors.