QUICK REVIEW

[Paper Review] Person Search with Natural Language Description

Shuang Li, Tong Xiao|arXiv (Cornell University)|Feb 19, 2017

Multimodal Machine Learning Applications | References: 41 | Cited by: 63

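One-Sentence Summary

This paper introduces CUHK-PEDES, a large-scale dataset for person retrieval from natural language descriptions, and proposes GNA-RNN, a recurrent neural network with gated neural attention that aligns the words of a sentence with visual units to rank person images.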

ABSTRACT

Searching persons in large-scale image databases with the query of natural language description has important applications in video surveillance. Existing methods mainly focused on searching persons with image-based or attribute-based queries, which have major limitations for practical usage. In this paper, we study the problem of person search with natural language description. Given the textual description of a person, the algorithm of the person search is required to rank all the samples in the person database then retrieve the most relevant sample corresponding to the queried description. Since there is no person dataset or benchmark with textual description available, we collect a large-scale person description dataset with detailed natural language annotations and person samples from various sources, termed as CUHK Person Description Dataset (CUHK-PEDES). A wide range of possible models and baselines have been evaluated and compared on the person search benchmark. A Recurrent Neural Network with Gated Neural Attention mechanism (GNA-RNN) is proposed to establish the state-of-the-art performance on person search.

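Research Motivation and Goals

Advance practical person search by using free-form language descriptions rather than relying on query images or predefined attributes.
Build a large-scale dataset (CUHK-PEDES) that provides rich natural language annotations for person images drawn from re-identification collections.
Evaluate a range of baselines from the captioning, question answering, and embedding paradigms on language-guided person retrieval.
Propose GNA-RNN, which learns word-image affinities through gated neural attention for robust retrieval.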

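Proposed Method

Introduce CUHK-PEDES, which contains 40,206 images of 13,003 persons and 80,412 sentences describing their appearance.
Build a visual sub-network on a VGG-16-style backbone that produces 512 visual units.
Use a language sub-network (LSTM) to generate unit-level attention over the visual units for each word.
Introduce word-level gates that weight the importance of the different words in a sentence.
Compute each word's affinity as a weighted sum of the visual unit responses, then aggregate over words to obtain the final affinity.
Train end to end with a cross-entropy loss on positive and negative sentence-image pairs, at a positive-to-negative ratio of 1:3.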
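
The affinity computation listed above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: it assumes the unit-level attention is a softmax over per-word logits and the word-level gate is a sigmoid of a scalar logit, and it takes those logits as given instead of producing them with an LSTM. The function names, the three-unit toy input (the paper uses 512 units), and the clamping inside the loss are all hypothetical choices made for the example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def word_affinity(visual_units, attention_logits, gate_logit):
    """Affinity of one word: gate * attention-weighted sum of visual units.

    Assumption: attention is a softmax and the gate is a sigmoid.
    """
    attention = softmax(attention_logits)
    weighted = sum(a * v for a, v in zip(attention, visual_units))
    return sigmoid(gate_logit) * weighted


def sentence_affinity(visual_units, attention_logits_per_word, gate_logits):
    """Aggregate the per-word affinities by summing over the sentence."""
    return sum(
        word_affinity(visual_units, logits, gate)
        for logits, gate in zip(attention_logits_per_word, gate_logits)
    )


def binary_cross_entropy(affinity, label, eps=1e-7):
    """Cross-entropy for one sentence-image pair (label 1 = matching pair).

    Assumption: the affinity is clamped to (0, 1) so the logarithms are defined.
    """
    p = min(max(affinity, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


# Hypothetical toy input: 3 visual units and a 2-word sentence.
units = [0.9, 0.1, 0.5]
attention_logits = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
gate_logits = [0.0, 0.0]

affinity = sentence_affinity(units, attention_logits, gate_logits)
loss = binary_cross_entropy(affinity, label=1)
print(affinity, loss)
```

In a trained model the attention and gate logits would come from the language sub-network at each word, and each mini-batch would mix matching and non-matching pairs at the 1:3 ratio described above.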

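Experimental Results

Research Questions

RQ1: Can natural language descriptions outperform attribute-based queries in large-scale person retrieval?
RQ2: Which dataset and model structure best capture the word-image relations in person descriptions?
RQ3: Does the gated neural attention mechanism improve sentence-to-image affinity over baselines such as image captioning and visual-semantic embedding?
RQ4: How do word types and sentence length affect the effectiveness of language-guided person retrieval?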

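Key Findings

CUHK-PEDES provides 40,206 images and 80,412 sentences, giving language-based person retrieval a large-scale benchmark.
GNA-RNN achieves state-of-the-art results on the proposed dataset, outperforming captioning, question answering, and embedding baselines in top-1 and top-10 accuracy.
Both unit-level attention and word-level gates contribute clearly to performance; removing either one degrades the results.
Pre-training the visual backbone on person re-identification data substantially improves performance.
Among the unit counts tested, 512 visual units gave the best performance.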
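
For readers unfamiliar with the metric, top-k accuracy for text-to-image retrieval can be sketched as follows. This is a minimal illustration under one assumption: a query counts as a hit when any gallery image of the described person appears among the k highest-scoring images. The function name and the toy scores are hypothetical and are not results from the paper.

```python
def top_k_accuracy(score_rows, relevant_sets, k):
    """Fraction of queries with at least one relevant image in the top k.

    score_rows[q][i] is the affinity between query q and gallery image i.
    relevant_sets[q] is the set of gallery indices showing the queried person.
    """
    hits = 0
    for scores, relevant in zip(score_rows, relevant_sets):
        # Rank gallery indices by descending affinity.
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        if any(i in relevant for i in ranked[:k]):
            hits += 1
    return hits / len(score_rows)


# Hypothetical toy data: 2 text queries scored against a 3-image gallery.
scores = [
    [0.9, 0.2, 0.4],  # query 0: image 0 ranks first
    [0.1, 0.8, 0.3],  # query 1: image 1 ranks first, image 2 second
]
relevant = [{0}, {2}]

print(top_k_accuracy(scores, relevant, k=1))
print(top_k_accuracy(scores, relevant, k=2))
```

Raising k can only keep or increase the score, which is why top-10 accuracy is always at least as high as top-1 accuracy for the same model.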

This review was generated by AI and reviewed by human editors.