QUICK REVIEW

[Paper Review] Cascade Attention Network for Person Search: Both Image and Text-Image Similarity Selection.

Ya Jing, Chenyang Si | arXiv (Cornell University) | Sep 22, 2018

Multimodal Machine Learning Applications | References: 16 | Cited by: 10

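One-Sentence Summary

This paper proposes a pose-guided multi-granularity attention network (PMA) for text-based person search, which applies coarse- and fine-grained attention and uses pose information to align the global description and phrase-level semantics with the corresponding image regions. On the CUHK-PEDES dataset, the method outperforms state-of-the-art methods by 15% in top-1 retrieval accuracy.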

ABSTRACT

Text-based person search aims to retrieve the corresponding person images in an image database by virtue of a describing sentence about the person, which poses great potential for various applications such as video surveillance. Extracting visual contents corresponding to the human description is the key to this cross-modal matching problem. Moreover, correlated images and descriptions involve different granularities of semantic relevance, which is usually ignored in previous methods. To exploit the multilevel corresponding visual contents, we propose a pose-guided multi-granularity attention network (PMA). Firstly, we propose a coarse alignment network (CA) to select the related image regions to the global description by a similarity-based attention. To further capture the phrase-related visual body part, a fine-grained alignment network (FA) is proposed, which employs pose information to learn latent semantic alignment between visual body part and textual noun phrase. To verify the effectiveness of our model, we perform extensive experiments on the CUHK Person Description Dataset (CUHK-PEDES) which is currently the only available dataset for text-based person search. Experimental results show that our approach outperforms the state-of-the-art methods by 15% in terms of the top-1 metric.

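Research Motivation and Objectives

Address the challenge of cross-modal person search by aligning natural-language descriptions with the relevant image regions.
Overcome the limitation of prior methods, which ignore the multiple levels of semantic granularity in image-text correspondence.
Improve retrieval accuracy by modeling both global and phrase-level visual-language alignment.
Use human pose information to strengthen fine-grained alignment between body parts and the noun phrases in a description.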

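Proposed Method

Proposes a coarse alignment network (CA) that uses similarity-based attention to select the image regions related to the global person description.
Designs a fine-grained alignment network (FA) that uses pose estimation to guide attention between specific body parts and noun phrases in the text.
Integrates the two networks in a cascade, refining visual-semantic matching progressively from coarse to fine granularity.
Uses pose information as a supervisory signal to improve the accuracy of phrase-level alignment.
Trains the model end to end on the CUHK-PEDES dataset with a joint optimization objective for learning image and text embeddings.
Applies attention to weight relevant visual features dynamically according to the semantics of the text query.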
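
To make the idea of similarity-based attention in the list above concrete, here is a minimal sketch in plain Python. It is not the paper's CA network: the cosine similarity, the softmax with a `temperature` parameter, the function names (`cosine`, `softmax`, `attend`), and the toy two-dimensional features are all assumptions chosen for illustration. The review does not say how the paper scores or selects regions, so this shows only a generic soft attention in which regions more similar to the sentence embedding receive larger weights.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)

def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def attend(region_feats, text_feat, temperature=0.1):
    """Weight image regions by their similarity to a text embedding.

    Returns (weights, attended_feature). The temperature is an assumed
    hyperparameter, not a value taken from the paper.
    """
    sims = [cosine(r, text_feat) for r in region_feats]
    weights = softmax(sims, temperature)
    dim = len(text_feat)
    attended = [
        sum(w * r[d] for w, r in zip(weights, region_feats))
        for d in range(dim)
    ]
    return weights, attended

# Toy example: three region features and one sentence embedding (made-up values).
regions = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
sentence = [1.0, 0.1]
weights, attended = attend(regions, sentence)
```

In this toy input the first region points in almost the same direction as the sentence embedding, so it receives the largest weight. A fine-grained variant could, under the same assumptions, swap the sentence embedding for a noun-phrase embedding and the region features for pose-localized body-part features.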

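Experimental Results

Research Questions

RQ1: How does multi-granularity semantic alignment between text descriptions and image regions improve person search performance?
RQ2: How much does pose-guided attention improve fine-grained alignment in text-based person search?
RQ3: Can a cascaded attention mechanism that combines global and phrase-level alignment outperform existing single-granularity methods?
RQ4: How does the proposed method compare with state-of-the-art models on the CUHK-PEDES benchmark?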

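Key Findings

The proposed method outperforms state-of-the-art methods by 15% in top-1 retrieval accuracy on the CUHK-PEDES dataset.
The coarse alignment network effectively identifies the image regions related to the global person description.
Guided by pose information, the fine-grained alignment network markedly improves alignment between specific body parts and text phrases.
The cascaded attention mechanism progressively refines visual-semantic matching across multiple granularity levels.
The model shows strong generalization by effectively capturing both global and local semantic relevance in text-image matching.
Using pose information as a supervisory signal strengthens the model's ability to localize the body parts that correspond to noun phrases.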
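
The headline result above is stated in terms of the top-1 metric. As a rough illustration of how a top-k retrieval accuracy can be computed from a query-by-gallery similarity matrix, consider the sketch below. The function name `top_k_accuracy`, the identity labels, and the similarity values are hypothetical toy data, not numbers from the paper, and the benchmark's official evaluation protocol may differ in its details.

```python
def top_k_accuracy(similarity, gallery_ids, query_ids, k=1):
    """Fraction of text queries whose k best-scoring gallery images
    include at least one image of the correct person identity."""
    hits = 0
    for scores, query_id in zip(similarity, query_ids):
        # Rank gallery indices from highest to lowest similarity.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if any(gallery_ids[j] == query_id for j in ranked[:k]):
            hits += 1
    return hits / len(query_ids)

# Toy data: 3 text queries scored against 3 gallery images (made-up values).
sim = [
    [0.9, 0.2, 0.1],  # query 0, true identity "a"
    [0.3, 0.4, 0.8],  # query 1, true identity "b"
    [0.5, 0.6, 0.1],  # query 2, true identity "a"
]
gallery_ids = ["a", "b", "c"]
query_ids = ["a", "b", "a"]

top1 = top_k_accuracy(sim, gallery_ids, query_ids, k=1)
top2 = top_k_accuracy(sim, gallery_ids, query_ids, k=2)
```

Here only the first query ranks its correct identity first, so `top1` is one third, while every query finds its identity within the two best matches, so `top2` is 1.0.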


This review was generated by AI and reviewed by a human editor.