QUICK REVIEW

[Paper Review] Semantically Self-Aligned Network for Text-to-Image Part-aware Person Re-identification

Zefeng Ding, Changxing Ding|arXiv (Cornell University)|Jul 27, 2021

Video Surveillance and Tracking Methods|References: 63|Cited by: 79

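One-sentence summary

SSAN automatically learns semantically aligned part-level visual and textual features for text-to-image person ReID, using a Word Attention Module, multi-view non-local relations, and a Compound Ranking loss; the paper also introduces a new dataset, ICFG-PEDES.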

ABSTRACT

Text-to-image person re-identification (ReID) aims to search for images containing a person of interest using textual descriptions. However, due to the significant modality gap and the large intra-class variance in textual descriptions, text-to-image ReID remains a challenging problem. Accordingly, in this paper, we propose a Semantically Self-Aligned Network (SSAN) to handle the above problems. First, we propose a novel method that automatically extracts semantically aligned part-level features from the two modalities. Second, we design a multi-view non-local network that captures the relationships between body parts, thereby establishing better correspondences between body parts and noun phrases. Third, we introduce a Compound Ranking (CR) loss that makes use of textual descriptions for other images of the same identity to provide extra supervision, thereby effectively reducing the intra-class variance in textual features. Finally, to expedite future research in text-to-image ReID, we build a new database named ICFG-PEDES. Extensive experiments demonstrate that SSAN outperforms state-of-the-art approaches by significant margins. Both the new ICFG-PEDES database and the SSAN code are available at https://github.com/zifyloo/SSAN.

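Motivation and objectives

Address the challenges of cross-modal text-to-image ReID: large intra-class variance in textual descriptions and variability in how words map to body parts.
Automatically derive part-level textual features aligned with visual regions, without relying on external tools.
Model relationships between body parts through multi-view non-local interactions so that noun phrases are matched more accurately.
Reduce intra-class variance in textual features with a Compound Ranking loss that exploits descriptions of other images of the same identity.
Provide a new, more challenging, identity-centric dataset (ICFG-PEDES) to advance text-to-image ReID research.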

Proposed method

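Extract part-level visual features by uniformly partitioning the visual feature map.
Process the description with a bidirectional LSTM to obtain word representations.
Use a Word Attention Module (WAM) to predict word-part associations and generate part-level textual features.
Apply a shared 1x1 convolution to project the global visual and textual features into a common space (global branch).
In the part branch, introduce Part-specific Feature Learning (PFL) and Part Relation Learning (PRL) to obtain semantically aligned part features.
Use a Multi-View Non-Local Network (MV-NLN) in both modalities to capture relationships within and between parts and refine the part features.
Propose a Compound Ranking (CR) loss that combines strongly and weakly supervised terms, with an adaptive margin, so that descriptions of other images of the same identity serve as additional supervision.
Train with the global, PFL, and PRL features using the ID loss together with the CR loss; at inference, sum the three similarity scores, one per feature type (S_g, S_l, S_n).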
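
The part-level steps above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation. The function names are hypothetical, the feature map is reduced to a list of row vectors, max pooling and cosine similarity are assumed choices, and the word-part scores are hand-set here rather than predicted by a learned WAM. It only shows how a uniform partition, attention-weighted word pooling, and the final score sum fit together.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def visual_part_features(feature_rows, num_parts):
    """Split feature-map rows into equal horizontal stripes, then max-pool
    each stripe per channel (pooling choice is an assumption).
    Trailing rows are ignored if the row count is not divisible by num_parts."""
    rows_per_part = len(feature_rows) // num_parts
    parts = []
    for k in range(num_parts):
        stripe = feature_rows[k * rows_per_part:(k + 1) * rows_per_part]
        parts.append([max(channel) for channel in zip(*stripe)])
    return parts


def textual_part_features(word_feats, word_part_scores, num_parts):
    """For each part k, scale every word feature by that word's score for
    part k, then max-pool over words (pooling choice is an assumption)."""
    parts = []
    for k in range(num_parts):
        weighted = [[scores[k] * x for x in feat]
                    for feat, scores in zip(word_feats, word_part_scores)]
        parts.append([max(channel) for channel in zip(*weighted)])
    return parts


def part_similarity(visual_parts, textual_parts):
    """Average cosine similarity between corresponding visual and textual parts."""
    sims = [cosine(v, t) for v, t in zip(visual_parts, textual_parts)]
    return sum(sims) / len(sims)


def total_similarity(s_g, s_l, s_n):
    """Inference score: plain sum of the three similarity scores."""
    return s_g + s_l + s_n


# Toy data: 4 feature-map rows with 2 channels, and 2 words with 2 channels.
feature_rows = [[1.0, 0.0], [0.5, 0.2], [0.0, 1.0], [0.1, 0.8]]
word_feats = [[1.0, 0.0], [0.0, 1.0]]
# Hypothetical word-part scores: word 0 belongs mostly to part 0, word 1 to part 1.
word_part_scores = [[0.9, 0.1], [0.2, 0.8]]

visual_parts = visual_part_features(feature_rows, num_parts=2)
textual_parts = textual_part_features(word_feats, word_part_scores, num_parts=2)
s_l = part_similarity(visual_parts, textual_parts)
# Reversing the textual parts breaks the correspondence and lowers the score.
s_l_swapped = part_similarity(visual_parts, textual_parts[::-1])
```

In this toy setup the aligned pairing scores higher than the swapped one, which is the behavior the part-level alignment is meant to produce.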

Experimental results

Research questions

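RQ1: Can semantically self-aligned part features be extracted automatically for text-to-image ReID without external text-processing tools?
RQ2: Does modeling inter-part relations with MV-NLN improve cross-modal alignment and retrieval performance?
RQ3: Can a Compound Ranking loss that exploits descriptions of other images of the same identity reduce intra-class variance in textual features?
RQ4: Does the proposed SSAN architecture outperform existing text-to-image ReID methods on both the standard dataset and the newly introduced one?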
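
To make the loss in RQ3 more concrete, the following is a rough sketch of how a ranking loss might combine a strongly supervised term (the image's own description) with a weakly supervised term (a description of another image of the same identity). This summary does not give the exact formula, so everything here is an assumption for illustration: the function names, the default margin and weight values, and in particular the adaptive-margin rule, which simply scales the base margin by the similarity between the two descriptions.

```python
def hinge(margin, pos_sim, neg_sim):
    """Standard ranking hinge: zero once the positive beats the negative by the margin."""
    return max(margin - pos_sim + neg_sim, 0.0)


def compound_ranking_loss(s_pos, s_weak_pos, s_neg_text, s_neg_image,
                          text_text_sim, base_margin=0.2, weak_weight=0.1):
    """Illustrative compound ranking loss (not the paper's exact definition).

    s_pos         -- similarity of the image and its own description
    s_weak_pos    -- similarity of the image and a description of another
                     image of the same identity
    s_neg_text    -- similarity of the image and a non-matching description
    s_neg_image   -- similarity of a non-matching image and the description
    text_text_sim -- similarity in [-1, 1] between the two same-identity
                     descriptions, used for the assumed adaptive margin
    """
    # Strongly supervised term: the matched pair must beat both negatives.
    strong = (hinge(base_margin, s_pos, s_neg_text)
              + hinge(base_margin, s_pos, s_neg_image))

    # Assumed adaptive margin: the more the other description resembles the
    # paired one, the closer its margin gets to the base margin.
    weak_margin = base_margin * (text_text_sim + 1.0) / 2.0

    # Weakly supervised term, built from the other description and down-weighted.
    weak = (hinge(weak_margin, s_weak_pos, s_neg_text)
            + hinge(weak_margin, s_weak_pos, s_neg_image))

    return strong + weak_weight * weak


# A hard case: one negative description scores close to the positive one.
hard_loss = compound_ranking_loss(
    s_pos=0.8, s_weak_pos=0.5, s_neg_text=0.7, s_neg_image=0.3,
    text_text_sim=0.6)

# An easy case: both positives are far ahead of the negatives.
easy_loss = compound_ranking_loss(
    s_pos=0.9, s_weak_pos=0.9, s_neg_text=0.1, s_neg_image=0.1,
    text_text_sim=0.6)
```

The weak term pulls the image toward every description of the same identity, which is how this kind of loss could reduce intra-class variance in the textual features.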

Key findings

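Adding PFL (Part-specific Feature Learning) raises Rank-1 accuracy over the baseline by 4.58 percentage points on CUHK-PEDES and 3.56 percentage points on ICFG-PEDES.
Adding PRL (Part Relation Learning) yields a further 1.33 points (CUHK-PEDES) and 0.95 points (ICFG-PEDES).
Introducing the CR loss adds another 1.62 points (CUHK-PEDES) and 1.21 points (ICFG-PEDES) in Rank-1 accuracy.
SSAN outperforms existing methods on CUHK-PEDES, exceeding ViTAA by 5.4% in Rank-1 accuracy, and remains competitive at the other ranks.
The full model (Global + PFL + MV-NLN + CR loss) achieves the strongest results, clearly ahead of the baseline and of earlier part-based feature methods.
The authors release ICFG-PEDES as an identity-centric, fine-grained dataset with longer descriptions and more challenging images to support future research.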
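
As a quick check on the ablation figures above, the per-component Rank-1 gains can be added up. This assumes the three gains stack sequentially (baseline, then PFL, then PRL, then CR loss), as the "further" and "another" wording suggests; the variable and function names are illustrative only.

```python
# Rank-1 gains in percentage points, as reported in the key findings.
ablation_gains = {
    "CUHK-PEDES": {"PFL": 4.58, "PRL": 1.33, "CR loss": 1.62},
    "ICFG-PEDES": {"PFL": 3.56, "PRL": 0.95, "CR loss": 1.21},
}


def cumulative_gain(gains):
    """Total Rank-1 gain over the baseline, assuming the gains are additive."""
    return round(sum(gains.values()), 2)


totals = {dataset: cumulative_gain(gains)
          for dataset, gains in ablation_gains.items()}

for dataset, total in totals.items():
    print(f"{dataset}: +{total:.2f} Rank-1 points over the baseline")
```

Under that assumption the full model sits about 7.53 points above the baseline on CUHK-PEDES and about 5.72 points above it on ICFG-PEDES.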


This review was generated by AI and checked by human editors.