QUICK REVIEW

[Paper Review] Contextual Non-Local Alignment over Full-Scale Representation for Text-Based Person Search

Chenyang Gao, Guanyu Cai|arXiv (Cornell University)|Jan 8, 2021

Video Surveillance and Tracking Methods | References: 24 | Cited by: 61

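ONE-SENTENCE SUMMARY

NAFS builds full-scale representations of both images and text, adaptively aligns features across scales with contextual non-local attention, and achieves state-of-the-art results on the CUHK-PEDES dataset.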

ABSTRACT

Text-based person search aims at retrieving a target person in an image gallery using a descriptive sentence of that person. It is very challenging since the modal gap makes effectively extracting discriminative features more difficult. Moreover, the inter-class variance of both pedestrian images and descriptions is small, so comprehensive information is needed to align visual and textual clues across all scales. Most existing methods merely consider the local alignment between images and texts within a single scale (e.g., only global scale or only partial scale) and then simply construct alignment at each scale separately. To address this problem, we propose a method that is able to adaptively align image and textual features across all scales, called NAFS (i.e., Non-local Alignment over Full-Scale representations). Firstly, a novel staircase network structure is proposed to extract full-scale image features with better locality. Secondly, a BERT with locality-constrained attention is proposed to obtain representations of descriptions at different scales. Then, instead of separately aligning features at each scale, a novel contextual non-local attention mechanism is applied to simultaneously discover latent alignments across all scales. The experimental results show that our method outperforms the state-of-the-art methods by 5.53% in terms of top-1 and 5.35% in terms of top-5 on a text-based person search dataset. The code is available at https://github.com/TencentYoutuResearch/PersonReID-NAFS

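MOTIVATION AND OBJECTIVES

Advance text-based person search and address the small inter-class variance of both pedestrian images and descriptions.
Learn image and text representations at multiple scales to capture comprehensive cross-modal clues.
Develop an adaptive alignment mechanism that jointly aligns features across all scales.
Propose a re-ranking method to further refine retrieval results.
Demonstrate effectiveness through extensive experiments on CUHK-PEDES.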

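PROPOSED METHOD

Introduce a staircase CNN backbone to extract full-scale visual representations with improved locality.
Use a BERT with locality-constrained attention to obtain multi-scale text representations.
Propose contextual non-local attention to jointly align images and text across all scales.
Adopt a cross-scale alignment loss (CSAL) that combines image-to-text and text-to-image similarities.
Introduce re-ranking by visual neighbors (RVN) to improve the final ranking.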
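
To make the idea of aligning across all scales at once more concrete, here is a minimal sketch in plain Python. It is not the paper's formulation: it uses generic scaled dot-product attention as a stand-in for contextual non-local attention, and the function names (attend, cross_scale_similarity) and the toy vectors are hypothetical. The only point it illustrates is that features from every scale are pooled into one set per modality, so a text feature at any scale can attend to an image feature at any scale.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    # Subtract the max for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cosine(u, v):
    nu, nv = math.sqrt(dot(u, u)), math.sqrt(dot(v, v))
    return dot(u, v) / (nu * nv) if nu and nv else 0.0

def attend(query, keys):
    """Attention-weighted sum of `keys` for one query vector.

    Generic scaled dot-product attention (an assumption, not the
    paper's contextual non-local attention).
    """
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(keys[0])
    return [sum(w * k[i] for w, k in zip(weights, keys)) for i in range(dim)]

def cross_scale_similarity(image_feats, text_feats):
    """Mean cosine similarity between each text feature and the image
    context it attends to. Both lists pool features from ALL scales
    (global, region, patch / sentence, phrase, word) into one set.
    """
    sims = [cosine(t, attend(t, image_feats)) for t in text_feats]
    return sum(sims) / len(sims)

if __name__ == "__main__":
    # Toy 3-d features, invented for illustration only.
    image_feats = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
    matching_text = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
    unrelated_text = [[0.0, 0.0, 1.0]]
    print(cross_scale_similarity(image_feats, matching_text))
    print(cross_scale_similarity(image_feats, unrelated_text))
```

In this toy setup the matching description scores higher than the unrelated one, because each of its features finds a similar image feature to attend to regardless of which scale that feature came from.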

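EXPERIMENTAL RESULTS

RESEARCH QUESTIONS

RQ1: How can full-scale (multi-scale) image and text representations be extracted for text-based person search?
RQ2: Does adaptive cross-scale alignment improve matching over fixed-scale pairing?
RQ3: What is the impact of contextual non-local attention on cross-modal alignment across scales?
RQ4: Can re-ranking by visual neighbors further improve retrieval performance?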

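KEY FINDINGS

NAFS with full-scale representations reaches 59.94% Top-1 and 79.86% Top-5 accuracy on CUHK-PEDES, surpassing several baselines.
NAFS combined with RVN further raises Top-1 to 61.50% and Top-5 to 81.19%.
Joint alignment across scales outperforms separate per-scale alignment (Top-1: 59.94% vs. 57.98%).
Combining the staircase network, the locality-constrained BERT, and contextual non-local attention yields a significant improvement over the baseline.
Ablation analysis shows that mid-scale information (scale 2) is beneficial in addition to the global and finest-scale features.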
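
For readers unfamiliar with the Top-1 and Top-5 figures above, the following sketch shows how a Top-k retrieval accuracy is commonly computed: a text query counts as a hit if the described person appears among the first k gallery results. The function name top_k_accuracy and the toy rankings are hypothetical and are not taken from the paper or its code.

```python
def top_k_accuracy(ranked_ids, true_ids, k):
    """Fraction of queries whose target ID appears in the top-k results.

    ranked_ids[i]: gallery person IDs for query i, best match first.
    true_ids[i]:   the person ID that query i actually describes.
    """
    if not ranked_ids or len(ranked_ids) != len(true_ids):
        raise ValueError("need one non-empty ranking per query")
    hits = sum(
        1 for ranking, target in zip(ranked_ids, true_ids)
        if target in ranking[:k]
    )
    return hits / len(true_ids)

if __name__ == "__main__":
    # Toy data: 4 text queries over a gallery of 3 identities.
    rankings = [[3, 1, 2], [2, 3, 1], [1, 2, 3], [3, 2, 1]]
    targets = [3, 1, 2, 1]
    for k in (1, 2, 3):
        print(f"Top-{k}: {top_k_accuracy(rankings, targets, k):.2%}")
```

A re-ranking step such as RVN changes the order inside each ranking before this metric is computed, which is why it can raise Top-1 without changing the underlying features.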

This review was generated by AI and reviewed by human editors.