QUICK REVIEW

[Paper Review] TransMatcher: Deep Image Matching Through Transformers for Generalizable Person Re-identification

Shengcai Liao, Ling Shao|arXiv (Cornell University)|May 30, 2021

Video Surveillance and Tracking Methods|Cited by 38

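One-sentence summary

TransMatcher adapts the Transformer to image matching with a simplified cross-image matching decoder and global max pooling, giving efficient, generalizable person re-identification and state-of-the-art results on several datasets.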

ABSTRACT

Transformers have recently gained increasing attention in computer vision. However, existing studies mostly use Transformers for feature representation learning, e.g. for image classification and dense predictions, and the generalizability of Transformers is unknown. In this work, we further investigate the possibility of applying Transformers for image matching and metric learning given pairs of images. We find that the Vision Transformer (ViT) and the vanilla Transformer with decoders are not adequate for image matching due to their lack of image-to-image attention. Thus, we further design two naive solutions, i.e. query-gallery concatenation in ViT, and query-gallery cross-attention in the vanilla Transformer. The latter improves the performance, but it is still limited. This implies that the attention mechanism in Transformers is primarily designed for global feature aggregation, which is not naturally suitable for image matching. Accordingly, we propose a new simplified decoder, which drops the full attention implementation with the softmax weighting, keeping only the query-key similarity computation. Additionally, global max pooling and a multilayer perceptron (MLP) head are applied to decode the matching result. This way, the simplified decoder is computationally more efficient, while at the same time more effective for image matching. The proposed method, called TransMatcher, achieves state-of-the-art performance in generalizable person re-identification, with up to 6.1% and 5.7% performance gains in Rank-1 and mAP, respectively, on several popular datasets. Code is available at https://github.com/ShengcaiLiao/QAConv.

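Research motivation and objectives

Investigate whether Transformers can perform image matching and metric learning on image pairs for generalizable person re-identification.
Assess the limitations of ViT (Vision Transformer) and the vanilla Transformer for cross-image matching.
Propose a lightweight, similarity-focused decoder for cross-image matching.
Evaluate generalization on standard and synthetic datasets, and compare against state-of-the-art methods.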

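Proposed method

Extract features from the query and gallery images with a ResNet backbone.
Encode the query and the gallery separately with Transformer encoders, yielding Q_n and K_n.
Apply a simplified decoder that computes query-gallery similarity from the transformed features through a shared fully connected layer, followed by global max pooling and an MLP head that produces the pairwise score.
Introduce a learnable prior score embedding that weights the local similarity matches.
Fuse the decoder outputs across the N layers for residual similarity learning.
Train with a pairwise metric learning objective that follows the QAConv-GS framework.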
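The decoder steps listed above can be sketched in plain Python for a single query-gallery pair. This is only an illustration under stated assumptions, not the authors' implementation: the function names (`simplified_decoder_score`, `fuse_layers`), the sigmoid gating of the prior scores, the two-layer ReLU MLP, the averaging over query locations, and the summation across layers are all hypothetical choices made here to keep the example small. A real model would use a tensor library and learned parameters.

```python
import math


def linear(x, weight, bias):
    """Apply y = W x + b to one feature vector given as a list of floats."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def simplified_decoder_score(query, gallery, fc, prior, mlp):
    """Score one query/gallery pair (illustrative, not the reference code).

    query, gallery: lists of local feature vectors, one per spatial location.
    fc: (weight, bias) of the FC layer shared by both images.
    prior: learnable prior scores, prior[i][j] for query loc i, gallery loc j.
    mlp: ((w1, b1), (w2, b2)) head applied to each pooled similarity.
    """
    q = [linear(x, *fc) for x in query]
    k = [linear(x, *fc) for x in gallery]
    pooled = []
    for i, qi in enumerate(q):
        # Query-key dot products only: no softmax and no value aggregation.
        # Assumption: the prior weights each local match through a sigmoid.
        sims = [sum(a * b for a, b in zip(qi, kj)) * sigmoid(prior[i][j])
                for j, kj in enumerate(k)]
        # Global max pooling: keep the best-matching gallery location.
        pooled.append(max(sims))
    (w1, b1), (w2, b2) = mlp
    scores = []
    for s in pooled:
        hidden = [max(0.0, h) for h in linear([s], w1, b1)]  # ReLU
        scores.append(linear(hidden, w2, b2)[0])
    # Assumption: average the per-location scores into one pairwise score.
    return sum(scores) / len(scores)


def fuse_layers(query_layers, gallery_layers, layer_params):
    """Sum the per-layer scores (assumed fusion rule across the N layers)."""
    return sum(
        simplified_decoder_score(q, g, *params)
        for q, g, params in zip(query_layers, gallery_layers, layer_params)
    )


# Toy example: 2 locations, 2 channels, identity FC, neutral prior, identity MLP.
identity_fc = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
neutral_prior = [[0.0, 0.0], [0.0, 0.0]]  # sigmoid(0) = 0.5 for every pair
identity_mlp = (([[1.0]], [0.0]), ([[1.0]], [0.0]))
params = (identity_fc, neutral_prior, identity_mlp)

query = [[1.0, 0.0], [0.0, 1.0]]
shifted = [[0.0, 1.0], [1.0, 0.0]]  # same local features at swapped locations

print(simplified_decoder_score(query, shifted, *params))
print(fuse_layers([query, query], [shifted, shifted], [params, params]))
```

In this toy setup the gallery with swapped locations receives the same score as an unswapped copy of the query, because each query location searches all gallery locations for its best match. With a non-uniform prior, the location pair would also influence the score.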

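Experimental results

Research questions

RQ1: Can the Vision Transformer (ViT) or the vanilla Transformer generalize to explicit matching between image pairs for person re-identification?
RQ2: Can the naive solutions (query-gallery concatenation, or cross-attention with the query as input) improve cross-image matching?
RQ3: Does a simplified decoder focused on direct similarity computation improve the efficiency and performance of metric learning in Re-ID?
RQ4: How does cross-image interaction affect generalization across datasets and on synthetic data?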

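Key findings

TransMatcher achieves state-of-the-art performance in generalizable person re-identification on several datasets.
Trained on Market-1501 as the source, it gains 5.8% Rank-1 and 5.7% mAP on CUHK03-NP, and 6.1% Rank-1 and 3.4% mAP on MSMT17.
Trained on MSMT17 as the source, it gains 5.0% Rank-1 and 5.3% mAP on Market-1501; the summary also lists 6.1% Rank-1 and 3.4% mAP on MSMT17 for this setting (as stated).
Trained on RandPerson (synthetic data), it gains 3.3% Rank-1 and 5.3% mAP on Market-1501, and 5.9% Rank-1 and 3.3% mAP on MSMT17.
Compared with Transformer-Cross, TransMatcher matches across images markedly better (for example, on Market-1501 it is about 11% higher in Rank-1 and 9% higher in mAP).
Ablation studies indicate that the simplified decoder, the hard attention of GMP, and the prior score embedding all matter for reaching the best accuracy, whereas positional embeddings in the encoder may hurt performance in this design.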
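The findings above are reported as Rank-1 and mAP. As a reminder of what those two numbers measure, here is a minimal sketch that computes them from a query-by-gallery score matrix. The function names and the toy data are made up for illustration, and the sketch omits the filtering that Re-ID benchmarks typically apply (such as discarding gallery images of the same identity taken by the same camera as the query).

```python
def average_precision(ranked_matches):
    """AP for one query.

    ranked_matches[i] is True when the gallery image at rank i + 1
    shares the query's identity.
    """
    hits, precisions = 0, []
    for rank, is_match in enumerate(ranked_matches, start=1):
        if is_match:
            hits += 1
            precisions.append(hits / rank)  # precision at this recall point
    return sum(precisions) / len(precisions) if precisions else 0.0


def evaluate(scores, query_ids, gallery_ids):
    """Return (Rank-1, mAP); scores[i][j] is higher for more similar pairs."""
    rank1_hits, aps = 0, []
    for row, qid in zip(scores, query_ids):
        # Sort gallery indices from the highest to the lowest matching score.
        order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        matches = [gallery_ids[j] == qid for j in order]
        rank1_hits += int(matches[0])
        aps.append(average_precision(matches))
    n = len(query_ids)
    return rank1_hits / n, sum(aps) / n


# Toy data: two queries, three gallery images (hypothetical identities).
gallery_ids = ["a", "b", "a"]
query_ids = ["a", "b"]
scores = [
    [0.9, 0.2, 0.7],  # query "a": both "a" images ranked first
    [0.8, 0.3, 0.1],  # query "b": the "b" image is ranked second
]

rank1, mean_ap = evaluate(scores, query_ids, gallery_ids)
print(rank1, mean_ap)
```

Rank-1 only checks the top-ranked gallery image, while mAP rewards ranking every correct gallery image early, which is why the two metrics can move by different amounts on the same dataset.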


This review was generated by AI and reviewed by human editors.