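[Paper Walkthrough] Training Vision Transformers for Image Retrieval
This paper shows that vision transformers can be trained effectively for image retrieval using a Siamese transformer architecture with a contrastive loss and a differential entropy regularizer, reaching state-of-the-art results on category-level retrieval and strong results on particular object retrieval.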
Transformers have shown outstanding results for natural language understanding and, more recently, for image classification. We here extend this work and propose a transformer-based approach for image retrieval: we adopt vision transformers for generating image descriptors and train the resulting model with a metric learning objective, which combines a contrastive loss with a differential entropy regularizer. Our results show consistent and significant improvements of transformers over convolution-based approaches. In particular, our method outperforms the state of the art on several public benchmarks for category-level retrieval, namely Stanford Online Product, In-Shop and CUB-200. Furthermore, our experiments on ROxford and RParis also show that, in comparable settings, transformers are competitive for particular object retrieval, especially in the regime of short vector representations and low-resolution images.
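As a reading aid, the combined objective mentioned in the abstract can be written roughly as below. This is a sketch under stated assumptions: the symbols $z_i$ (L2-normalized descriptors), $y_i$ (class labels), $\beta$ (margin), $\lambda$ (regularizer weight) and $\rho_i$ (nearest-neighbour distance within the batch) are notation chosen here, and the entropy term is shown as a Kozachenko-Leonenko style nearest-neighbour estimator, which is an assumption about the exact form of the regularizer.

```latex
\begin{align}
\mathcal{L}_{\text{contr}} &= \frac{1}{N} \sum_{i=1}^{N} \Bigg[ \sum_{j:\, y_j = y_i} \big(1 - z_i^\top z_j\big)
  + \sum_{j:\, y_j \neq y_i} \big[\, z_i^\top z_j - \beta \,\big]_+ \Bigg] \\
\mathcal{L}_{\text{ent}} &= -\frac{1}{N} \sum_{i=1}^{N} \log \rho_i,
  \qquad \rho_i = \min_{j \neq i} \lVert z_i - z_j \rVert \\
\mathcal{L} &= \mathcal{L}_{\text{contr}} + \lambda\, \mathcal{L}_{\text{ent}}
\end{align}
```

Minimizing the entropy term pushes each descriptor away from its nearest neighbour, which is one way to discourage the batch from collapsing into a small region of the embedding space.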
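Research motivation and goals
- Show that a plain vision transformer can be used for image retrieval with competitive accuracy
- Study how metric learning losses interact with transformer backbones
- Assess whether differential entropy regularization improves utilization of the embedding space
- Establish state-of-the-art results for category-level retrieval on SOP, CUB-200-2011, and In-Shop
- Evaluate particular object retrieval performance on the Oxford and Paris datasets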
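Proposed method
- Use a Siamese vision transformer architecture (IRT) to map image pairs into a common embedding space
- Use a contrastive loss with cross-batch memory as the main supervisory signal
- Augment the contrastive loss with a differential entropy regularization term that encourages descriptors to spread uniformly over the embedding space
- Experiment with off-the-shelf ViT features (IRT_O), fine-tuning with the contrastive loss (IRT_L), and fine-tuning with the contrastive loss plus the entropy regularizer (IRT_R)
- Explore pooling variants (CLS token, average, max, GeM) and dimensionality reduction (PCA) to obtain compact descriptors
- Train and evaluate with standard retrieval metrics on SOP, CUB-200-2011, and In-Shop (category level) and on Oxford/Paris (particular object)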
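The training objective in the list above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function names, the margin of 0.5, the weight `lam=0.7`, and the use of a nearest-neighbour (Kozachenko-Leonenko style) entropy estimator are all assumptions made here, and the cross-batch memory is omitted for brevity.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row of x to unit L2 norm."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)

def contrastive_loss(z, labels, margin=0.5):
    """Pull same-label pairs together; penalize different-label pairs
    whose cosine similarity exceeds the margin (hypothetical default)."""
    sim = z @ z.T                                  # cosine similarities
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(z), dtype=bool)         # exclude self-pairs
    pos = (1.0 - sim) * (same & off_diag)
    neg = np.maximum(sim - margin, 0.0) * (~same)
    return (pos.sum() + neg.sum()) / len(z)

def entropy_regularizer(z, eps=1e-8):
    """Negative mean log distance to the nearest neighbour in the batch.
    Lower values mean the descriptors are more spread out."""
    dist = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)                 # ignore self-distance
    nearest = dist.min(axis=1)
    return -np.mean(np.log(nearest + eps))

def total_loss(embeddings, labels, margin=0.5, lam=0.7):
    """Contrastive loss plus weighted entropy regularizer
    (margin and lam are illustrative, not taken from the source)."""
    z = l2_normalize(embeddings)
    return contrastive_loss(z, labels, margin) + lam * entropy_regularizer(z)
```

In a real training loop the same computation would be written in an autodiff framework so that gradients flow back into the transformer; the NumPy version only shows how the two terms are formed from a batch of descriptors.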
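Experimental results
Research questions
- RQ1: Can a pure Vision Transformer backbone trained only with metric learning match or outperform convolutional baselines on category-level image retrieval?
- RQ2: Does fine-tuning a ViT with a contrastive loss improve retrieval performance over off-the-shelf ViT features?
- RQ3: Does adding a differential entropy regularization term to the contrastive loss further improve embedding-space utilization and retrieval accuracy?
- RQ4: How do transformer-based descriptors compare with convolutional descriptors on particular object retrieval across descriptor sizes and image resolutions?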
Main findings
| Method | Backbone | Descriptor dim. | SOP Recall@1 | CUB Recall@1 | In-Shop Recall@1 |
|---|---|---|---|---|---|
| IRT_R (ours) | DeiT-S | 128 | 83.4 | 93.0 | 97.0 |
| IRT_R (ours) | DeiT-S | 384 | 84.0 | 93.6 | 97.2 |
- IRT_R with DeiT-S backbones achieves state-of-the-art Recall@1 on SOP, outperforming prior methods by a notable margin
- On CUB-200-2011, DeiT-S 384 with regularized training outperforms prior art at Recall@1
- For In-Shop, DeiT-S 384 yields superior Recall@1 versus prior convnet-based methods
- In particular object retrieval, DeiT-S and DeiT-B variants outperform ResNet-50/101 at 224×224 and scale well to 384×384, at a comparable FLOPs budget
- Differential entropy regularization improves performance across benchmarks and mitigates feature collapse observed with plain contrastive loss
- Transformers show robustness to feature collapse and can match or exceed convnets at comparable capacity and resolutions
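The Recall@1 figures reported above follow the usual retrieval protocol: a query counts as a hit if at least one of its K nearest neighbours shares its label. A minimal NumPy sketch is given below. It assumes that every image is used as a query against all the others, which is a simplification (benchmarks with separate query and gallery sets would need two embedding matrices), and the function name `recall_at_k` is chosen here for illustration.

```python
import numpy as np

def recall_at_k(embeddings, labels, k=1):
    """Fraction of queries with at least one same-label item among
    their k nearest neighbours (cosine similarity, self excluded)."""
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T
    np.fill_diagonal(sim, -np.inf)            # a query never retrieves itself
    topk = np.argsort(-sim, axis=1)[:, :k]    # indices of the k most similar items
    hits = (labels[topk] == labels[:, None]).any(axis=1)
    return hits.mean()

# Toy usage: two tight clusters, one per class.
emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
print(recall_at_k(emb, np.array([0, 0, 1, 1]), k=1))
```

For large test sets the full similarity matrix would not fit in memory, so a practical evaluation would process queries in chunks or use an approximate nearest-neighbour index.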
This walkthrough was generated by AI and reviewed by a human editor.