QUICK REVIEW

[Paper Review] VSE++: Improved Visual-Semantic Embeddings.

Fartash Faghri, David J. Fleet|arXiv (Cornell University)|Jul 18, 2017

Multimodal Machine Learning Applications|Cited by 143

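One-Sentence Summary

This paper proposes VSE++, which replaces the original rank loss with a loss that penalizes only the hardest negative for each positive pair, improving caption-retrieval R@1 on MS-COCO by 21% over the original formulation (surpassing the prior state of the art) and more than doubling the R@1 reported by Kiros et al. (2014) on Flickr30K.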

ABSTRACT

This paper investigates the problem of image-caption retrieval using joint visual-semantic embeddings. We introduce a very simple change to the loss function used in the original formulation by Kiros et al. (2014), which leads to drastic improvements in the retrieval performance. In particular, the original paper uses the rank loss which computes the sum of violations across the negative training examples. Instead, we penalize the model according to the hardest negative examples. We then make several additional modifications according to the current best practices in image-caption retrieval. We showcase our model on the MS-COCO and Flickr30K datasets through comparisons and ablation studies. On MS-COCO, we improve caption retrieval by 21% in R@1 with respect to the original formulation. Our results outperform the state-of-the-art results by 8.8% in caption retrieval and 11.3% in image retrieval at R@1. On Flickr30K, we more than double R@1 as reported by Kiros et al. (2014) in both image and caption retrieval, and achieve near state-of-the-art performance. We further show that similar improvements also apply to the Order-embeddings by Vendrov et al. (2015) which builds on a similar loss function.

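Motivation and Objectives

Improve image-caption retrieval with joint visual-semantic embeddings.
Address a limitation of the rank loss in the original VSE, which sums the violations over all negative examples.
Investigate the effect of focusing on the hardest negatives during training.
Apply current best practices in image-caption retrieval to reach state-of-the-art results.
Show that the approach generalizes to related models such as Order-embeddings.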
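
The contrast between the two losses can be written out explicitly. The following is a sketch in assumed notation, none of which is defined in this summary: s(i, c) is the similarity between image i and caption c, \alpha is the margin, [x]_+ = \max(x, 0), and primed symbols range over negatives for the positive pair (i, c).

```latex
% Sum of hinges: every violating negative contributes to the loss.
\ell_{\mathrm{sum}}(i, c) =
    \sum_{c'} \bigl[\alpha + s(i, c') - s(i, c)\bigr]_+
  + \sum_{i'} \bigl[\alpha + s(i', c) - s(i, c)\bigr]_+

% Max of hinges: only the hardest negative in each direction contributes.
\ell_{\mathrm{max}}(i, c) =
    \max_{c'} \bigl[\alpha + s(i, c') - s(i, c)\bigr]_+
  + \max_{i'} \bigl[\alpha + s(i', c) - s(i, c)\bigr]_+
```

Because each hinge term is non-negative, the max form is never larger than the sum form, and the two are equal when at most one negative per direction violates the margin.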

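Proposed Method

Replace the original rank loss with a hard-negative loss that penalizes only the hardest negative for each positive pair.
Apply standard deep-learning improvements for vision-language tasks, such as in-batch hard example mining and normalization.
Use a two-branch architecture, with one encoder for images and one for captions, to embed both into a shared space.
Optimize the model with a margin-based ranking loss over hard negatives, enlarging the margin between positive and negative pairs.
Incorporate normalization and learning-rate scheduling to stabilize training and improve convergence.
Extend the approach to the Order-embeddings model to demonstrate broader applicability.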
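
To make the difference between the two losses concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes a square similarity matrix `sim` in which `sim[k][k]` is the positive pair and every other entry in row `k` or column `k` is an in-batch negative; the function name `ranking_losses` and the default margin of 0.2 are illustrative choices.

```python
def hinge(x):
    """Hinge function [x]_+ = max(x, 0)."""
    return max(0.0, x)


def ranking_losses(sim, margin=0.2):
    """Return (sum_of_hinges, max_of_hinges) for a batch.

    Assumption: sim[i][j] scores image i against caption j, and the
    diagonal entries sim[k][k] are the matching (positive) pairs.
    """
    n = len(sim)
    sum_loss = 0.0
    max_loss = 0.0
    for k in range(n):
        pos = sim[k][k]
        # Captions j != k act as negatives for image k (row k).
        cap_costs = [hinge(margin + sim[k][j] - pos) for j in range(n) if j != k]
        # Images i != k act as negatives for caption k (column k).
        img_costs = [hinge(margin + sim[i][k] - pos) for i in range(n) if i != k]
        # Sum form: every violating negative is penalized.
        sum_loss += sum(cap_costs) + sum(img_costs)
        # Max form: only the hardest negative in each direction is penalized.
        max_loss += max(cap_costs, default=0.0) + max(img_costs, default=0.0)
    return sum_loss, max_loss


if __name__ == "__main__":
    # Image 0 has two violating caption negatives; the other pairs are well separated.
    sim = [
        [0.5, 0.6, 0.7],
        [0.0, 0.9, 0.0],
        [0.0, 0.0, 0.9],
    ]
    print(ranking_losses(sim))
```

In this toy batch both off-diagonal captions in the first row violate the margin, so the sum form counts both while the max form counts only the larger one.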

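Experimental Results

Research Questions

RQ1: Does focusing on the hardest negative in the loss yield better retrieval performance than summing over all negatives?
RQ2: How does the proposed hard-negative strategy compare with the original rank loss on R@1 and R@5?
RQ3: Do the gains from hard negatives carry over to other models that use a similar loss function, such as Order-embeddings?
RQ4: How much do standard deep-learning best practices improve visual-semantic embeddings for image-caption retrieval?
RQ5: How large are the gains on benchmark datasets such as MS-COCO and Flickr30K?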

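Key Findings

On MS-COCO, VSE++ improves caption-retrieval R@1 by 21% over the original VSE formulation.
VSE++ achieves state-of-the-art results on MS-COCO, exceeding the previous best by 8.8% in caption retrieval and 11.3% in image retrieval at R@1.
On Flickr30K, VSE++ more than doubles the R@1 reported by Kiros et al. (2014) for the original VSE, in both image and caption retrieval.
Despite these large gains over the baseline, performance on Flickr30K is near, rather than above, the state of the art.
The hard-negative loss also yields similar improvements for the Order-embeddings model, indicating broader applicability.
Ablation studies confirm that the hard-negative loss is the main contributor to the gains, with changes to normalization and training strategy adding further improvement.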
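
The findings above are all stated in terms of R@K (recall at K), the fraction of queries whose correct match appears among the top K retrieved items. A minimal sketch of that metric follows. It assumes exactly one correct candidate per query, stored on the diagonal of `sim`, which is a simplification: benchmark datasets may pair each image with several captions. The function name `recall_at_k` is illustrative.

```python
def recall_at_k(sim, k):
    """Fraction of queries whose correct candidate ranks in the top k.

    Assumption: sim[q][j] scores query q against candidate j, and
    candidate q is the single correct match for query q.
    """
    hits = 0
    for q, scores in enumerate(sim):
        # Zero-based rank: how many candidates score strictly higher
        # than the correct one.
        rank = sum(1 for s in scores if s > scores[q])
        if rank < k:
            hits += 1
    return hits / len(sim)


if __name__ == "__main__":
    sim = [
        [0.9, 0.1, 0.2],  # correct candidate ranked first
        [0.8, 0.5, 0.1],  # correct candidate ranked second
        [0.3, 0.9, 0.6],  # correct candidate ranked second
    ]
    print(recall_at_k(sim, 1), recall_at_k(sim, 2))
```

For caption retrieval the rows would be images and the columns captions; for image retrieval the matrix is transposed.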

This review was generated by AI and reviewed by human editors.