QUICK REVIEW

[Paper Review] Dual-Path Convolutional Image-Text Embedding

Zhedong Zheng, Liang Zheng|arXiv (Cornell University)|Nov 15, 2017

Multimodal Machine Learning Applications | References: 36 | Cited by: 47

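One-Sentence Summary

This paper proposes a dual-path convolutional neural network for joint image-text embedding, in which both paths are end-to-end trainable CNNs built from convolution, pooling, ReLU, and batch normalisation, so that visual and textual features are optimised jointly. An instance loss used together with a large-margin objective gives state-of-the-art accuracy on language person retrieval and competitive results on Flickr30k and MSCOCO.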

ABSTRACT

This paper considers the task of matching images and sentences. The challenge consists in discriminatively embedding the two modalities onto a shared visual-textual space. Existing work in this field largely uses Recurrent Neural Networks (RNN) for text feature learning and employs off-the-shelf Convolutional Neural Networks (CNN) for image feature extraction. Our system, in comparison, differs in two key aspects. Firstly, we build a convolutional network amenable for fine-tuning the visual and textual representations, where the entire network only contains four components, i.e., convolution layer, pooling layer, rectified linear unit function (ReLU), and batch normalisation. End-to-end learning allows the system to directly learn from the data and fully utilise the supervisions. Secondly, we propose instance loss according to viewing each multimodal data pair as a class. This works with a large margin objective to learn the inter-modal correspondence between images and their textual descriptions. Experiments on two generic retrieval datasets (Flickr30k and MSCOCO) demonstrate that our method yields competitive accuracy compared to state-of-the-art methods. Moreover, in language person retrieval, we improve the state of the art by a large margin. Code is available at this https URL com/layumi/Image-Text-Embedding

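Motivation and Objectives

Address the challenge of discriminatively embedding images and sentences in a shared visual-textual space for image-sentence matching.
Overcome the limitations of existing pipelines that pair RNN-based text encoders with off-the-shelf CNN image features.
Enable end-to-end learning of visual and textual representations with a simple, fully convolutional architecture.
Improve cross-modal correspondence learning through a new instance loss combined with a large-margin objective.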

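Proposed Method

The model is a dual-path architecture: the image path and the text path are both built from the same four component types, namely convolution layers, pooling layers, ReLU activations, and batch normalisation.
The whole network is trainable end to end, so it learns directly from the data and makes full use of the supervision.
An instance loss treats each image-text pair as its own class, which encourages discriminative features.
The instance loss works together with a large-margin objective to learn the inter-modal correspondence between images and their textual descriptions.
Training uses backpropagation through both paths under these two objectives.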
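
The two objectives above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function names, the toy embeddings, the use of cosine similarity, the margin value, and the choice to classify both modalities with one shared weight matrix are all assumptions made for the example. The instance loss is written as softmax cross-entropy over pair identities, and the large-margin objective as a bidirectional hinge (ranking) loss, which is one common form of such an objective.

```python
import math

def softmax_cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]

def instance_loss(feature, class_weights, pair_id):
    """Classify one embedding into its own image-text pair ("instance") class."""
    logits = [sum(w * x for w, x in zip(row, feature)) for row in class_weights]
    return softmax_cross_entropy(logits, pair_id)

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def ranking_loss(img, txt, neg_img, neg_txt, margin=0.2):
    """Bidirectional hinge loss: the matched pair must beat each negative by `margin`."""
    pos = cosine(img, txt)
    image_to_text = max(0.0, margin - (pos - cosine(img, neg_txt)))
    text_to_image = max(0.0, margin - (pos - cosine(txt, neg_img)))
    return image_to_text + text_to_image

# Toy data (illustrative values only): 3 image-text pairs with 2-D embeddings.
image_feats = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
text_feats = [[0.9, 0.1], [0.1, 0.9], [-0.9, -0.1]]
# Assumption: one classifier row per pair, shared by the image and text paths.
class_weights = [[4.0, 0.0], [0.0, 4.0], [-4.0, 0.0]]

# Instance loss summed over both modalities of every pair.
total_instance = sum(
    instance_loss(image_feats[i], class_weights, i)
    + instance_loss(text_feats[i], class_weights, i)
    for i in range(3)
)

# Ranking loss for pair 0, using pair 1 as the negative image and sentence.
rank_matched = ranking_loss(image_feats[0], text_feats[0], image_feats[1], text_feats[1])
# The same loss when the "positive" sentence is actually the wrong one.
rank_mismatched = ranking_loss(image_feats[0], text_feats[1], image_feats[1], text_feats[0])
```

In this toy setup the matched pair already clears the margin, so `rank_matched` is zero, while the mismatched pairing is penalised. In a real training loop the two losses would be summed (possibly with weights) and minimised by backpropagation through both paths.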

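Experimental Results

Research Questions

RQ1: Can a fully convolutional network learn joint visual-textual representations effectively without an RNN?
RQ2: How does an instance loss with large-margin optimisation improve cross-modal matching compared with a standard contrastive loss?
RQ3: Does end-to-end training of a simple CNN architecture outperform models that rely on pretrained RNNs and off-the-shelf CNNs?
RQ4: How well does the method generalise across different retrieval tasks, including language person retrieval?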

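Main Findings

On the Flickr30k and MSCOCO retrieval benchmarks, the method reaches accuracy competitive with state-of-the-art methods.
On language person retrieval, it improves on the previous state of the art by a large margin.
End-to-end training gives better-aligned features than pipelines built on fixed, pretrained components.
The instance loss, combined with the large-margin objective, markedly strengthens the discriminative power of the embedding, particularly on fine-grained matching tasks.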
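
For readers unfamiliar with how image-sentence retrieval is scored, the sketch below computes Recall@K, a metric commonly used for benchmarks of this kind. The summary above does not name the metric, so treating it as the one behind these findings is an assumption. The function name and the similarity values are hypothetical, and the example simplifies to one ground-truth sentence per image, whereas Flickr30k and MSCOCO provide five captions per image.

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose ground-truth item (same index) ranks in the top k."""
    hits = 0
    for query_idx, scores in enumerate(similarity):
        # Candidate indices sorted from most to least similar to this query.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if query_idx in ranked[:k]:
            hits += 1
    return hits / len(similarity)

# Hypothetical image-to-text similarities: row i = image i, column j = sentence j.
similarity = [
    [0.9, 0.1, 0.2],  # image 0: correct sentence ranked first
    [0.3, 0.4, 0.8],  # image 1: correct sentence ranked second
    [0.2, 0.1, 0.7],  # image 2: correct sentence ranked first
]

r1 = recall_at_k(similarity, 1)  # 2 of 3 queries hit at rank 1
r2 = recall_at_k(similarity, 2)  # all 3 queries hit within the top 2
```

Text-to-image retrieval can be scored the same way by transposing the similarity matrix.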

This review was generated by AI and checked by a human editor.