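[Paper Explained] CM-GANs: Cross-modal Generative Adversarial Networks for Common Representation Learning
CM-GANs learn discriminative cross-modal common representations using cross-modal GANs with weight-sharing autoencoders and dual discriminators, and report state-of-the-art cross-modal retrieval performance on multiple datasets.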
It is known that the inconsistent distributions and representations of different modalities, such as image and text, cause the heterogeneity gap, which makes it challenging to correlate such heterogeneous data. Generative adversarial networks (GANs) have shown a strong ability to model data distributions and learn discriminative representations, but existing GAN-based works mainly focus on the generative problem of producing new data. Our goal is different: we aim to correlate heterogeneous data by using the power of GANs to model the cross-modal joint distribution. Thus, we propose Cross-modal GANs (CM-GANs) to learn discriminative common representations for bridging the heterogeneity gap. The main contributions are: (1) A cross-modal GANs architecture is proposed to model the joint distribution over data of different modalities. Inter-modality and intra-modality correlation can be explored simultaneously in the generative and discriminative models, which compete with each other to promote cross-modal correlation learning. (2) Cross-modal convolutional autoencoders with a weight-sharing constraint are proposed to form the generative model. They not only exploit cross-modal correlation for learning the common representation, but also preserve reconstruction information for capturing semantic consistency within each modality. (3) A cross-modal adversarial mechanism is proposed, which uses two kinds of discriminative models to conduct intra-modality and inter-modality discrimination simultaneously. They boost each other through the adversarial training process to make the common representation more discriminative. To the best of our knowledge, our proposed CM-GANs approach is the first to utilize GANs to perform cross-modal common representation learning. Experiments on the cross-modal retrieval paradigm verify the performance of the proposed approach against 10 methods on 3 cross-modal datasets.
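Research motivation and goals
- Bridge the heterogeneity gap between the image and text modalities to enable cross-modal retrieval.
- Propose a cross-modal GAN framework that learns discriminative common representations by modeling the joint distribution.
- Preserve intra-modality semantic reconstruction while strengthening cross-modal correlation through adversarial training.
- Introduce weight-sharing cross-modal autoencoders to learn a shared representation while retaining modality-specific information.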
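Proposed method
- Introduces cross-modal convolutional autoencoders (G_I and G_T) whose final encoder layers share weights, to learn the common representations (s_p^i, s_p^t) and the reconstructed representations (r_p^i, r_p^t).
- Uses two kinds of discriminators in parallel: intra-modality discriminators (D_I, D_T) that distinguish original data from reconstructed data, and inter-modality discriminators (D_Ci, D_Ct) that operate on the cross-modal common representations.
- Proposes two adversarial losses, L_GAN1 for intra-modality reconstruction and L_GAN2 for inter-modality correlation, combined in a single minimax objective.
- Trains through a cross-modal adversarial process that alternately updates the discriminative and generative models, so that they boost each other in learning discriminative common representations.
- Uses weight sharing in the final encoder layers together with a softmax constraint to enforce semantic alignment between the modalities.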
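The structure described in the list above can be sketched in a few lines of plain Python. This is only an illustrative toy under stated assumptions, not the authors' implementation: the paper uses convolutional autoencoders, whereas the sketch uses tiny random linear layers; all dimensions, the one-layer logistic discriminators, and the helper names (`make_layer`, `encode`, `gan_log_loss`) are hypothetical; and the standard GAN log-loss is assumed, which may differ from the exact form of L_GAN1 and L_GAN2 in the paper. The real/fake roles assigned to the inter-modality discriminators are likewise an assumption.

```python
import math
import random

random.seed(0)

def make_layer(n_in, n_out):
    """Random weight matrix with n_out rows and n_in columns (biases omitted)."""
    return [[random.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]

def linear(weights, x):
    """Matrix-vector product."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def relu(x):
    return [max(0.0, v) for v in x]

def softmax(x):
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical sizes, chosen only for illustration.
IMG_DIM, TXT_DIM, HID_DIM, COMMON_DIM, N_CLASSES = 8, 6, 5, 4, 3

img_private = make_layer(IMG_DIM, HID_DIM)      # image-only encoder layer
txt_private = make_layer(TXT_DIM, HID_DIM)      # text-only encoder layer
shared_last = make_layer(HID_DIM, COMMON_DIM)   # final encoder layer, shared by both pathways
classifier = make_layer(COMMON_DIM, N_CLASSES)  # softmax layer on the common representation
img_decoder = make_layer(COMMON_DIM, IMG_DIM)
txt_decoder = make_layer(COMMON_DIM, TXT_DIM)

def encode(private, x):
    """Modality-specific layer followed by the shared final layer."""
    return linear(shared_last, relu(linear(private, x)))

def gan_log_loss(d_weights, real, fake):
    """Discriminator loss -[log D(real) + log(1 - D(fake))] for a logistic D (assumed form)."""
    p_real = sigmoid(sum(w * v for w, v in zip(d_weights, real)))
    p_fake = sigmoid(sum(w * v for w, v in zip(d_weights, fake)))
    return -(math.log(p_real) + math.log(1.0 - p_fake))

image = [random.random() for _ in range(IMG_DIM)]
text = [random.random() for _ in range(TXT_DIM)]

# Common representations (s_i, s_t) and reconstructions (r_i, r_t).
s_i, s_t = encode(img_private, image), encode(txt_private, text)
r_i, r_t = linear(img_decoder, s_i), linear(txt_decoder, s_t)
class_probs_i = softmax(linear(classifier, s_i))
class_probs_t = softmax(linear(classifier, s_t))

# One-layer logistic discriminators (hypothetical stand-ins for D_I, D_T, D_Ci, D_Ct).
d_img = [random.uniform(-0.5, 0.5) for _ in range(IMG_DIM)]
d_txt = [random.uniform(-0.5, 0.5) for _ in range(TXT_DIM)]
d_ci = [random.uniform(-0.5, 0.5) for _ in range(COMMON_DIM)]
d_ct = [random.uniform(-0.5, 0.5) for _ in range(COMMON_DIM)]

# Intra-modality: original data is "real", its reconstruction is "fake".
loss_intra = gan_log_loss(d_img, image, r_i) + gan_log_loss(d_txt, text, r_t)
# Inter-modality (assumption): each pathway treats its own common representation
# as "real" and the other modality's as "fake".
loss_inter = gan_log_loss(d_ci, s_i, s_t) + gan_log_loss(d_ct, s_t, s_i)

print(len(s_i), len(s_t), loss_intra > 0, loss_inter > 0)
```

Because both pathways end in the same `shared_last` matrix, the image and text representations land in one space of equal dimension, which is what allows a single softmax layer and the inter-modality discriminators to operate on either modality. In a full training loop the discriminators would be updated to decrease these losses while the autoencoders are updated to increase them, alternating between the two.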
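Experimental results
Research questions
- RQ1: Can a GAN-based architecture learn discriminative common representations that correlate heterogeneous data from different modalities (image and text)?
- RQ2: Does cross-modal adversarial training with intra-modality and inter-modality discriminators improve cross-modal retrieval performance?
- RQ3: Do weight-sharing cross-modal autoencoders effectively preserve intra-modality semantics while capturing cross-modal correlation?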
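Key findings
- CM-GANs achieve the best retrieval accuracy on three datasets when compared with 10 state-of-the-art cross-modal retrieval methods.
- The approach is shown to be effective on cross-modal retrieval tasks on Wikipedia, Pascal Sentence, and the authors' own XMediaNet dataset.
- Cross-modal convolutional autoencoders with weight sharing are shown to capture cross-modal correlation while maintaining semantic consistency within each modality.
- The proposed cross-modal adversarial mechanism is validated as a means of improving discriminative common representation learning.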
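To make the retrieval setting behind these findings concrete, the sketch below ranks a text gallery against image queries by cosine similarity in a common space and scores the ranking with mean average precision (MAP). It is a minimal illustration under assumptions: this summary does not state which metric or distance the paper uses, MAP with cosine similarity is simply a common choice for cross-modal retrieval, and the 2-d vectors, labels, and function names are made up for the example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank_gallery(query, gallery):
    """Gallery indices sorted by descending similarity to the query."""
    return sorted(range(len(gallery)),
                  key=lambda j: cosine(query, gallery[j]),
                  reverse=True)

def average_precision(query_label, ranked_labels):
    """Mean of precision@k over the ranks k where a relevant item appears."""
    hits, total = 0, 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label == query_label:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def mean_average_precision(queries, query_labels, gallery, gallery_labels):
    """Average AP over all queries; an item is relevant if it shares the query's label."""
    aps = []
    for query, label in zip(queries, query_labels):
        ranked = [gallery_labels[j] for j in rank_gallery(query, gallery)]
        aps.append(average_precision(label, ranked))
    return sum(aps) / len(aps)

# Made-up common representations: two image queries, three text gallery items.
image_queries = [[1.0, 0.0], [0.0, 1.0]]
image_labels = [0, 1]
text_gallery = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.3]]
text_labels = [1, 1, 0]

map_score = mean_average_precision(image_queries, image_labels,
                                   text_gallery, text_labels)
print(round(map_score, 4))
```

Swapping the roles of the two lists gives the text-to-image direction; both directions are typically evaluated in cross-modal retrieval.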
This explainer was generated by AI and reviewed by a human editor.