QUICK REVIEW

[Paper Review] Joint Multimodal Learning with Deep Generative Models

Masahiro Suzuki, Kotaro Nakayama, Yutaka Matsuo | arXiv (Cornell University) | Nov 7, 2016

Speech and dialogue systems | Cited by 123

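One-Sentence Summary

The paper proposes the joint multimodal variational autoencoder (JMVAE), which models the joint distribution of multiple modalities and can generate them in both directions, and introduces JMVAE-kl to keep the inferred latent variable from collapsing when a modality is missing.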

ABSTRACT

We investigate deep generative models that can exchange multiple modalities bi-directionally, e.g., generating images from corresponding texts and vice versa. Recently, some studies handle multiple modalities on deep generative models, such as variational autoencoders (VAEs). However, these models typically assume that modalities are forced to have a conditioned relation, i.e., we can only generate modalities in one direction. To achieve our objective, we should extract a joint representation that captures high-level concepts among all modalities and through which we can exchange them bi-directionally. As described herein, we propose a joint multimodal variational autoencoder (JMVAE), in which all modalities are independently conditioned on joint representation. In other words, it models a joint distribution of modalities. Furthermore, to be able to generate missing modalities from the remaining modalities properly, we develop an additional method, JMVAE-kl, that is trained by reducing the divergence between JMVAE's encoder and prepared networks of respective modalities. Our experiments show that our proposed method can obtain appropriate joint representation from multiple modalities and that it can generate and reconstruct them more properly than conventional VAEs. We further demonstrate that JMVAE can generate multiple modalities bi-directionally.

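Motivation and Objectives

Learn a joint representation that captures high-level concepts shared across modalities such as images and text.
Develop a generative model that exchanges modalities bi-directionally by modeling p(x, w) rather than only p(x|w) or p(w|x).
Provide a mechanism for handling missing modalities at generation time so that the inferred latent variable does not collapse.
Show that the joint representation improves generation and reconstruction quality on multimodal datasets.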

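Proposed Method

Defines a joint multimodal VAE (JMVAE) in which each modality is conditioned independently on a shared latent variable z, so the joint model factorizes as p(x, w, z) = p(x|z) p(w|z) p(z).
Trains the encoder and decoders with variational inference to maximize a lower bound on log p(x, w).
Proposes JMVAE-kl, which adds a KL-divergence regularizer, weighted by a parameter α, that aligns the unimodal encoders q(z|x) and q(z|w) with the multimodal encoder q(z|x, w).
Relates the objective to the variation of information (VI) to justify bi-directional exchange, interpreting training as minimizing VI.
Extends the model to more than two modalities and discusses practical training with modality-specific architectures (e.g., Gaussian, Bernoulli, and CNN-based decoders).
Runs experiments on MNIST and CelebA, including a JMVAE-GAN variant that improves image generation quality.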
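
The objective described in the list above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation. It assumes diagonal Gaussian encoders, a standard normal prior p(z), and Bernoulli decoders for both modalities (a simplification; the review itself mentions Gaussian and CNN-based decoders as well). The function names `gaussian_kl`, `bernoulli_log_likelihood`, and `jmvae_kl_objective` are hypothetical and chosen only for this example.

```python
import numpy as np


def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) between diagonal Gaussians, summed over latent dimensions."""
    var_q, var_p = np.exp(logvar_q), np.exp(logvar_p)
    return 0.5 * np.sum(
        logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0,
        axis=-1,
    )


def bernoulli_log_likelihood(target, probs, eps=1e-7):
    """log p(target | z) for a Bernoulli decoder, summed over output dimensions."""
    probs = np.clip(probs, eps, 1.0 - eps)  # avoid log(0)
    return np.sum(
        target * np.log(probs) + (1.0 - target) * np.log(1.0 - probs),
        axis=-1,
    )


def jmvae_kl_objective(x, w, x_probs, w_probs, joint, enc_x, enc_w, alpha):
    """Single-sample estimate of the JMVAE-kl objective (to be maximized).

    joint, enc_x, enc_w: (mu, logvar) pairs for q(z|x,w), q(z|x), q(z|w).
    x_probs, w_probs: decoder outputs for p(x|z), p(w|z), computed from one
    z sampled from q(z|x,w). All of these are assumed to come from networks
    that are not shown here.
    """
    # Assumed prior p(z) = N(0, I): zero mean, zero log-variance.
    prior = (np.zeros_like(joint[0]), np.zeros_like(joint[1]))

    # JMVAE lower bound: E_q[log p(x|z)] + E_q[log p(w|z)] - KL(q(z|x,w) || p(z)).
    lower_bound = (
        bernoulli_log_likelihood(x, x_probs)
        + bernoulli_log_likelihood(w, w_probs)
        - gaussian_kl(*joint, *prior)
    )

    # JMVAE-kl regularizer: pull each unimodal encoder toward the joint encoder.
    alignment = gaussian_kl(*joint, *enc_x) + gaussian_kl(*joint, *enc_w)

    # alpha = 0 recovers the plain JMVAE lower bound.
    return lower_bound - alpha * alignment
```

With `alpha=0.0` the function returns the plain JMVAE lower bound. Increasing `alpha` penalizes any mismatch between q(z|x, w) and the unimodal encoders, which is the mechanism the review credits for better generation when one modality is missing at test time.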

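Experimental Results

Research Questions

RQ1: Can a joint latent representation learned from multiple modalities support accurate generation and reconstruction of each modality?
RQ2: Does conditioning each modality independently on a shared latent variable enable bi-directional generation (x from w, and w from x) better than conditional VAEs do?
RQ3: How does JMVAE-kl affect sample quality when one or more modalities are missing at test time?
RQ4: Does the method extend to modalities that differ greatly in dimensionality and structure (e.g., images and binary attributes)?

Key Findings

JMVAE extracts a joint representation and, on MNIST and CelebA, improves on or matches the log-likelihood of single-modality models.
JMVAE generates modalities in both directions, including images from attributes and attributes from images.
The JMVAE-kl variant substantially reduces sample collapse when a modality is missing and improves both conditional and marginal log-likelihoods.
On CelebA, JMVAE and its GAN-augmented variant outperform competing multimodal models on both marginal and conditional log-likelihood.
Joint multimodal learning gives better qualitative results than single-modality baselines (e.g., attribute-based face generation).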

This review was generated by AI and checked by human editors.