QUICK REVIEW

[Paper Review] Multimodal Unsupervised Image-to-Image Translation

Xun Huang, Ming-Yu Liu|arXiv (Cornell University)|Apr 12, 2018

Generative Adversarial Networks and Image Synthesis|References 74|Cited by 297

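One-Sentence Summary

This paper introduces MUNIT, a framework for multimodal unsupervised image-to-image translation. It decomposes an image into a content code shared across domains and a domain-specific style code, which enables diverse and controllable translation without paired data.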

ABSTRACT

Unsupervised image-to-image translation is an important and challenging problem in computer vision. Given an image in the source domain, the goal is to learn the conditional distribution of corresponding images in the target domain, without seeing any pairs of corresponding images. While this conditional distribution is inherently multimodal, existing approaches make an overly simplified assumption, modeling it as a deterministic one-to-one mapping. As a result, they fail to generate diverse outputs from a given source domain image. To address this limitation, we propose a Multimodal Unsupervised Image-to-image Translation (MUNIT) framework. We assume that the image representation can be decomposed into a content code that is domain-invariant, and a style code that captures domain-specific properties. To translate an image to another domain, we recombine its content code with a random style code sampled from the style space of the target domain. We analyze the proposed framework and establish several theoretical results. Extensive experiments with comparisons to the state-of-the-art approaches further demonstrates the advantage of the proposed framework. Moreover, our framework allows users to control the style of translation outputs by providing an example style image. Code and pretrained models are available at https://github.com/nvlabs/MUNIT

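Motivation and Objectives

Address the lack of diversity in unsupervised image-to-image translation by modeling multimodal outputs.
Propose a disentangled content-style representation in which the content code is shared across domains and the style code is domain-specific.
Enable example-guided translation by conditioning on a style image from the target domain.
Provide a theoretical analysis of the model's properties, including latent distribution matching, joint distribution matching, and a weak form of cycle consistency (style-augmented cycle consistency).
Demonstrate higher image quality and greater diversity than state-of-the-art methods on multiple datasets.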

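Proposed Method

Decompose each image into a shared content code and a domain-specific style code.
Translate by recombining the source image's content code with a style code randomly sampled from the style space of the target domain.
Train with a combination of adversarial losses and bidirectional reconstruction losses, so that translated images match the target distribution and the encoders and decoders are inverses of each other.
Use an AdaIN-based decoder whose style-conditioned affine parameters are generated from the style code by an MLP.
At the optimum, the content distributions of the two domains match (domain-invariant content), and the bidirectional reconstruction losses imply style-augmented cycle consistency.
Evaluate with human preference, LPIPS diversity, and a Conditional Inception Score (CIS) designed for multimodal outputs.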
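
The recombination step and the AdaIN conditioning described above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the authors' implementation: `style_mlp` is a hypothetical single linear layer standing in for the MLP, the style code is drawn from a standard Gaussian (an assumption about the style prior), feature maps are plain lists, and the convolutional layers of the real decoder are omitted.

```python
import math
import random

def adain(feature_map, gamma, beta, eps=1e-5):
    """Adaptive instance normalization for a single image.

    feature_map: list of channels, each a flat list of activations.
    gamma, beta: per-channel scale and shift derived from the style code.
    """
    out = []
    for channel, g, b in zip(feature_map, gamma, beta):
        mean = sum(channel) / len(channel)
        var = sum((v - mean) ** 2 for v in channel) / len(channel)
        std = math.sqrt(var + eps)
        # Normalize each channel, then apply the style-dependent affine map.
        out.append([g * (v - mean) / std + b for v in channel])
    return out

def style_mlp(style_code, weights, biases):
    """Hypothetical one-layer stand-in for the MLP: style code -> (gamma, beta)."""
    params = [sum(w * s for w, s in zip(row, style_code)) + b
              for row, b in zip(weights, biases)]
    half = len(params) // 2
    return params[:half], params[half:]

def translate(content_code, style_dim, weights, biases, rng):
    """Recombine a source content code with a randomly sampled target style code."""
    style_code = [rng.gauss(0.0, 1.0) for _ in range(style_dim)]  # assumed prior
    gamma, beta = style_mlp(style_code, weights, biases)
    return adain(content_code, gamma, beta)

# Usage: one content code, two random styles, two different outputs.
rng = random.Random(0)
num_channels, style_dim = 2, 3
content = [[1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 2.0, 2.0]]
weights = [[rng.uniform(-1, 1) for _ in range(style_dim)]
           for _ in range(2 * num_channels)]
biases = [1.0] * num_channels + [0.0] * num_channels
out_a = translate(content, style_dim, weights, biases, rng)
out_b = translate(content, style_dim, weights, biases, rng)
```

Because the content code is fixed and only the style code changes between the two calls, `out_a` and `out_b` share the same normalized structure but differ in per-channel scale and shift. In this toy setting, that difference is what makes the output multimodal.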

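Experimental Results

Research Questions

RQ1 Can unsupervised image translation be multimodal, reflecting the diversity of appearances in the target domain?
RQ2 Can content be shared (domain-invariant) while style remains domain-specific, so as to support many-to-many mappings?
RQ3 Can the translation style be controlled with an example image, without paired data?
RQ4 At the optimum, do the learned latent and joint distributions match the theoretical predictions?
RQ5 Is the method competitive with supervised and unsupervised baselines in quality and diversity?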

Key Findings

Method	Quality (edges→shoes)	Diversity (edges→shoes)	Quality (edges→handbags)	Diversity (edges→handbags)	Notes
UNIT [15]		0.011	0.0	0.023
CycleGAN [8]		0.010	0.0	0.012
CycleGAN* [8] with noise		0.016	0.0	0.011
MUNIT w/o Lx recon	0.0	0.213	0.0	0.191
MUNIT w/o Lc recon	0.0	0.172	0.0	0.185
MUNIT w/o Ls recon	0.0	0.070	0.0	0.139
MUNIT	0.0	0.109	0.0	0.175
BicycleGAN [11] †	0.0	0.104	0.0	0.140	trained with paired supervision
Real data	N/A	0.293	N/A	0.371
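
A diversity score of this kind can be computed as the average distance between pairs of translations of the same input. The sketch below shows that computation under stated assumptions: `l1_distance` is a placeholder for LPIPS (which requires a pretrained network), images are flat lists of numbers, and all unordered pairs are used because this review does not state the pair-sampling protocol.

```python
from itertools import combinations

def l1_distance(a, b):
    """Placeholder metric (mean absolute difference); stands in for LPIPS."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def pairwise_diversity(outputs, distance=l1_distance):
    """Average distance over all unordered pairs of translations of one input."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        raise ValueError("need at least two outputs per input")
    return sum(distance(a, b) for a, b in pairs) / len(pairs)

def dataset_diversity(outputs_per_input, distance=l1_distance):
    """Average the per-input diversity over all inputs."""
    scores = [pairwise_diversity(outs, distance) for outs in outputs_per_input]
    return sum(scores) / len(scores)

# A deterministic translator returns the same image every time: diversity 0.
deterministic = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
# A multimodal translator returns different images for the same input.
multimodal = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
scores = (pairwise_diversity(deterministic), pairwise_diversity(multimodal))
```

Under this definition a one-to-one mapping scores zero by construction, which is consistent with the near-zero diversity reported for UNIT and CycleGAN in the table.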

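MUNIT generates diverse, high-quality translations without paired data and outperforms unsupervised baselines on several tasks.
On the animal image translation tasks, the model achieves high CIS and IS, indicating strong quality and diversity.
Ablation studies show that removing a reconstruction loss degrades quality or diversity; for example, dropping the style reconstruction loss lowers edges→shoes diversity from 0.109 to 0.070. In diversity, the full MUNIT (0.109 and 0.175) matches or exceeds the supervised BicycleGAN (0.104 and 0.140).
Example-guided translation is supported: a style image controls the target style.
The theoretical results show that, at the optimum, the latent distributions match, the joint distributions are consistent, and a style-augmented cycle consistency constraint holds.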
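
The loss terms ablated in the table (Lx, Lc, Ls) and the style-augmented cycle consistency property can be written out as follows. This is a sketch in notation assumed here rather than taken from this review: $E_i^c$ and $E_i^s$ are the content and style encoders of domain $i$, $G_i$ is its decoder, $D_i$ its discriminator, $q(s_2)$ is the style prior of domain 2, and $\lambda_x, \lambda_c, \lambda_s$ are weights whose values this review does not give. Only the 1→2 direction is shown; the other direction is assumed symmetric, and the use of the L1 norm is likewise an assumption.

```latex
\begin{align}
\mathcal{L}_{\mathrm{recon}}^{x_1} &= \mathbb{E}_{x_1 \sim p(x_1)}
  \big[ \lVert G_1(E_1^c(x_1), E_1^s(x_1)) - x_1 \rVert_1 \big] \\
\mathcal{L}_{\mathrm{recon}}^{c_1} &= \mathbb{E}_{c_1 \sim p(c_1),\, s_2 \sim q(s_2)}
  \big[ \lVert E_2^c(G_2(c_1, s_2)) - c_1 \rVert_1 \big] \\
\mathcal{L}_{\mathrm{recon}}^{s_2} &= \mathbb{E}_{c_1 \sim p(c_1),\, s_2 \sim q(s_2)}
  \big[ \lVert E_2^s(G_2(c_1, s_2)) - s_2 \rVert_1 \big] \\
\mathcal{L}_{\mathrm{GAN}}^{x_2} &= \mathbb{E}_{c_1 \sim p(c_1),\, s_2 \sim q(s_2)}
  \big[ \log\big(1 - D_2(G_2(c_1, s_2))\big) \big]
  + \mathbb{E}_{x_2 \sim p(x_2)} \big[ \log D_2(x_2) \big] \\
\min_{E_1, E_2, G_1, G_2} \; \max_{D_1, D_2} \;\;
  & \mathcal{L}_{\mathrm{GAN}}^{x_1} + \mathcal{L}_{\mathrm{GAN}}^{x_2}
  + \lambda_x \big( \mathcal{L}_{\mathrm{recon}}^{x_1} + \mathcal{L}_{\mathrm{recon}}^{x_2} \big)
  + \lambda_c \big( \mathcal{L}_{\mathrm{recon}}^{c_1} + \mathcal{L}_{\mathrm{recon}}^{c_2} \big)
  + \lambda_s \big( \mathcal{L}_{\mathrm{recon}}^{s_1} + \mathcal{L}_{\mathrm{recon}}^{s_2} \big) \\
F_{1 \to 2}(x_1, s_2) &= \big( G_2(E_1^c(x_1), s_2),\; E_1^s(x_1) \big),
  \qquad F_{2 \to 1} = F_{1 \to 2}^{-1} \text{ at the optimum}
\end{align}
```

Read this way, style-augmented cycle consistency is weaker than ordinary cycle consistency: an image is only required to be recoverable when its own style code is carried along with the translation, which leaves room for many valid translations of the same input.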


This review was generated by AI and reviewed by human editors.