QUICK REVIEW

[Paper Review] Multimodal Unsupervised Image-to-Image Translation

Xun Huang, Ming-Yu Liu|arXiv (Cornell University)|Apr 12, 2018

Generative Adversarial Networks and Image Synthesis|References 74|Citations 297

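One-Line Summary

This paper introduces MUNIT, a framework that decomposes an image into a shared content code and a domain-specific style code, enabling diverse and controllable translation without paired data.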

ABSTRACT

Unsupervised image-to-image translation is an important and challenging problem in computer vision. Given an image in the source domain, the goal is to learn the conditional distribution of corresponding images in the target domain, without seeing any pairs of corresponding images. While this conditional distribution is inherently multimodal, existing approaches make an overly simplified assumption, modeling it as a deterministic one-to-one mapping. As a result, they fail to generate diverse outputs from a given source domain image. To address this limitation, we propose a Multimodal Unsupervised Image-to-image Translation (MUNIT) framework. We assume that the image representation can be decomposed into a content code that is domain-invariant, and a style code that captures domain-specific properties. To translate an image to another domain, we recombine its content code with a random style code sampled from the style space of the target domain. We analyze the proposed framework and establish several theoretical results. Extensive experiments with comparisons to the state-of-the-art approaches further demonstrate the advantage of the proposed framework. Moreover, our framework allows users to control the style of translation outputs by providing an example style image. Code and pretrained models are available at https://github.com/nvlabs/MUNIT

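Motivation and Objectives

Address the lack of diversity in unsupervised image-to-image translation by modeling multimodal outputs.
Propose a disentangled content-style representation in which the content code is shared across domains and the style code is domain-specific.
Enable example-guided translation by conditioning on a style image from the target domain.
Theoretically analyze the model's latent distribution, joint distribution, and weak cycle-consistency properties.
Show better image quality and diversity than state-of-the-art methods on multiple datasets.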

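Proposed Method

Decompose each image into a shared content code and a domain-specific style code.
Translate by recombining the source image's content code with a style code sampled at random from the target domain's style space.
Combine adversarial losses with bidirectional reconstruction losses to match distributions and to train the encoders and decoders to be inverses of each other.
Use an AdaIN-based decoder with style-conditioned parameters: an MLP maps the style code to the affine transformation parameters of the AdaIN layers.
At the optimum, the content distribution becomes domain-invariant, and the bidirectional reconstruction losses imply style-augmented cycle consistency.
Evaluate with human preference, LPIPS diversity, and the Conditional Inception Score (CIS), a metric designed for multimodal outputs.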
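
As an illustration of the recombination and AdaIN steps above, the following is a minimal, standard-library-only sketch of AdaIN-style decoding. It is not the authors' implementation. It assumes a feature map stored as a list of channels, and a hypothetical single linear layer (`style_to_affine`) standing in for the MLP that maps a style code to the per-channel scale `gamma` and shift `beta`. All function names, shapes, and weights are illustrative assumptions.

```python
import math
import random
from statistics import fmean, pvariance

EPS = 1e-5  # numerical floor so a flat channel does not divide by zero


def instance_norm(channel):
    """Normalize one channel to zero mean and (approximately) unit variance."""
    mu = fmean(channel)
    sigma = math.sqrt(pvariance(channel, mu) + EPS)
    return [(v - mu) / sigma for v in channel]


def adain(content, gamma, beta):
    """AdaIN: normalize each channel, then apply a style-given scale and shift."""
    return [
        [g * v + b for v in instance_norm(channel)]
        for channel, g, b in zip(content, gamma, beta)
    ]


def style_to_affine(style, weights, bias):
    """Hypothetical one-layer stand-in for the MLP: style code -> (gamma, beta)."""
    out = [sum(w * s for w, s in zip(row, style)) + b for row, b in zip(weights, bias)]
    half = len(out) // 2
    return out[:half], out[half:]


def translate(content, style, weights, bias):
    """Recombine a content feature map with a style code (toy decoder)."""
    gamma, beta = style_to_affine(style, weights, bias)
    return adain(content, gamma, beta)


rng = random.Random(0)
content = [[0.0, 1.0, 2.0, 3.0], [5.0, 5.0, 6.0, 8.0]]  # 2 channels, 4 positions
style_dim, channels = 3, len(content)
# Random weights play the role of trained MLP parameters (assumption).
weights = [[rng.gauss(0, 1) for _ in range(style_dim)] for _ in range(2 * channels)]
bias = [1.0] * channels + [0.0] * channels

# Two style codes drawn from a standard normal prior, same content.
style_a = [rng.gauss(0, 1) for _ in range(style_dim)]
style_b = [rng.gauss(0, 1) for _ in range(style_dim)]
out_a = translate(content, style_a, weights, bias)
out_b = translate(content, style_b, weights, bias)
```

Because the content features are normalized before the style-derived scale and shift are applied, two different style codes give two different outputs from the same content, which is how a decoder of this kind can produce multimodal translations.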

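Experimental Results

Research Questions

RQ1: Can unsupervised image translation be extended to the multimodal setting so that it reflects the diverse appearances of the target domain?
RQ2: Can a representation with shared (domain-invariant) content and domain-specific style support many-to-many mappings?
RQ3: Can the translation style be controlled with an example image, without paired data?
RQ4: Do the learned latent and joint distributions match the theoretical expectations at optimality?
RQ5: Is the proposed method competitive with supervised and unsupervised baselines in quality and diversity?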

Key Findings

Method	Quality, human preference (edges→shoes)	Diversity, LPIPS (edges→shoes)	Quality, human preference (edges→handbags)	Diversity, LPIPS (edges→handbags)	Notes
UNIT [15]		0.011	0.0	0.023
CycleGAN [8]		0.010	0.0	0.012
CycleGAN* [8] with noise		0.016	0.0	0.011
MUNIT w/o Lx recon	0.0	0.213	0.0	0.191
MUNIT w/o Lc recon	0.0	0.172	0.0	0.185
MUNIT w/o Ls recon	0.0	0.070	0.0	0.139
MUNIT	0.0	0.109	0.0	0.175
BicycleGAN [11] †	0.0	0.104	0.0	0.140	trained with paired supervision
Real data	N/A	0.293	N/A	0.371
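
The ablation rows in the table each remove one reconstruction term. As a reminder of what those terms measure, here is a sketch of the bidirectional reconstruction losses for one translation direction. The notation is an assumption based on the MUNIT paper: $E^c_i$ and $E^s_i$ are the content and style encoders of domain $i$, $G_i$ is its decoder, and $q(s_2)$ is the style prior. The terms for the other direction are symmetric.

```latex
\begin{align}
\mathcal{L}^{x_1}_{\mathrm{recon}} &=
  \mathbb{E}_{x_1 \sim p(x_1)}
  \left[ \left\lVert G_1\big(E^c_1(x_1),\, E^s_1(x_1)\big) - x_1 \right\rVert_1 \right] \\
\mathcal{L}^{c_1}_{\mathrm{recon}} &=
  \mathbb{E}_{c_1 \sim p(c_1),\, s_2 \sim q(s_2)}
  \left[ \left\lVert E^c_2\big(G_2(c_1, s_2)\big) - c_1 \right\rVert_1 \right] \\
\mathcal{L}^{s_2}_{\mathrm{recon}} &=
  \mathbb{E}_{c_1 \sim p(c_1),\, s_2 \sim q(s_2)}
  \left[ \left\lVert E^s_2\big(G_2(c_1, s_2)\big) - s_2 \right\rVert_1 \right]
\end{align}
```

Read this way, "w/o Lx recon" drops the image reconstruction term, "w/o Lc recon" drops the content code term, and "w/o Ls recon" drops the style code term.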

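MUNIT generates diverse, high-quality translations without paired data and outperforms unsupervised baselines on multiple tasks.
On animal image translation it achieves high CIS and IS, indicating both high quality and high diversity.
The ablations show that removing any reconstruction loss degrades quality or diversity, while the full MUNIT matches or exceeds some supervised methods in certain settings.
Example-guided translation is supported: a style image can be used to control the target style.
The paper shows theoretically that, at optimality, the latent distributions and the joint distributions match, and that style-augmented cycle consistency holds.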
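
To make the CIS and IS figures mentioned above more concrete, here is a minimal sketch of how the two scores could be computed from classifier outputs. It assumes the definitions as we read them from the paper: CIS averages KL(p(y | translation) || p(y | input)) and IS averages KL(p(y | translation) || p(y)), both without exponentiation. The function names are hypothetical, the class probabilities are hand-written toy values rather than outputs of a real classifier, and every input is assumed to have the same number of translations.

```python
import math


def kl(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mean_dist(dists):
    """Element-wise average of a list of probability vectors."""
    n = len(dists)
    return [sum(col) / n for col in zip(*dists)]


def cis_and_is(preds):
    """preds[i][j] holds class probabilities for the j-th translation of input i."""
    flat = [p for per_input in preds for p in per_input]
    marginal = mean_dist(flat)  # p(y) over all translated outputs
    cis_terms, is_terms = [], []
    for per_input in preds:
        conditional = mean_dist(per_input)  # p(y | input i)
        cis_terms += [kl(p, conditional) for p in per_input]
        is_terms += [kl(p, marginal) for p in per_input]
    return sum(cis_terms) / len(cis_terms), sum(is_terms) / len(is_terms)


# A one-to-one model: every translation of a given input lands in the same class.
deterministic = [[[1.0, 0.0], [1.0, 0.0]], [[0.0, 1.0], [0.0, 1.0]]]
# A multimodal model: translations of a single input cover both classes.
diverse = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]

print(cis_and_is(deterministic))  # CIS is 0.0, IS is log(2)
print(cis_and_is(diverse))        # CIS and IS are both log(2)
```

In this toy setup both models reach the same IS, but only the multimodal one gets a nonzero CIS, which is why a score conditioned on the input is the more informative measure of per-input diversity.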

This review was generated by AI and checked by a human editor.