QUICK REVIEW

[Paper Review] Multimodal Unsupervised Image-to-Image Translation

Xun Huang, Ming-Yu Liu|arXiv (Cornell University)|2018. 04. 12.

Generative Adversarial Networks and Image Synthesis|References: 74|Citations: 297

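One-Line Summary

This paper introduces MUNIT, a multimodal unsupervised image-to-image translation framework that decomposes an image into a shared content code and a domain-specific style code, enabling diverse and controllable translation without paired data.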

ABSTRACT

Unsupervised image-to-image translation is an important and challenging problem in computer vision. Given an image in the source domain, the goal is to learn the conditional distribution of corresponding images in the target domain, without seeing any pairs of corresponding images. While this conditional distribution is inherently multimodal, existing approaches make an overly simplified assumption, modeling it as a deterministic one-to-one mapping. As a result, they fail to generate diverse outputs from a given source domain image. To address this limitation, we propose a Multimodal Unsupervised Image-to-image Translation (MUNIT) framework. We assume that the image representation can be decomposed into a content code that is domain-invariant, and a style code that captures domain-specific properties. To translate an image to another domain, we recombine its content code with a random style code sampled from the style space of the target domain. We analyze the proposed framework and establish several theoretical results. Extensive experiments with comparisons to the state-of-the-art approaches further demonstrate the advantage of the proposed framework. Moreover, our framework allows users to control the style of translation outputs by providing an example style image. Code and pretrained models are available at https://github.com/nvlabs/MUNIT

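Research Motivation and Goals

Address the lack of diversity in unsupervised image-to-image translation by modeling multimodal outputs.
Propose a disentangled representation with a content code shared across domains and a domain-specific style code.
Enable example-guided translation conditioned on a target style image.
Theoretically analyze the model's latent distributions, joint distributions, and weak cycle-consistency property.
Demonstrate better image quality and diversity than state-of-the-art methods on several datasets.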

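Proposed Method

Decompose each image into a shared content code and a domain-specific style code.
Translate by recombining the source image's content code with a style code randomly sampled from the target domain's style space.
Train with a combination of adversarial losses, which align the distributions of translated and real images, and bidirectional reconstruction losses, which encourage the encoders and decoders to be inverses of each other.
Use AdaIN-based decoders whose affine parameters are generated from the style code by an MLP.
Show that content distributions become domain-invariant at the optimum and that the bidirectional reconstruction losses imply a style-augmented cycle consistency.
Evaluate with human preference, LPIPS-based diversity, and a Conditional Inception Score (CIS) tailored to multimodal outputs.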
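
The AdaIN-based decoding step listed above can be illustrated with a minimal pure-Python sketch. It is a toy under stated assumptions, not the authors' implementation: `style_to_affine` is a hypothetical single linear layer standing in for the style MLP, the weights are random rather than learned, and the 2-channel content code and 8-dimensional style code are illustrative sizes. AdaIN normalizes each channel of the content features and then applies a per-channel scale (gamma) and shift (beta) derived from the style code, so different sampled style codes give different outputs for the same content.

```python
import math
import random


def adain(feature_map, gamma, beta, eps=1e-5):
    """Adaptive instance normalization.

    feature_map: list of channels, each a flat list of spatial activations.
    gamma, beta: one scale and one shift per channel, derived from the style.
    """
    out = []
    for channel, g, b in zip(feature_map, gamma, beta):
        mean = sum(channel) / len(channel)
        var = sum((v - mean) ** 2 for v in channel) / len(channel)
        std = math.sqrt(var + eps)
        # Normalize the channel, then re-scale and shift it with style parameters.
        out.append([g * (v - mean) / std + b for v in channel])
    return out


def style_to_affine(style_code, weights, bias):
    """Hypothetical one-layer stand-in for the style MLP: style -> (gamma, beta)."""
    raw = [
        sum(w * s for w, s in zip(row, style_code)) + b
        for row, b in zip(weights, bias)
    ]
    half = len(raw) // 2
    return raw[:half], raw[half:]


rng = random.Random(0)
num_channels, style_dim = 2, 8  # illustrative sizes (assumption)

# Random, untrained toy weights: first half of the outputs is gamma, second half beta.
weights = [[rng.gauss(0, 1) for _ in range(style_dim)] for _ in range(2 * num_channels)]
bias = [1.0] * num_channels + [0.0] * num_channels

# Toy "content code": 2 channels x 4 spatial positions.
content_code = [[0.0, 1.0, 2.0, 3.0], [5.0, 5.5, 6.0, 6.5]]


def translate(content, style_code):
    """Recombine a fixed content code with a given style code."""
    gamma, beta = style_to_affine(style_code, weights, bias)
    return adain(content, gamma, beta)


# Two style codes sampled from a standard normal prior.
style_a = [rng.gauss(0, 1) for _ in range(style_dim)]
style_b = [rng.gauss(0, 1) for _ in range(style_dim)]
out_a = translate(content_code, style_a)
out_b = translate(content_code, style_b)
```

Because the content code is held fixed, any difference between `out_a` and `out_b` comes only from the sampled style codes, which is the mechanism behind multimodal outputs.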

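Experimental Results

Research Questions

RQ1: Can unsupervised image translation be made multimodal, so that it reflects the many plausible appearances in the target domain?
RQ2: Can content be shared (domain-invariant) while style stays domain-specific, supporting many-to-many mappings?
RQ3: Can the translation style be controlled through an example image, without paired data?
RQ4: Do the learned latent and joint distributions match theoretical expectations at the optimum?
RQ5: Is the proposed method competitive with supervised and unsupervised baselines in quality and diversity?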

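Key Findings

MUNIT generates diverse, high-quality translations without paired data and outperforms unsupervised baselines on several tasks.
On animal image translation, the model achieves high CIS and IS, indicating strong quality and diversity.
Ablations show that removing reconstruction losses degrades quality or diversity, and the full MUNIT matches or outperforms some supervised methods in certain settings.
Example-guided translation is supported: a style image controls the style of the translated output.
Theoretical results establish latent distribution matching and joint distribution matching at the optimum, along with a style-augmented cycle-consistency constraint.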
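
The CIS and IS mentioned in the findings above can be sketched as expected KL divergences over classifier predictions. The following is a minimal sketch, assuming class-probability vectors for the translated images are already available (the classifier itself is outside the example) and assuming the scores are plain expected KL values without an exponential, a convention worth checking against the paper. All function names and the toy probabilities are hypothetical.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions; terms with p_i == 0 contribute 0."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0
    )


def mean_distribution(dists):
    """Average a list of class-probability vectors element-wise."""
    return [sum(d[k] for d in dists) / len(dists) for k in range(len(dists[0]))]


def conditional_inception_score(probs_per_input):
    """probs_per_input[i][j]: class probabilities of the j-th translation of input i.

    Each translation is compared with the average prediction over the
    translations of the SAME input, so the score rewards per-input diversity.
    """
    scores = []
    for dists in probs_per_input:
        conditional_marginal = mean_distribution(dists)
        scores.append(
            sum(kl_divergence(d, conditional_marginal) for d in dists) / len(dists)
        )
    return sum(scores) / len(scores)


def inception_score(probs_per_input):
    """Same as above, but compared with the marginal over ALL translations."""
    marginal = mean_distribution([d for dists in probs_per_input for d in dists])
    scores = [
        sum(kl_divergence(d, marginal) for d in dists) / len(dists)
        for dists in probs_per_input
    ]
    return sum(scores) / len(scores)


# Toy case 1: a deterministic translator. Every sample for an input lands in
# the same class, so there is no diversity conditioned on the input.
deterministic = [[[1.0, 0.0], [1.0, 0.0]], [[0.0, 1.0], [0.0, 1.0]]]

# Toy case 2: a multimodal translator. Samples for each input cover both classes.
multimodal = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]

cis_deterministic = conditional_inception_score(deterministic)
cis_multimodal = conditional_inception_score(multimodal)
is_deterministic = inception_score(deterministic)
is_multimodal = inception_score(multimodal)
```

In this toy setup both translators cover both classes overall, so the unconditional score cannot tell them apart, while the conditional score is zero for the deterministic translator and positive for the multimodal one.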

This review was generated by AI and reviewed by a human editor.