QUICK REVIEW

[Paper Review] TediGAN: Text-Guided Diverse Face Image Generation and Manipulation

Weihao Xia, Yujiu Yang|arXiv (Cornell University)|Dec 6, 2020

Generative Adversarial Networks and Image Synthesis | References: 43 | Cited by: 31

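One-Sentence Summary

TediGAN unifies text-guided image generation and manipulation in a single framework by learning GAN inversion on multimodal inputs: it aligns visual and linguistic embeddings in StyleGAN's latent space and applies instance-level optimization to preserve identity, yielding high-quality 1024x1024 results and multimodal synthesis.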

ABSTRACT

In this work, we propose TediGAN, a novel framework for multi-modal image generation and manipulation with textual descriptions. The proposed method consists of three components: StyleGAN inversion module, visual-linguistic similarity learning, and instance-level optimization. The inversion module maps real images to the latent space of a well-trained StyleGAN. The visual-linguistic similarity learns the text-image matching by mapping the image and text into a common embedding space. The instance-level optimization is for identity preservation in manipulation. Our model can produce diverse and high-quality images with an unprecedented resolution at 1024. Using a control mechanism based on style-mixing, our TediGAN inherently supports image synthesis with multi-modal inputs, such as sketches or semantic labels, with or without instance guidance. To facilitate text-guided multi-modal synthesis, we propose the Multi-Modal CelebA-HQ, a large-scale dataset consisting of real face images and corresponding semantic segmentation map, sketch, and textual descriptions. Extensive experiments on the introduced dataset demonstrate the superior performance of our proposed method. Code and data are available at https://github.com/weihaox/TediGAN.

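Motivation and Objectives

Establish the need for high-quality, flexible text-guided face generation and manipulation.
Develop a unified framework that supports both generation and manipulation from text within the same model.
Use GAN inversion to map real images into StyleGAN's latent space so that edits are semantically meaningful.
Learn a cross-modal embedding that aligns visual and linguistic representations in a shared space.
Preserve identity during manipulation through instance-level optimization.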

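Proposed Method

StyleGAN inversion module that maps real images into StyleGAN's W latent space, trained with pixel-level and semantic-level reconstruction losses.
Visual-linguistic similarity learning that projects images and text into a common W space with hierarchical (layer-wise) latent codes.
Instance-level optimization that refines the inverted code while regularizing it toward the encoder's semantic domain.
Style-mixing-based control mechanism that performs generation or manipulation by swapping selected StyleGAN layers.
Support for multimodal inputs (sketches, labels, images) by treating them as style codes and applying layer-wise mixing.
The Multi-Modal CelebA-HQ dataset, containing images, segmentation maps, sketches, and textual descriptions, for training and evaluation.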
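
The layer-swapping idea behind the style-mixing control mechanism can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `style_mix`, the random arrays standing in for encoder outputs, and the choice of which layers to swap are all assumptions made for illustration. The only facts relied on are that a 1024x1024 StyleGAN generator consumes 18 per-layer style vectors of dimension 512.

```python
import numpy as np

NUM_LAYERS = 18   # per-layer style vectors in a 1024x1024 StyleGAN generator
LATENT_DIM = 512  # dimensionality of each style vector

def style_mix(content_code, attribute_code, layers):
    """Return a copy of content_code whose selected layers come from attribute_code.

    Hypothetical helper: both codes are (num_layers, dim) arrays of layer-wise
    latent codes; `layers` is an iterable of layer indices to swap.
    """
    if content_code.shape != attribute_code.shape:
        raise ValueError("codes must share the same (num_layers, dim) shape")
    mixed = content_code.copy()          # leave the input code untouched
    idx = list(layers)
    mixed[idx] = attribute_code[idx]     # replace only the selected layers
    return mixed

# Random stand-ins for the codes an image encoder and a text encoder would produce.
rng = np.random.default_rng(0)
image_code = rng.standard_normal((NUM_LAYERS, LATENT_DIM))
text_code = rng.standard_normal((NUM_LAYERS, LATENT_DIM))

# Assumed split for illustration: keep layers 0-7 from the image code and take
# layers 8-17 from the text code. Which layers to swap is a design choice.
mixed = style_mix(image_code, text_code, layers=range(8, NUM_LAYERS))
print(mixed.shape)
```

In a real pipeline the mixed code would be passed to the StyleGAN synthesis network; swapping all layers would correspond to generation from the second modality alone, while swapping a subset keeps the remaining layers' content from the original image.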

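Experimental Results

Research Questions

RQ1: Can a single framework jointly perform text-driven image generation and manipulation at high resolution?
RQ2: How can multimodal inputs (text, sketches, labels) be integrated into a shared latent space for controllable synthesis?
RQ3: Does instance-level optimization improve identity preservation in text-guided manipulation?
RQ4: Which datasets and evaluation metrics best reflect the performance of multimodal text-guided face synthesis?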

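Key Findings

Produces diverse, high-quality face images at 1024^2 resolution.
Outperforms state-of-the-art methods on text-to-image generation on Multi-Modal CelebA-HQ in FID, LPIPS, accuracy, and realism.
Outperforms ManiGAN on text-guided image manipulation in FID, accuracy, and realism.
Demonstrates effective multimodal synthesis through style mixing across input modalities.
Shows through layer-wise analysis that attributes, from high-level to fine-grained, align with the StyleGAN layer hierarchy.
Introduces the Multi-Modal CelebA-HQ dataset to advance research on text- and modality-guided synthesis.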

This review was generated by AI and reviewed by human editors.