QUICK REVIEW

[Paper Review] RenAIssance: A Survey into AI Text-to-Image Generation in the Era of Large Model

Fengxiang Bie, Yibo Yang|arXiv (Cornell University)|Sep 2, 2023

Generative Adversarial Networks and Image Synthesis | Cited by 9

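One-Sentence Summary

A comprehensive survey of text-to-image (TTI) generation methods across GANs, VAEs, and diffusion models, emphasizing how large models and multimodal encoders such as CLIP affect TTI quality, and outlining future directions.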

ABSTRACT

Text-to-image generation (TTI) refers to the use of models that can process text input and generate high-fidelity images based on text descriptions. Text-to-image generation using neural networks can be traced back to the emergence of the Generative Adversarial Network (GAN), followed by the autoregressive Transformer. Diffusion models are one prominent type of generative model used for the generation of images through the systematic introduction of noise over repeated steps. Owing to the impressive results of diffusion models on image synthesis, they have been cemented as the major image decoder used by text-to-image models and have brought text-to-image generation to the forefront of machine-learning (ML) research. In the era of large models, scaling up model size and integration with large language models have further improved the performance of TTI models, yielding generated results nearly indistinguishable from real-world images and revolutionizing the way we retrieve images. Our exploratory study has led us to think that there are further ways of scaling text-to-image models through the combination of innovative model architectures and prediction enhancement techniques. We have divided this survey into five main sections, in which we detail the frameworks of the major literature in order to delve into the different types of text-to-image generation methods. Following this, we provide a detailed comparison and critique of these methods and offer possible pathways of improvement for future work. For future work, we argue that TTI development could yield impressive productivity improvements for creation, particularly in the context of the AIGC era, and could be extended to more complex tasks such as video generation and 3D generation.
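
The abstract describes diffusion models as introducing noise over repeated steps. A minimal sketch of that forward (noising) process is given here for orientation only. It assumes the common Gaussian formulation x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * eps, and the linear schedule, its endpoints, and every function name are illustrative assumptions rather than details taken from the survey.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Hypothetical linear noise schedule; the endpoints are assumed values."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def forward_diffuse(x0, betas, rng):
    """Noise a flat list of pixel values one step at a time.

    Each step applies x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * eps,
    with eps drawn from a standard normal distribution.
    """
    x = list(x0)
    trajectory = [x]
    for beta in betas:
        x = [
            math.sqrt(1.0 - beta) * value + math.sqrt(beta) * rng.gauss(0.0, 1.0)
            for value in x
        ]
        trajectory.append(x)
    return trajectory


def signal_fraction(betas):
    """Scale factor left on the original signal after all steps."""
    product = 1.0
    for beta in betas:
        product *= 1.0 - beta
    return math.sqrt(product)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
betas = linear_beta_schedule(1000)
pixels = [0.5, -0.25, 1.0, 0.0]  # toy stand-in for an image
trajectory = forward_diffuse(pixels, betas, rng)
```

Under these assumptions, `signal_fraction(betas)` shrinks toward zero as steps accumulate, so the last entry of `trajectory` is close to pure Gaussian noise. A generative model of this family is trained to reverse such a chain, which is not shown here.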

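Research Motivation and Objectives

Introduce the key components of text-to-image (TTI) models, including generative models, language models, and vision models.
Survey how the main TTI model types (GAN, VAE, diffusion) have evolved under the influence of large models.
Compare model types using visual and statistical results to assess their strengths and weaknesses.
Discuss limitations and outline future directions, including extensions to video and 3D generation.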

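Proposed Approach

Trace the evolution of TTI from GAN-based methods to diffusion-based and large-model-enhanced methods.
Summarize the core architectures and learning objectives of VAE-, GAN-, and diffusion-based TTI models.
Explain the role of large language models and vision-language encoders such as CLIP in guiding image generation.
Compare model types using qualitative (image) and quantitative (statistical) criteria, and discuss the trade-offs.
Highlight how multimodal and multi-task learning affect TTI performance and efficiency.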
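
One simple way a joint text-image embedding space can be used is to score how well each candidate image matches a prompt. The sketch below illustrates that idea with cosine similarity. The three-dimensional vectors are made-up toy values standing in for encoder outputs, the function names are hypothetical, and no real CLIP API is called. This is not presented as the survey's own procedure.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def rerank_by_text_alignment(text_embedding, image_embeddings):
    """Return (score, name) pairs sorted from best to worst text match."""
    scored = [
        (cosine_similarity(text_embedding, embedding), name)
        for name, embedding in image_embeddings.items()
    ]
    return sorted(scored, reverse=True)


# Toy embeddings (assumed values); a real system would obtain these
# from a text encoder and an image encoder trained into a shared space.
text_embedding = [0.9, 0.1, 0.0]
candidates = {
    "candidate_a": [0.8, 0.2, 0.1],
    "candidate_b": [0.0, 0.1, 0.9],
    "candidate_c": [0.5, 0.5, 0.0],
}
ranking = rerank_by_text_alignment(text_embedding, candidates)
```

In this toy setup `ranking` orders the candidates by alignment with the prompt vector. The same similarity signal can in principle serve as a guidance or selection criterion during generation, which is a design choice outside the scope of this sketch.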

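Experimental Results

Research Questions

RQ1: What are the main architectures and components driving TTI models in the GAN, VAE, and diffusion families?
RQ2: How do large models and multimodal encoders such as CLIP affect the quality, efficiency, and diversity of TTI?
RQ3: What are the strengths and limitations of GAN, autoregressive, and diffusion approaches to text-to-image generation?
RQ4: Which directions and extensions of future TTI research, such as video or 3D generation, deserve attention?

Key Findings

Diffusion models have become the prominent choice for high-fidelity TTI generation.
Large models and multimodal encoders significantly improve the performance and capabilities of TTI.
No single model type holds an absolute advantage; each architecture has its own trade-offs in quality, efficiency, and scalability.
CLIP, along with language-vision and multimodal learning, underpins modern TTI systems and their zero-shot capabilities.
The survey consolidates cross-type comparisons (visual and statistical) and discusses strengths and weaknesses to guide future work.
Future work envisions extending TTI techniques to more complex tasks such as video and 3D generation.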


This review was generated by AI and reviewed by a human editor.