QUICK REVIEW

[Paper Review] Text-Guided Synthesis of Artistic Images with Retrieval-Augmented Diffusion Models

Robin Rombach, Andreas Blattmann|arXiv (Cornell University)|Jul 26, 2022

Advanced Image and Video Retrieval Techniques | Cited by 34

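One-Sentence Summary

This paper achieves zero-shot, text-guided stylization with diffusion models by replacing the retrieval database with a style-specific image collection at inference time. This allows artistic image synthesis without retraining and outperforms postfix-based prompting at fine-grained stylization. It provides LAION- and WikiArt/ArtBench-based setups along with open-source code and model weights.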

ABSTRACT

Novel architectures have recently improved generative image synthesis leading to excellent visual quality in various tasks. Of particular note is the field of "AI-Art", which has seen unprecedented growth with the emergence of powerful multimodal models such as CLIP. By combining speech and image synthesis models, so-called "prompt-engineering" has become established, in which carefully selected and composed sentences are used to achieve a certain visual style in the synthesized image. In this note, we present an alternative approach based on retrieval-augmented diffusion models (RDMs). In RDMs, a set of nearest neighbors is retrieved from an external database during training for each training instance, and the diffusion model is conditioned on these informative samples. During inference (sampling), we replace the retrieval database with a more specialized database that contains, for example, only images of a particular visual style. This provides a novel way to prompt a general trained model after training and thereby specify a particular visual style. As shown by our experiments, this approach is superior to specifying the visual style within the text prompt. We open-source code and model weights at https://github.com/CompVis/latent-diffusion .

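Research Motivation and Goals

Provide a controllable, post-hoc stylization method for diffusion models that reduces the need for retraining.
Use retrieval-augmented diffusion models (RDMs) to condition generation on informative image samples drawn from an external database.
Show that swapping the training database for a style-specific one at inference time is enough to give fine-grained control over style.
Show that the CLIP-based joint text-image space allows the style to be specified through natural language.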

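Proposed Method

Start from a retrieval-augmented diffusion model whose training database is OpenImages (for the ImageNet variant) or LAION-2B-en.
At inference time, replace the training database with a style-specific dataset (WikiArt) or a style subset of ArtBench to achieve stylization.
Query the CLIP image embedding space and retrieve the k nearest neighbors (k=19) from the style database for conditioning.
During both training and inference, condition the diffusion model through cross-attention over the retrieved CLIP embeddings.
Evaluate stylization quality with a style classifier trained on ArtBench, comparing against postfix-based prompting.
Release open-source code and model weights for reproducibility.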
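
The retrieval and database-swap steps above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the 3-dimensional vectors are made-up stand-ins for CLIP embeddings, the names `retrieve_neighbors`, `train_db`, and `style_db` are hypothetical, and k is set to 2 instead of 19 so the toy output stays readable. It assumes nearest neighbors are ranked by cosine similarity, which is the usual choice for CLIP embeddings.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def retrieve_neighbors(query, database, k):
    """Return the names of the k database entries most similar to the query."""
    q = normalize(query)
    scored = []
    for name, embedding in database.items():
        e = normalize(embedding)
        similarity = sum(a * b for a, b in zip(q, e))
        scored.append((similarity, name))
    scored.sort(reverse=True)  # highest cosine similarity first
    return [name for _, name in scored[:k]]


# Hypothetical toy embeddings; a real system would store CLIP image embeddings.
train_db = {
    "photo_dog": [0.9, 0.1, 0.0],
    "photo_cat": [0.8, 0.2, 0.1],
    "photo_car": [0.0, 0.1, 0.9],
}
style_db = {
    "ukiyo_e_wave": [0.1, 0.9, 0.2],
    "ukiyo_e_dog": [0.7, 0.6, 0.0],
    "ukiyo_e_mountain": [0.0, 0.8, 0.5],
}

# Stand-in for the CLIP embedding of a prompt such as "a dog" (assumed values).
query = [1.0, 0.2, 0.0]

K = 2  # the summary reports k=19; 2 keeps the toy example small
train_neighbors = retrieve_neighbors(query, train_db, K)
style_neighbors = retrieve_neighbors(query, style_db, K)  # same model, swapped database
print(train_neighbors)
print(style_neighbors)
```

Only the `database` argument changes between the two calls, which mirrors the point of the method: the model itself is untouched, and the style comes entirely from which collection the neighbors are drawn from.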

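Experimental Results

Research Questions

RQ1: Can retrieval-augmented diffusion models achieve zero-shot stylization by swapping the external database at inference time?
RQ2: Can CLIP-based retrieval give fine-grained, style-specific control over generated artwork without any additional training?
RQ3: How does retrieval-based stylization compare with conventional postfix style prompts in accuracy and style recognizability?
RQ4: How does the choice of style dataset (WikiArt, ArtBench) affect synthesis quality and controllability?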

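Key Findings

Zero-shot stylization is achieved simply by replacing the training database with a style-specific database at inference time.
Across the artistic styles tested, retrieval-based stylization gives better fine-grained style control than postfix style prompts.
According to a style classifier trained on ArtBench, retrieval-based outputs align more closely with the target style than outputs from postfix-based prompts (a quantitative comparison is reported).
Two model configurations are explored: an ImageNet-like RDM and a larger RDM based on LAION-2B-en, with compatible retrieval settings (k=19 neighbors).
The method supports post-hoc stylization without retraining and achieves targeted stylization through specialized datasets (WikiArt, ArtBench).
Code and model weights are released so that artists can extend and evaluate the method.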
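
The classifier-based comparison can be illustrated with a small sketch. It assumes the metric is the fraction of generated images that the style classifier assigns to the intended style, which is one plausible reading of the evaluation described above. The prediction lists are invented placeholders, not results from the paper, and `style_accuracy` is a hypothetical helper name.

```python
def style_accuracy(predicted_styles, target_style):
    """Fraction of generated images classified as the intended style."""
    if not predicted_styles:
        raise ValueError("need at least one prediction")
    hits = sum(1 for style in predicted_styles if style == target_style)
    return hits / len(predicted_styles)


# Placeholder classifier outputs for four images per method (NOT paper data).
retrieval_preds = ["ukiyo_e", "ukiyo_e", "ukiyo_e", "impressionism"]
postfix_preds = ["ukiyo_e", "realism", "impressionism", "ukiyo_e"]

retrieval_acc = style_accuracy(retrieval_preds, "ukiyo_e")
postfix_acc = style_accuracy(postfix_preds, "ukiyo_e")
print(f"retrieval-based: {retrieval_acc:.2f}, postfix-based: {postfix_acc:.2f}")
```

In practice the predictions would come from running the ArtBench-trained classifier on images generated by each method for the same prompts and target style.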

This review was generated by AI and reviewed by a human editor.