QUICK REVIEW

[Paper Review] StyleDrop: Text-to-Image Generation in Any Style

Kihyuk Sohn, Nataniel Ruiz|arXiv (Cornell University)|Jun 1, 2023

Generative Adversarial Networks and Image Synthesis | Cited by 25

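One-Sentence Summary

StyleDrop fine-tunes a text-to-image transformer with adapters, using as little as a single reference image, to learn a user-specified style and generate images in it, and adds iterative training with feedback to improve style fidelity and content-style disentanglement.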

ABSTRACT

Pre-trained large text-to-image models synthesize impressive images with an appropriate use of text prompts. However, ambiguities inherent in natural language and out-of-distribution effects make it hard to synthesize image styles, that leverage a specific design pattern, texture or material. In this paper, we introduce StyleDrop, a method that enables the synthesis of images that faithfully follow a specific style using a text-to-image model. The proposed method is extremely versatile and captures nuances and details of a user-provided style, such as color schemes, shading, design patterns, and local and global effects. It efficiently learns a new style by fine-tuning very few trainable parameters (less than $1\%$ of total model parameters) and improving the quality via iterative training with either human or automated feedback. Better yet, StyleDrop is able to deliver impressive results even when the user supplies only a single image that specifies the desired style. An extensive study shows that, for the task of style tuning text-to-image models, StyleDrop implemented on Muse convincingly outperforms other methods, including DreamBooth and textual inversion on Imagen or Stable Diffusion. More results are available at our project website: https://styledrop.github.io

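Motivation and Goals

Motivate and enable faithful stylization in text-to-image generation, beyond what broad or ambiguous prompts can achieve.
Show that a very small number of fine-tuned parameters is enough to capture complex style nuances from very little data.
Propose an iterative training framework with feedback to reduce overfitting and content leakage.
Demonstrate compositional ability: combining style with content, and mixing multiple adapters to personalize both style and content.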

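Proposed Method

Uses Muse as the base text-to-image transformer, which is built on masked visual token modeling.
Applies parameter-efficient fine-tuning through adapters, learning style-specific parameters while the base model stays frozen.
Builds style prompts by combining a content description with an explicit style descriptor, which encourages content-style disentanglement.
Introduces iterative training with feedback (CLIP-based or human evaluation) to select high-quality synthesized images for retraining.
Samples from two adapters (one for style, one for content) to blend a style with a separately learned subject representation.
Provides a sampling equation that mixes the distributions of the style-adapted and content-adapted generators to produce the composite output.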
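
The last two items can be illustrated with a small sketch. It assumes each adapted generator produces a vector of logits over the visual-token vocabulary at a masked position, and that the two vectors are blended by a convex combination with a hypothetical weight `gamma`. This summary does not give the exact equation, so the form of the blend, the function names, and the toy logits are all assumptions rather than the authors' implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mix_logits(style_logits, content_logits, gamma):
    """Blend two logit vectors as (1 - gamma) * style + gamma * content.

    gamma is a hypothetical balance weight in [0, 1]: 0 uses only the
    style adapter's logits, 1 uses only the content adapter's logits.
    """
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must be in [0, 1]")
    if len(style_logits) != len(content_logits):
        raise ValueError("logit vectors must have the same length")
    return [(1.0 - gamma) * s + gamma * c
            for s, c in zip(style_logits, content_logits)]


# Toy logits over a 4-token vocabulary (made-up numbers for illustration).
style_logits = [2.0, 0.5, 0.1, -1.0]
content_logits = [0.0, 0.2, 3.0, -0.5]

mixed = mix_logits(style_logits, content_logits, gamma=0.5)
probs = softmax(mixed)  # distribution a token would be sampled from
```

Blending logits linearly and then applying softmax is equivalent to taking a renormalized weighted geometric mean of the two token distributions, so a token needs reasonable support from both generators to keep a high probability.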

Figure 1: Visualization of StyleDrop outputs generated by personalized text-to-image models for $18$ different styles. Each model is tuned on a single style reference image, which is shown in the white insert box of each image. The per-style text descriptor is appended to the content text prompt: “

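Experimental Results

Research Questions

RQ1: Can StyleDrop capture and transfer an arbitrary visual style from very few reference images?
RQ2: Does adapter-based fine-tuning outperform full fine-tuning or diffusion-based baselines for style tuning of text-to-image models?
RQ3: How does iterative training with feedback (CLIP-based or human evaluation) affect style fidelity and content disentanglement?
RQ4: Can StyleDrop support compositional generation with separate style and content adapters for flexible personalization?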

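Key Findings

StyleDrop achieves high style fidelity and content-style disentanglement with only a single reference image.
Compared with DreamBooth and textual inversion on Imagen or Stable Diffusion, StyleDrop on Muse achieves better style consistency, with competitive or better text alignment in both CLIP-based and human evaluation.
Iterative training with feedback (human feedback, HF, or CLIP feedback, CF) improves text fidelity, with some trade-off in style fidelity caused by drift in the synthetic data.
Descriptive style prompts enable fine-grained style editing and attribute-level control, beyond what rare-token approaches allow.
Sampling from two adapters renders a subject in a chosen style without jointly optimizing for content and style.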
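
The feedback step mentioned in the findings above can be sketched as a simple filter: synthesize images with the style-tuned model, score each one, and keep the best for the next round of fine-tuning. In this sketch the scores are plain numbers standing in for a CLIP-based text-alignment score or a human rating. The `Sample` type, `build_prompt`, `select_for_retraining`, the prompt template, the threshold, and `top_k` are hypothetical names and values chosen for illustration; this summary does not give the authors' actual selection criteria.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    prompt: str     # content text plus style descriptor
    image_id: str   # placeholder handle for a synthesized image
    score: float    # stand-in for a CLIP or human feedback score


def build_prompt(content, style_descriptor):
    """Append an explicit style descriptor to the content text."""
    return f"{content} in {style_descriptor} style"


def select_for_retraining(samples, top_k, min_score=0.0):
    """Keep at most top_k samples whose score is at least min_score."""
    kept = [s for s in samples if s.score >= min_score]
    kept.sort(key=lambda s: s.score, reverse=True)
    return kept[:top_k]


style = "watercolor painting"  # hypothetical style descriptor
samples = [
    Sample(build_prompt("a cat", style), "img_0", 0.31),
    Sample(build_prompt("a teapot", style), "img_1", 0.12),
    Sample(build_prompt("a city skyline", style), "img_2", 0.27),
]

# Images that pass the filter would form the training set for the next round.
selected = select_for_retraining(samples, top_k=2, min_score=0.2)
```

Because the next round trains on the model's own outputs, a filter that rewards text alignment can pull the training set away from the original reference image, which is one way to read the style-fidelity trade-off reported above.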

Figure 2: A simplified architecture of transformer layers of Muse [4] with modification to support parameter-efficient fine-tuning (PEFT) with adapter [12, 32]. $L$ layers of transformers are used to process a sequence of visual tokens in green conditioned on the text embedding $e$. Learnabl
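
A minimal sketch of the adapter idea in Figure 2, under stated assumptions: a residual bottleneck module with a down-projection, a ReLU, and an up-projection initialized to zero, so the frozen layer's output is unchanged before any training. The class name, the ReLU, the zero initialization, the toy dimensions, and the frozen-parameter count are all assumptions for illustration; the summary does not specify how the adapter is wired into Muse.

```python
import random


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def relu(vector):
    return [max(0.0, x) for x in vector]


class BottleneckAdapter:
    """Residual adapter computing h + W_up @ relu(W_down @ h)."""

    def __init__(self, d_model, d_bottleneck, seed=0):
        rng = random.Random(seed)
        # Down-projection: small random weights, shape (d_bottleneck, d_model).
        self.w_down = [[rng.gauss(0.0, 0.02) for _ in range(d_model)]
                       for _ in range(d_bottleneck)]
        # Up-projection starts at zero so the adapter is an identity map
        # at initialization (an assumed, commonly used choice).
        self.w_up = [[0.0] * d_bottleneck for _ in range(d_model)]

    def __call__(self, h):
        delta = matvec(self.w_up, relu(matvec(self.w_down, h)))
        return [x + d for x, d in zip(h, delta)]

    def num_params(self):
        down = len(self.w_down) * len(self.w_down[0])
        up = len(self.w_up) * len(self.w_up[0])
        return down + up


adapter = BottleneckAdapter(d_model=8, d_bottleneck=2)
h = [0.5, -1.0, 2.0, 0.0, 1.5, -0.5, 0.25, 3.0]  # toy hidden state
out = adapter(h)  # equals h until the up-projection is trained

trainable = adapter.num_params()   # 2*8 + 8*2 = 32
frozen = 10_000                    # made-up size of the frozen base layer
fraction = trainable / (trainable + frozen)
```

Only the adapter weights would receive gradient updates; with a narrow bottleneck the trainable share stays small, in the spirit of the abstract's "less than 1%" figure, although the numbers here are toy values.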


This review was generated by AI and reviewed by human editors.