QUICK REVIEW

[Paper Digest] SEGA: Instructing Text-to-Image Models using Semantic Guidance

Manuel Brack, F. Friedrich|arXiv (Cornell University)|Jan 28, 2023

Generative Adversarial Networks and Image Synthesis|Cited by 11

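One-Sentence Summary

SEGA introduces semantic guidance for diffusion models. By manipulating a sparse set of dimensions in the noise-estimate space, it enables zero-shot, architecture-agnostic, multi-concept editing without any retraining.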

ABSTRACT

Text-to-image diffusion models have recently received a lot of interest for their astonishing ability to produce high-fidelity images from text only. However, achieving one-shot generation that aligns with the user's intent is nearly impossible, yet small changes to the input prompt often result in very different images. This leaves the user with little semantic control. To put the user in control, we show how to interact with the diffusion process to flexibly steer it along semantic directions. This semantic guidance (SEGA) generalizes to any generative architecture using classifier-free guidance. More importantly, it allows for subtle and extensive edits, changes in composition and style, as well as optimizing the overall artistic conception. We demonstrate SEGA's effectiveness on both latent and pixel-based diffusion models such as Stable Diffusion, Paella, and DeepFloyd-IF using a variety of tasks, thus providing strong evidence for its versatility, flexibility, and improvements over existing methods.

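Motivation and Objectives

Provide a formal definition of, and intuition for, semantic guidance (SEGA) in diffusion models.
Show that semantic directions in the noise-estimate space are robust, monotonic, and largely isolated.
Demonstrate that SEGA can perform subtle edits, changes in composition and style, and artistically guided concept manipulation without changing the architecture or retraining.
Evaluate SEGA against related methods and show its practical value across several generative models.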

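Proposed Method

Extend classifier-free guidance with a semantic guidance term computed from the concept-conditioned and unconditioned noise estimates.
Identify semantic directions by analyzing the difference between the epsilon estimate conditioned on a concept prompt and the unconditioned estimate.
Define a sparse, tail-based selection of epsilon dimensions (a lambda percentile threshold) to form concept vectors that are largely isolated.
Introduce a warm-up period delta and a momentum term that control when and how guidance is applied and that accelerate consistent edits.
Allow multiple concepts to be combined through a weighted sum of the individual guidance terms (gamma_i), with separate hyperparameters for each concept.
Provide an implementation-agnostic formulation that applies to both latent and pixel-level diffusion models, accompanied by a public code release.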
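The steps listed above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`sparse_direction`, `sega_step`), the default values, and the exact form of the momentum update are hypothetical choices made for this sketch, and the noise estimates are plain arrays standing in for the outputs of a real diffusion model.

```python
import numpy as np


def sparse_direction(eps_uncond, eps_concept, edit_scale=5.0, lam=0.95):
    """Sparse guidance term for one concept (hypothetical helper).

    The semantic direction is the concept-conditioned estimate minus the
    unconditioned one. Only the upper tail of its absolute values (entries
    at or above the `lam` quantile) is kept; the rest is set to zero.
    """
    psi = eps_concept - eps_uncond
    threshold = np.quantile(np.abs(psi), lam)
    mask = np.abs(psi) >= threshold
    return edit_scale * psi * mask


def sega_step(step, eps_uncond, eps_prompt, eps_concept, momentum,
              cfg_scale=7.5, edit_scale=5.0, lam=0.95, warmup=5,
              momentum_scale=0.3, momentum_beta=0.6):
    """One guided noise estimate plus the updated momentum buffer.

    Assumption for this sketch: momentum is accumulated on every step,
    including the warm-up steps, but the guidance term is only added to
    the estimate once `step >= warmup`.
    """
    gamma = sparse_direction(eps_uncond, eps_concept, edit_scale, lam)
    gamma = gamma + momentum_scale * momentum
    new_momentum = momentum_beta * momentum + (1.0 - momentum_beta) * gamma
    if step < warmup:
        gamma = np.zeros_like(gamma)  # warm-up: plain classifier-free guidance
    eps_bar = eps_uncond + cfg_scale * (eps_prompt - eps_uncond) + gamma
    return eps_bar, new_momentum
```

In a sampling loop, `momentum` would start as an array of zeros and the returned buffer would be passed to the next step. A higher `lam` keeps fewer dimensions, which is what makes the resulting concept vector sparse.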

(a) A (latent) diffusion process inherently organizes concepts and implicitly learns relationships between them, although there is no supervision.

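Experimental Results

Research Questions

RQ1: Can semantic directions be extracted from the noise-estimate space of a diffusion model without training or architectural changes?
RQ2: Are semantic guidance vectors robust, unique, monotonic, and isolated across prompts and domains?
RQ3: Can SEGA perform several edits at once, without interference between them and with controllable strength?
RQ4: How does SEGA compare with existing diffusion editing methods in edit success rate and fidelity to the original composition?
RQ5: Can SEGA mitigate undesirable content, or steer generation away from inappropriate concepts, across different architectures?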

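Key Findings

Semantic guidance vectors can be extracted from the noise estimates and applied within a single forward pass.
Guidance vectors are robust across domains and largely unique to each concept, and their effect scales monotonically with guidance strength.
Different concept vectors are largely isolated, which allows simultaneous edits without interference and enables multi-concept manipulation.
SEGA outperforms comparable methods on several editing tasks and improves fidelity to the original composition, while also supporting style transfer and object removal.
Experiments on face and I2P benchmarks show that SEGA achieves high edit success rates across architectures and strongly suppresses inappropriate content.
Qualitative and user-study evidence indicates that SEGA's edits are perceived as plausible and that its results are preferred over several baselines.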

(b) Guidance arithmetic: Guiding the image ‘a portrait of a king’ (left) using ‘king’ $-$ ‘male’ $+$ ‘female’ results in an image of a ‘queen’ (right).
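The guidance arithmetic in caption (b) can be read as a weighted sum of per-concept guidance terms, with a negative sign for the concept to remove and a positive sign for the concept to add. The sketch below illustrates that reading with NumPy. The names `EditConcept`, `concept_term`, and `combine_edits`, along with all default values, are hypothetical, and random arrays stand in for the noise estimates a real model would produce for 'male' and 'female'.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class EditConcept:
    """One edit concept with its own hyperparameters (hypothetical container)."""
    eps: np.ndarray      # noise estimate conditioned on the concept prompt
    sign: float = 1.0    # +1.0 steers toward the concept, -1.0 away from it
    weight: float = 1.0  # weight of this concept in the combined sum
    scale: float = 5.0   # per-concept guidance strength
    lam: float = 0.95    # quantile threshold for the sparse selection


def concept_term(eps_uncond, eps_concept, scale, lam, sign):
    """Signed, sparse guidance term for a single concept."""
    psi = sign * (eps_concept - eps_uncond)
    mask = np.abs(psi) >= np.quantile(np.abs(psi), lam)
    return scale * psi * mask


def combine_edits(eps_uncond, edits):
    """Weighted sum of the individual guidance terms."""
    total = np.zeros_like(eps_uncond)
    for e in edits:
        total += e.weight * concept_term(eps_uncond, e.eps, e.scale, e.lam, e.sign)
    return total


# Stand-in estimates; a real pipeline would query the model with each prompt.
rng = np.random.default_rng(0)
eps_uncond, eps_male, eps_female = (rng.standard_normal(64) for _ in range(3))

# 'king' - 'male' + 'female': the prompt supplies 'king', the edits do the rest.
guidance = combine_edits(eps_uncond, [
    EditConcept(eps_male, sign=-1.0),
    EditConcept(eps_female, sign=+1.0),
])
```

The combined `guidance` array would then be added to the classifier-free guidance estimate for the prompt 'a portrait of a king'. Because each concept carries its own `scale` and `lam`, the strength of each edit can be tuned independently.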


This digest was generated by AI and reviewed by human editors.