QUICK REVIEW

[Paper Review] UFC-BERT: Unifying Multi-Modal Controls for Conditional Image Synthesis

Zhu Zhang, Jianxin Ma | arXiv (Cornell University) | May 29, 2021

Topic: Generative Adversarial Networks and Image Synthesis | References: 65 | Cited by: 23

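One-Sentence Summary

UFC-BERT is a non-autoregressive, two-stage framework that unifies multiple multi-modal controls (text, reference images, and image blocks to preserve) by representing all inputs and outputs as sequences of discrete tokens processed by a Transformer; experiments on the M2C-Fashion and Multi-Modal CelebA-HQ datasets indicate that it generates high-fidelity, consistent images while running faster and complying better with complex controls than autoregressive baselines.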

ABSTRACT

Conditional image synthesis aims to create an image according to some multi-modal guidance in the forms of textual descriptions, reference images, and image blocks to preserve, as well as their combinations. In this paper, instead of investigating these control signals separately, we propose a new two-stage architecture, UFC-BERT, to unify any number of multi-modal controls. In UFC-BERT, both the diverse control signals and the synthesized image are uniformly represented as a sequence of discrete tokens to be processed by Transformer. Different from existing two-stage autoregressive approaches such as DALL-E and VQGAN, UFC-BERT adopts non-autoregressive generation (NAR) at the second stage to enhance the holistic consistency of the synthesized image, to support preserving specified image blocks, and to improve the synthesis speed. Further, we design a progressive algorithm that iteratively improves the non-autoregressively generated image, with the help of two estimators developed for evaluating the compliance with the controls and evaluating the fidelity of the synthesized image, respectively. Extensive experiments on a newly collected large-scale clothing dataset M2C-Fashion and a facial dataset Multi-Modal CelebA-HQ verify that UFC-BERT can synthesize high-fidelity images that comply with flexible multi-modal controls.

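Research Motivation and Goals

Unify multiple multi-modal controls (text, reference images, and image blocks) in a single conditional image synthesis framework.
Address the limitations of autoregressive generation in speed, holistic image consistency, and preservation of specified image blocks.
Develop a non-autoregressive generation strategy that speeds up inference while maintaining high image quality.
Introduce a progressive refinement mechanism with dedicated estimators for control compliance and image fidelity.
Validate the framework on large-scale, diverse datasets, including M2C-Fashion and Multi-Modal CelebA-HQ.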

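Proposed Method

Represent all input controls (text, reference images, image blocks) and the output image as sequences of discrete tokens, so that a Transformer encoder can process them uniformly.
Use a two-stage architecture: in the first stage, a conditional VQ-VAE encodes the controls and produces latent codes; in the second stage, a non-autoregressive Transformer generates the image tokens directly.
Introduce a progressive refinement algorithm that iteratively improves the non-autoregressively generated image with the help of two estimators.
Use a control compliance estimator to measure how well the result agrees with the input controls (text, reference images, image blocks), and a fidelity estimator to assess perceptual quality.
Rely on the learned prior distribution and iterative refinement to improve image quality without resorting to autoregressive generation.
Use a discrete token space for both the controls and the generated image, which supports end-to-end training and unified modeling.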
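The progressive refinement described above can be illustrated with a small mask-predict style loop. This is only a sketch under stated assumptions, not the paper's algorithm: `toy_predictor` stands in for the non-autoregressive Transformer, `toy_score` stands in for the sum of the compliance and fidelity estimators, and the linear re-masking schedule, the `MASK` id, and all function names are hypothetical choices made for illustration.

```python
import random

MASK = -1  # hypothetical id marking an image-token position that is still undecided


def toy_predictor(tokens, rng, vocab_size=16):
    """Stand-in for the NAR Transformer: fills every masked position in one pass.

    Returns the filled sequence and a per-position confidence in [0, 1].
    """
    filled, conf = [], []
    for t in tokens:
        if t == MASK:
            filled.append(rng.randrange(vocab_size))
            conf.append(rng.random())
        else:
            filled.append(t)
            conf.append(1.0)
    return filled, conf


def toy_score(tokens):
    """Stand-in for the two estimators (assumed to be combined by a plain sum)."""
    compliance = sum(t % 2 == 0 for t in tokens) / len(tokens)
    pairs = list(zip(tokens, tokens[1:]))
    fidelity = sum(abs(a - b) <= 4 for a, b in pairs) / max(1, len(pairs))
    return compliance + fidelity


def progressive_generate(length, preserved, predictor, score,
                         steps=4, samples=3, seed=0):
    """Iteratively fill masked positions, keeping preserved blocks fixed."""
    rng = random.Random(seed)
    tokens = [preserved.get(i, MASK) for i in range(length)]
    n_free = length - len(preserved)
    for step in range(1, steps + 1):
        # Linear schedule: how many positions may remain masked after this step.
        n_masked_after = n_free * (steps - step) // steps
        # Draw several parallel candidates and keep the best-scoring one.
        best = None
        for _ in range(samples):
            filled, conf = predictor(tokens, rng)
            s = score(filled)
            if best is None or s > best[0]:
                best = (s, filled, conf)
        _, filled, conf = best
        # Re-mask the least confident of the positions predicted in this step.
        newly = sorted((i for i in range(length) if tokens[i] == MASK),
                       key=lambda i: conf[i])
        remask = set(newly[:n_masked_after])
        tokens = [MASK if i in remask else filled[i] for i in range(length)]
    return tokens


result = progressive_generate(16, {0: 5, 1: 7}, toy_predictor, toy_score)
print(result)
```

Positions listed in `preserved` are never masked, so they pass through unchanged; this mirrors the block-preservation property that the summary attributes to non-autoregressive generation. The number of decoding passes is `steps * samples` regardless of sequence length, whereas an autoregressive decoder would need one pass per token.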

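Experimental Results

Research Questions

RQ1: Can a unified framework effectively handle diverse multi-modal controls (text, reference images, image blocks) in conditional image synthesis?
RQ2: Compared with autoregressive baselines, does non-autoregressive generation in the second stage yield a significant improvement in image consistency and generation speed?
RQ3: Can iterative refinement driven by dedicated estimators produce high-fidelity images while still honoring complex control signals?
RQ4: How well does the framework generalize on large-scale, diverse datasets that contain complex combinations of controls?
RQ5: How large is the method's advantage over existing two-stage autoregressive models in fidelity, speed, and control compliance?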

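Key Findings

UFC-BERT generates high-fidelity images and complies closely with multi-modal controls, including the preservation of complex image blocks.
Compared with autoregressive models, the non-autoregressive generation stage substantially speeds up inference while maintaining image quality.
Progressive refinement with dedicated estimators improves image quality and control alignment over successive iterations.
On the M2C-Fashion and Multi-Modal CelebA-HQ datasets, UFC-BERT outperforms existing two-stage autoregressive models on fidelity and consistency metrics.
The framework generalizes robustly across diverse combinations of controls, including text, reference images, and image blocks.
The unified token-based representation makes it possible to model heterogeneous control signals effectively within a single Transformer architecture.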
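To make the unified token-based representation more concrete, here is a minimal sketch of how text, a reference image, and a partially preserved target image might be packed into one sequence. Everything in it is an assumption for illustration: the special tokens (`[TXT]`, `[REF]`, `[IMG]`, `[MASK]`), the vocabulary sizes, and the id offsets are hypothetical and are not taken from the paper.

```python
# Hypothetical special tokens and vocabulary sizes (illustrative only).
SPECIAL = {"[TXT]": 0, "[REF]": 1, "[IMG]": 2, "[MASK]": 3}
TEXT_VOCAB = 1000
IMAGE_VOCAB = 1024

# Shift each modality into its own id range so one embedding table can serve all.
TEXT_OFFSET = len(SPECIAL)
IMAGE_OFFSET = TEXT_OFFSET + TEXT_VOCAB


def build_sequence(text_ids, ref_codes, target_len, preserved):
    """Pack all controls and the target image into one discrete token sequence.

    text_ids   : token ids of the textual description
    ref_codes  : codebook indices of the reference image
    target_len : number of image-token positions to synthesize
    preserved  : {position: codebook index} for image blocks that must be kept
    """
    seq = [SPECIAL["[TXT]"]] + [TEXT_OFFSET + t for t in text_ids]
    seq += [SPECIAL["[REF]"]] + [IMAGE_OFFSET + c for c in ref_codes]
    seq.append(SPECIAL["[IMG]"])
    for pos in range(target_len):
        if pos in preserved:
            # Preserved blocks enter the sequence as known image tokens.
            seq.append(IMAGE_OFFSET + preserved[pos])
        else:
            # Everything else is left for the Transformer to predict.
            seq.append(SPECIAL["[MASK]"])
    return seq


seq = build_sequence(text_ids=[5, 9], ref_codes=[100], target_len=4, preserved={1: 7})
print(seq)  # [0, 9, 13, 1, 1104, 2, 3, 1011, 3, 3]
```

Because any control can be omitted by passing an empty list or an empty `preserved` mapping, the same layout covers arbitrary combinations of controls, which is the property the summary highlights.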

This review was generated by AI and reviewed by human editors.