QUICK REVIEW

[Paper review] ImageBART: Bidirectional Context with Multinomial Diffusion for Autoregressive Image Synthesis

Patrick Esser, Robin Rombach|arXiv (Cornell University)|Aug 19, 2021

Generative Adversarial Networks and Image Synthesis | References: 68 | Cited by: 51

One-sentence summary

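ImageBART introduces a coarse-to-fine hierarchical framework that injects bidirectional context into autoregressive image synthesis by inverting a multinomial diffusion process, enabling high-fidelity generation and flexible local editing.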

ABSTRACT

Autoregressive models and their sequential factorization of the data likelihood have recently demonstrated great potential for image representation and synthesis. Nevertheless, they incorporate image context in a linear 1D order by attending only to previously synthesized image patches above or to the left. Not only is this unidirectional, sequential bias of attention unnatural for images as it disregards large parts of a scene until synthesis is almost complete. It also processes the entire image on a single scale, thus ignoring more global contextual information up to the gist of the entire scene. As a remedy we incorporate a coarse-to-fine hierarchy of context by combining the autoregressive formulation with a multinomial diffusion process: Whereas a multistage diffusion process successively removes information to coarsen an image, we train a (short) Markov chain to invert this process. In each stage, the resulting autoregressive ImageBART model progressively incorporates context from previous stages in a coarse-to-fine manner. Experiments show greatly improved image modification capabilities over autoregressive models while also providing high-fidelity image generation, both of which are enabled through efficient training in a compressed latent space. Specifically, our approach can take unrestricted, user-provided masks into account to perform local image editing. Thus, in contrast to pure autoregressive models, it can solve free-form image inpainting and, in the case of conditional models, local, text-guided image modification without requiring mask-specific training.

Motivation and objectives

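Overcome the unidirectional attention bias of autoregressive image generation by introducing bidirectional context.
Develop a coarse-to-fine hierarchical model in which a fixed multinomial diffusion process coarsens the image and supplies global context to the autoregressive steps.
Enable flexible conditional image synthesis and local, user-guided editing without mask-specific training.
Achieve high-fidelity generation by training a Markov chain in a discrete latent space to invert the diffusion process.
Demonstrate improved modification capabilities and competitive sample quality across diverse datasets.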

Proposed method

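Learn a hierarchy of distributions p^t_theta over a sequence x_{0:T}, where x_0 is the data and x_T is the coarsest representation, such that x_{t-1} ~ p^{t-1}_theta(x_{t-1} | x_t).
Use a forward multinomial diffusion process q_theta that progressively corrupts x_{t-1} into x_t, which yields a tractable upper bound on the KL divergence and an ELBO-based training objective (Eq. 2).
The first stage (L_1) learns a discrete, compressed representation of images with a vector-quantized autoencoder, trained with a reconstruction loss and an adversarial realism term (L_rec, L_adv).
Later stages (L_t, t>1) model finer levels using global context from the coarser representation: an autoregressive decoder is conditioned on x_t and attends, via cross-attention, to an encoder representation of it.
Model each reverse process p^{t-1}_theta(x_{t-1} | x_t) autoregressively with an encoder-decoder transformer, which provides bidirectional context without requiring weights to be shared across all reverse steps.
Handle the forward process q_theta with binomial/multinomial diffusion steps that use a fixed beta_t, which allows the KL terms to be computed analytically for t>2 and estimated by Monte Carlo for t=2 (Eqs. 7–8).
Train the levels of the hierarchy in parallel to avoid heavy loss weighting and gradient noise, choosing T per dataset (e.g., T=3 for FFHQ, T=5 for class-conditional ImageNet).
Support flexible conditioning by prepending conditioning tokens to the input of p^{t-1}_theta, covering class-conditional and text-to-image generation (Sec. 4.2).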
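The forward corruption described above can be illustrated with a minimal Python sketch. It assumes the standard multinomial diffusion kernel, in which each token is kept with probability 1 - beta_t and otherwise resampled uniformly from a vocabulary of size K, so that q(x_t = x_0 | x_0) = alpha_bar_t + (1 - alpha_bar_t) / K with alpha_bar_t the product of (1 - beta_s) up to step t. The function names, the beta schedule, and the vocabulary size are hypothetical and are not taken from the paper.

```python
import random


def forward_step(tokens, beta, vocab_size, rng):
    """One multinomial diffusion step: keep each token with probability
    1 - beta, otherwise resample it uniformly from the vocabulary."""
    return [
        rng.randrange(vocab_size) if rng.random() < beta else tok
        for tok in tokens
    ]


def keep_probability(betas):
    """alpha_bar: probability that a token was never resampled in any step."""
    alpha_bar = 1.0
    for beta in betas:
        alpha_bar *= 1.0 - beta
    return alpha_bar


def marginal_same_token_prob(betas, vocab_size):
    """Closed-form q(x_t = x_0 | x_0) = alpha_bar + (1 - alpha_bar) / K."""
    alpha_bar = keep_probability(betas)
    return alpha_bar + (1.0 - alpha_bar) / vocab_size


def diffuse(tokens, betas, vocab_size, seed=0):
    """Return the whole chain [x_0, x_1, ..., x_T] of token sequences."""
    rng = random.Random(seed)
    chain = [list(tokens)]
    for beta in betas:
        chain.append(forward_step(chain[-1], beta, vocab_size, rng))
    return chain


if __name__ == "__main__":
    # Hypothetical schedule: the last step (beta = 1.0) erases all information.
    for level, x in enumerate(diffuse([1, 2, 3, 4], [0.1, 0.5, 1.0], 16)):
        print(level, x)
```

Because the forward chain has no learned parameters in this sketch, any level x_t can be sampled directly from x_0, which is what lets the per-level reverse models be trained independently of one another.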

Experimental results

Research questions

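RQ1: How can bidirectional global context be introduced into autoregressive image synthesis without breaking the tractable density factorization?
RQ2: Does a coarse-to-fine, discrete, hierarchical diffusion framework improve image fidelity and editing capability compared with purely autoregressive or pixel-space diffusion models?
RQ3: Can the model support flexible conditioning (class labels, text) and free-form, mask-based local editing without mask-specific training?
RQ4: What are the trade-offs among the number of diffusion steps, model capacity, and sampling speed in such a hierarchy?
RQ5: How does the method perform across datasets on unconditional and conditional generation tasks?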

Key findings

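ImageBART achieves high-fidelity image synthesis by progressively incorporating global context through a coarse-to-fine hierarchy, improving coherence over purely autoregressive models.
Pairing a multinomial-diffusion forward process with an autoregressive reverse process makes training efficient and integrates large-scale context without high sample complexity.
The model supports several kinds of conditioning (class labels and text) and local editing, including free-form, mask-based inpainting, without task-specific training for masks.
Empirical results are competitive with or better than prior likelihood-based and score-based methods on several datasets, particularly for complex scenes (e.g., ImageNet, LSUN variants).
Varying the number of diffusion steps (T) reveals a trade-off: more steps improve modification and global coherence, but unconditional generation shows diminishing returns beyond a moderate T.
Empirically, scaling training independently across levels, with a fixed forward diffusion at each level, allows parallel optimization and stable training.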
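The mask-based editing finding can be made concrete with a minimal Python sketch of one plausible control flow, which this summary does not confirm as the authors' exact procedure: corrupt only the masked token positions with the forward process, then run the reverse stages from coarse to fine while copying the unmasked tokens back from the original after every stage. The `reverse_stages` callables stand in for trained per-level models and are purely hypothetical, as are all names and parameter values.

```python
import random


def corrupt_masked(tokens, mask, betas, vocab_size, rng):
    """Apply the forward corruption only at positions where mask is True."""
    x = list(tokens)
    for beta in betas:
        x = [
            rng.randrange(vocab_size) if (m and rng.random() < beta) else tok
            for tok, m in zip(x, mask)
        ]
    return x


def edit_with_mask(x0, mask, betas, reverse_stages, vocab_size, seed=0):
    """Resynthesize masked positions; unmasked positions always equal x0.

    reverse_stages[i] is a placeholder callable mapping x_{i+1} to x_i,
    so the list is traversed in reverse to go from coarse to fine.
    """
    rng = random.Random(seed)
    x = corrupt_masked(x0, mask, betas, vocab_size, rng)
    for stage in reversed(reverse_stages):
        proposal = stage(x)
        # Accept the model's proposal only inside the user-provided mask.
        x = [p if m else orig for p, m, orig in zip(proposal, mask, x0)]
    return x


if __name__ == "__main__":
    # Dummy stage used only to exercise the control flow (not a real model).
    dummy_stage = lambda seq: [(tok + 1) % 16 for tok in seq]
    edited = edit_with_mask(
        [5, 6, 7, 8], [False, True, True, False],
        betas=[0.5, 1.0], reverse_stages=[dummy_stage, dummy_stage],
        vocab_size=16,
    )
    print(edited)
```

Since each stage sees the whole coarser sequence, including the untouched tokens on every side of the mask, the resynthesized region is conditioned on context in all directions, which is the property the summary credits for mask-agnostic editing.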

This review was generated by AI and reviewed by human editors.