QUICK REVIEW

[Paper Review] Improving Diffusion-Based Image Synthesis with Context Prediction

L. Yang, Jingwei Liu|arXiv (Cornell University)|Jan 4, 2024

Generative Adversarial Networks and Image Synthesis|Cited by 8

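One-Sentence Summary

Introduces ConPreDiff, a context prediction framework for diffusion models. During training, a context decoder reinforces each pixel/token to predict its neighborhood context, which improves unconditional, text-to-image, and inpainting generation without adding inference cost.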

ABSTRACT

Diffusion models are a new class of generative models, and have dramatically promoted image generation with unprecedented quality and diversity. Existing diffusion models mainly try to reconstruct the input image from a corrupted one with a pixel-wise or feature-wise constraint along spatial axes. However, such point-based reconstruction may fail to make each predicted pixel/feature fully preserve its neighborhood context, impairing diffusion-based image synthesis. As a powerful source of automatic supervisory signal, context has been well studied for learning representations. Inspired by this, we for the first time propose ConPreDiff to improve diffusion-based image synthesis with context prediction. We explicitly reinforce each point to predict its neighborhood context (i.e., multi-stride features/tokens/pixels) with a context decoder at the end of diffusion denoising blocks in the training stage, and remove the decoder for inference. In this way, each point can better reconstruct itself by preserving its semantic connections with neighborhood context. This new paradigm of ConPreDiff can generalize to arbitrary discrete and continuous diffusion backbones without introducing extra parameters in the sampling procedure. Extensive experiments are conducted on unconditional image generation, text-to-image generation and image inpainting tasks. Our ConPreDiff consistently outperforms previous methods and achieves new SOTA text-to-image generation results on MS-COCO, with a zero-shot FID score of 6.21.

Research Motivation and Objectives

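Identify and address a limitation of point-wise reconstruction in diffusion models, which can neglect local neighborhood context.
Propose a context prediction mechanism that, during training, reinforces each point to infer its neighborhood context.
Develop an efficient neighborhood context decoding strategy based on distribution prediction and the Wasserstein distance.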

Proposed Method

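Attach a context decoder at the end of the denoising blocks, used only during training, to predict each point's multi-stride neighborhood context.
Represent neighborhood information as a distribution over multi-stride neighbors, decoded by a neural network.
Use a Wasserstein-distance loss to align the decoded neighborhood distribution with the true context, which makes decoding large contexts efficient.
Recast neighborhood prediction as distribution prediction to avoid a prohibitive growth in parameter count.
Provide a theoretical connection showing that, under specific aggregation conditions, the ConPreDiff loss upper-bounds the standard DDPM objective.
Generalize ConPreDiff to both discrete and continuous diffusion backbones by adding a context loss term during training, leaving inference unchanged.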
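
The idea of adding a distribution-matching context term to a point-wise loss can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `wasserstein_1d` and `training_loss`, the `context_weight` parameter, and the use of scalar values with a one-dimensional Wasserstein-1 distance are all hypothetical choices made for brevity. The paper's decoder, feature dimensionality, and exact distance computation are not described in this review.

```python
from typing import Sequence


def wasserstein_1d(pred: Sequence[float], target: Sequence[float]) -> float:
    """Wasserstein-1 distance between two equal-size, uniformly weighted
    1-D sample sets.

    In one dimension the optimal transport plan pairs sorted samples, so
    the distance is the mean absolute gap between order statistics.
    """
    if len(pred) != len(target) or not pred:
        raise ValueError("sample sets must be non-empty and equal in size")
    gaps = (abs(p - t) for p, t in zip(sorted(pred), sorted(target)))
    return sum(gaps) / len(pred)


def training_loss(
    pred_points: Sequence[float],
    true_points: Sequence[float],
    pred_contexts: Sequence[Sequence[float]],
    true_contexts: Sequence[Sequence[float]],
    context_weight: float = 1.0,  # assumed hyperparameter, not from the source
) -> float:
    """Point-wise squared error plus a weighted neighborhood-context term."""
    n = len(true_points)
    # Standard point-based reconstruction term.
    point_loss = sum((p - t) ** 2 for p, t in zip(pred_points, true_points)) / n
    # Context term: compare each point's predicted and true neighbor values
    # as distributions rather than position by position.
    context_loss = sum(
        wasserstein_1d(pc, tc) for pc, tc in zip(pred_contexts, true_contexts)
    ) / n
    return point_loss + context_weight * context_loss


# Toy example: two points, each with three neighbor values.
pred_points = [0.5, 1.0]
true_points = [0.0, 1.0]
pred_contexts = [[0.0, 2.0, 1.0], [3.0, 1.0, 2.0]]
true_contexts = [[1.0, 0.0, 2.0], [1.0, 2.0, 4.0]]
loss = training_loss(pred_points, true_points, pred_contexts, true_contexts)
```

Because the samples are sorted before comparison, the context term ignores neighbor ordering and treats each neighborhood as a set of values, which is one simple way to read "distribution prediction". In this sketch only `pred_points` would be needed at sampling time, mirroring the claim that the context decoder is dropped for inference.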

Experimental Results

Research Questions

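RQ1: Does explicit neighborhood context prediction improve fidelity and diversity in diffusion-based image synthesis?
RQ2: Does predicting neighborhood context as a distribution, rather than decoding every pixel/feature, scale efficiently to large contexts?
RQ3: Is ConPreDiff compatible with both discrete and continuous diffusion backbones, and does it bring gains across vision tasks?
RQ4: How do different neighborhood strides affect generation quality and training efficiency?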
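
To make the cost side of RQ2 and RQ4 concrete, the sketch below counts how many neighbors a point has as the stride grows. It assumes that an s-stride neighborhood means every grid position within Chebyshev distance s of the point, excluding the point itself. That definition, the function name `stride_neighbors`, and the 17x17 grid are illustrative assumptions, since this review does not spell out how the neighborhood is constructed.

```python
from typing import List, Tuple


def stride_neighbors(
    row: int, col: int, stride: int, height: int, width: int
) -> List[Tuple[int, int]]:
    """Grid coordinates within Chebyshev distance `stride` of (row, col).

    The centre point is excluded and positions outside the grid are dropped.
    """
    if stride < 1:
        raise ValueError("stride must be at least 1")
    neighbors = []
    for r in range(row - stride, row + stride + 1):
        for c in range(col - stride, col + stride + 1):
            inside = 0 <= r < height and 0 <= c < width
            if inside and (r, c) != (row, col):
                neighbors.append((r, c))
    return neighbors


# Neighbor counts for an interior point on a 17x17 grid.
counts = {s: len(stride_neighbors(8, 8, s, 17, 17)) for s in (1, 2, 3)}
# Points near the border have fewer neighbors.
corner_count = len(stride_neighbors(0, 0, 1, 17, 17))
```

Under this assumption an interior point has (2s + 1)^2 - 1 neighbors, so the context grows quadratically with the stride. Decoding every neighbor individually would scale accordingly, which is the motivation the review gives for predicting a neighborhood distribution instead.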

Main Findings

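ConPreDiff outperforms prior diffusion and non-diffusion models on text-to-image generation and image inpainting.
Both the discrete and continuous ConPreDiff variants achieve new state-of-the-art FID scores for text-to-image generation on MS-COCO; the abstract reports a zero-shot FID of 6.21.
Applying context prediction to existing diffusion backbones consistently improves generation quality.
Distribution-based neighborhood decoding with a Wasserstein loss makes large-context modeling feasible at a manageable computational cost.
Context prediction improves unconditional image generation, text-to-image generation, and inpainting; the gains are attributed to better preservation of local context.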

This review was generated by AI and reviewed by human editors.