QUICK REVIEW

[Paper Review] Cascaded Diffusion Models for High Fidelity Image Generation

Jonathan Ho, Chitwan Saharia | arXiv (Cornell University) | May 30, 2021

Generative Adversarial Networks and Image Synthesis | References: 35 | Cited by: 453

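One-Sentence Summary

This paper shows that cascaded diffusion models can generate high-fidelity class-conditional ImageNet images without classifier guidance, achieving strong FID and CAS scores by applying conditioning augmentation across a multi-resolution cascade.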

ABSTRACT

We show that cascaded diffusion models are capable of generating high fidelity images on the class-conditional ImageNet generation benchmark, without any assistance from auxiliary image classifiers to boost sample quality. A cascaded diffusion model comprises a pipeline of multiple diffusion models that generate images of increasing resolution, beginning with a standard diffusion model at the lowest resolution, followed by one or more super-resolution diffusion models that successively upsample the image and add higher resolution details. We find that the sample quality of a cascading pipeline relies crucially on conditioning augmentation, our proposed method of data augmentation of the lower resolution conditioning inputs to the super-resolution models. Our experiments show that conditioning augmentation prevents compounding error during sampling in a cascaded model, helping us to train cascading pipelines achieving FID scores of 1.48 at 64x64, 3.52 at 128x128 and 4.88 at 256x256 resolutions, outperforming BigGAN-deep, and classification accuracy scores of 63.02% (top-1) and 84.06% (top-5) at 256x256, outperforming VQ-VAE-2.

Research Motivation and Goals

Demonstrate high-fidelity class-conditional ImageNet generation with cascaded diffusion models without auxiliary classifiers.
Propose conditioning augmentation to improve sample quality in cascaded pipelines.
Analyze the impact of multi-resolution cascades and augmentation on sampling quality and training efficiency.

Proposed Method

Build a pipeline of diffusion models across resolutions (e.g., 32×32 → 64×64 → 128×128/256×256).
Use conditioning augmentation on the low-resolution inputs that condition each super-resolution model: Gaussian noise applied to low-resolution inputs during training, with blur as an option at higher resolutions.
Train base diffusion models at low resolution and separate super-resolution models that upsample and refine details.
Employ architectures based on U-Nets with conditioning inputs injected at multiple points.
Train with simple or hybrid loss formulations to optimize sample quality while maintaining tractable training.
Amortize training over the augmentation strength s (the truncation time in the truncated variant), so that s can be tuned by hyperparameter search after training instead of retraining a model per setting.
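
The conditioning augmentation and cascade steps listed above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`gaussian_noise_augment`, `upsample_nearest`, `sample_cascade`), the cosine-shaped noise schedule, and the nearest-neighbour stand-in for a super-resolution diffusion model are all hypothetical choices made for this example.

```python
import numpy as np


def gaussian_noise_augment(x_lowres, s, rng):
    """Noise the low-resolution conditioning image with strength s in [0, 1].

    Uses a variance-preserving mix z = sqrt(a) * x + sqrt(1 - a) * eps.
    The cosine-shaped mapping from s to a is an assumption for illustration.
    """
    alpha_bar = np.cos(0.5 * np.pi * s) ** 2  # s = 0 leaves the image unchanged
    eps = rng.standard_normal(x_lowres.shape)
    return np.sqrt(alpha_bar) * x_lowres + np.sqrt(1.0 - alpha_bar) * eps


def upsample_nearest(cond, s, factor):
    """Hypothetical stand-in for a super-resolution diffusion model.

    A real model would run a reverse diffusion process conditioned on both
    `cond` and the augmentation strength `s`; here we only upsample.
    """
    return cond.repeat(factor, axis=0).repeat(factor, axis=1)


def sample_cascade(base_sampler, sr_stages, s, rng):
    """Run the base model, then each super-resolution stage in order."""
    x = base_sampler(rng)
    for sr_model, factor in sr_stages:
        # Augment the previous stage's output before conditioning on it,
        # so the stage sees inputs similar to those it saw in training.
        cond = gaussian_noise_augment(x, s, rng)
        x = sr_model(cond, s, factor)
    return x


def toy_base_sampler(rng):
    """Hypothetical 32x32 RGB base model output (random pixels in [-1, 1])."""
    return rng.uniform(-1.0, 1.0, size=(32, 32, 3))


rng = np.random.default_rng(0)
stages = [(upsample_nearest, 2), (upsample_nearest, 2)]  # 32 -> 64 -> 128
sample = sample_cascade(toy_base_sampler, stages, s=0.3, rng=rng)
print(sample.shape)
```

Because the super-resolution stand-in takes `s` as an input, the same model can be evaluated at several augmentation strengths after training, which is the point of amortizing over s.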

Experimental Results

Research Questions

RQ1: Can cascaded diffusion pipelines achieve competitive or superior sample quality on ImageNet without classifier guidance?
RQ2: How does conditioning augmentation affect the quality and stability of cascading diffusion models?
RQ3: What are the effects of different resolutions and truncation strategies on FID and CAS metrics?
RQ4: Do conditioning augmentation techniques generalize beyond ImageNet to other datasets like LSUN?

Key Findings

CDM achieves FID scores of 1.48 (64×64), 3.52 (128×128), and 4.88 (256×256) on class-conditional ImageNet, outperforming BigGAN-deep on these resolutions.
CAS scores at 256×256 reach 63.02% (top-1) and 84.06% (top-5), surpassing VQ-VAE-2 and BigGAN-deep.
Conditioning augmentation is crucial for high-fidelity samples in cascaded pipelines, mitigating compounding error and exposure bias.
A cascade of a 32×32 base model followed by super-resolution stages (32×32 → 64×64, then 64×64 → 128×128 or 256×256), with proper augmentation, yields state-of-the-art ImageNet results at multiple resolutions without classifier guidance.
Non-truncated and truncated conditioning augmentation perform similarly, which makes it practical to search over augmentation strengths after training.
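
For readers unfamiliar with the FID metric reported above, it is the Fréchet distance between two Gaussians fitted to image features of real and generated samples (lower is better). The sketch below shows only that final distance computation. It assumes feature vectors have already been extracted; the function names `feature_stats` and `frechet_distance` are hypothetical, and the random arrays stand in for real features.

```python
import numpy as np


def feature_stats(feats):
    """Mean vector and covariance matrix of a (num_samples, dim) feature array."""
    return feats.mean(axis=0), np.cov(feats, rowvar=False)


def frechet_distance(mu1, sigma1, mu2, sigma2):
    """||mu1 - mu2||^2 + Tr(S1) + Tr(S2) - 2 * Tr((S1 S2)^(1/2))."""
    diff = mu1 - mu2
    # Tr((S1 S2)^(1/2)) equals the sum of square roots of the eigenvalues
    # of S1 @ S2, which are real and non-negative for covariance matrices.
    eigvals = np.linalg.eigvals(sigma1 @ sigma2)
    tr_sqrt = np.sum(np.sqrt(np.clip(eigvals.real, 0.0, None)))
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2.0 * tr_sqrt)


rng = np.random.default_rng(0)
real_feats = rng.standard_normal((500, 4))        # placeholder "real" features
fake_feats = rng.standard_normal((500, 4)) + 0.5  # placeholder "generated" features
mu_r, sigma_r = feature_stats(real_feats)
mu_f, sigma_f = feature_stats(fake_feats)
print(frechet_distance(mu_r, sigma_r, mu_f, sigma_f))
```

Identical statistics give a distance of zero, and with equal covariances the distance reduces to the squared distance between the two means.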

This review was generated by AI and reviewed by human editors.