QUICK REVIEW

[Paper Review] Cascaded Diffusion Models for High Fidelity Image Generation

Jonathan Ho, Chitwan Saharia | arXiv (Cornell University) | May 30, 2021
Generative Adversarial Networks and Image Synthesis | References: 35 | Cited by: 453
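One-Sentence Summary

This paper shows that cascaded diffusion models can generate high-fidelity class-conditional ImageNet images without classifier guidance, achieving strong FID and CAS scores by applying conditioning augmentation across a multi-resolution cascade.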

ABSTRACT

We show that cascaded diffusion models are capable of generating high fidelity images on the class-conditional ImageNet generation benchmark, without any assistance from auxiliary image classifiers to boost sample quality. A cascaded diffusion model comprises a pipeline of multiple diffusion models that generate images of increasing resolution, beginning with a standard diffusion model at the lowest resolution, followed by one or more super-resolution diffusion models that successively upsample the image and add higher resolution details. We find that the sample quality of a cascading pipeline relies crucially on conditioning augmentation, our proposed method of data augmentation of the lower resolution conditioning inputs to the super-resolution models. Our experiments show that conditioning augmentation prevents compounding error during sampling in a cascaded model, helping us to train cascading pipelines achieving FID scores of 1.48 at 64x64, 3.52 at 128x128 and 4.88 at 256x256 resolutions, outperforming BigGAN-deep, and classification accuracy scores of 63.02% (top-1) and 84.06% (top-5) at 256x256, outperforming VQ-VAE-2.

Motivation and Objectives

  • Demonstrate high-fidelity class-conditional ImageNet generation with cascaded diffusion models without auxiliary classifiers.
  • Propose conditioning augmentation to improve sample quality in cascaded pipelines.
  • Analyze the impact of multi-resolution cascades and augmentation on sampling quality and training efficiency.

Proposed Method

  • Build a pipeline of diffusion models across resolutions (e.g., 32×32 → 64×64 → 128×128/256×256).
  • Apply conditioning augmentation during training: Gaussian noise on the low-resolution conditioning inputs, with optional Gaussian blur for super-resolution at higher resolutions.
  • Train base diffusion models at low resolution and separate super-resolution models that upsample and refine details.
  • Employ architectures based on U-Nets with conditioning inputs injected at multiple points.
  • Train with simple or hybrid loss formulations to optimize sample quality while maintaining tractable training.
  • Amortize conditioning augmentation over the truncation time s, so that a single super-resolution model covers a range of augmentation strengths and s can be tuned after training.
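
The Gaussian-noise form of conditioning augmentation can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the variance-preserving mixing formula, nearest-neighbour upsampling, channel-wise concatenation, and all function names are assumptions made for the example, and the sampling range for s is arbitrary.

```python
import numpy as np

def gaussian_conditioning_augmentation(x_low, s, rng):
    """Corrupt a low-res conditioning image with Gaussian noise of strength s in [0, 1].

    Assumption: a variance-preserving mix, z = sqrt(1 - s) * x + sqrt(s) * eps.
    """
    eps = rng.standard_normal(x_low.shape)
    return np.sqrt(1.0 - s) * x_low + np.sqrt(s) * eps

def upsample_nearest(x, factor):
    """Nearest-neighbour upsampling of an (H, W, C) array (an assumed choice)."""
    return x.repeat(factor, axis=0).repeat(factor, axis=1)

def build_sr_input(z_high, x_low, s, rng):
    """Concatenate the noisy high-res state with the augmented, upsampled low-res image."""
    z_low = gaussian_conditioning_augmentation(x_low, s, rng)
    cond = upsample_nearest(z_low, z_high.shape[0] // x_low.shape[0])
    return np.concatenate([z_high, cond], axis=-1)

rng = np.random.default_rng(0)
x_low = rng.uniform(-1.0, 1.0, size=(32, 32, 3))   # stand-in 32x32 low-res image
z_high = rng.standard_normal((64, 64, 3))          # stand-in noisy 64x64 state
s = rng.uniform(0.0, 0.5)                          # random strength per training example
sr_input = build_sr_input(z_high, x_low, s, rng)
print(sr_input.shape)  # (64, 64, 6)
```

Drawing s at random for each training example, and also passing s to the network, is what would let one super-resolution model serve many augmentation strengths at sampling time.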

Experimental Results

Research Questions

  • RQ1: Can cascaded diffusion pipelines achieve competitive or superior sample quality on ImageNet without classifier guidance?
  • RQ2: How does conditioning augmentation affect the quality and stability of cascading diffusion models?
  • RQ3: What are the effects of different resolutions and truncation strategies on FID and CAS metrics?
  • RQ4: Do conditioning augmentation techniques generalize beyond ImageNet to other datasets like LSUN?

Key Findings

  • The cascaded diffusion model (CDM) achieves FID scores of 1.48 (64×64), 3.52 (128×128), and 4.88 (256×256) on class-conditional ImageNet, outperforming BigGAN-deep.
  • CAS scores at 256×256 reach 63.02% (top-1) and 84.06% (top-5), surpassing VQ-VAE-2 and BigGAN-deep.
  • Conditioning augmentation is crucial for high-fidelity samples in cascaded pipelines, mitigating compounding error and exposure bias.
  • A cascade of a 32×32 base model and two super-resolution stages (e.g., 32×32 → 64×64, then 64×64 → 128×128 or 256×256), trained with proper augmentation, yields state-of-the-art ImageNet results at multiple resolutions without classifier guidance.
  • Truncated and non-truncated conditioning augmentation perform similarly, so the augmentation strength can be searched in practice after training.
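
The post-training search over augmentation strength mentioned above might look like the sketch below. It is a generic grid search under stated assumptions: `sample_fn` and `score_fn` are hypothetical callables standing in for "sample from the cascade at strength s" and "compute a lower-is-better metric such as FID", and the toy stand-ins exist only so the snippet runs.

```python
import numpy as np

def search_augmentation_strength(sample_fn, score_fn, candidates):
    """Score samples at each candidate strength s; lower score is better."""
    scores = {s: score_fn(sample_fn(s)) for s in candidates}
    best_s = min(scores, key=scores.get)
    return best_s, scores

# Toy stand-ins (hypothetical): real use would sample from the trained cascade
# and score the samples against reference data.
rng = np.random.default_rng(0)
target = rng.uniform(-1.0, 1.0, size=(16, 8, 8, 3))

def toy_sample_fn(s):
    # Pretend the sampling error is smallest at s = 0.2.
    return target + abs(s - 0.2) * np.ones_like(target)

def toy_score_fn(samples):
    return float(np.mean((samples - target) ** 2))

best_s, scores = search_augmentation_strength(
    toy_sample_fn, toy_score_fn, [0.0, 0.1, 0.2, 0.3]
)
print(best_s)  # 0.2
```

The loop never retrains anything: every candidate s reuses the same trained models, which is what amortizing over s is meant to allow.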


This review was generated by AI and checked by a human editor.