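[Paper Explained] Cascaded Diffusion Models for High Fidelity Image Generation
This paper shows that cascaded diffusion models can generate high-fidelity class-conditional ImageNet images without classifier guidance, achieving strong FID and CAS scores by applying conditioning augmentation across a multi-resolution cascade.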
We show that cascaded diffusion models are capable of generating high fidelity images on the class-conditional ImageNet generation benchmark, without any assistance from auxiliary image classifiers to boost sample quality. A cascaded diffusion model comprises a pipeline of multiple diffusion models that generate images of increasing resolution, beginning with a standard diffusion model at the lowest resolution, followed by one or more super-resolution diffusion models that successively upsample the image and add higher resolution details. We find that the sample quality of a cascading pipeline relies crucially on conditioning augmentation, our proposed method of data augmentation of the lower resolution conditioning inputs to the super-resolution models. Our experiments show that conditioning augmentation prevents compounding error during sampling in a cascaded model, helping us to train cascading pipelines achieving FID scores of 1.48 at 64x64, 3.52 at 128x128 and 4.88 at 256x256 resolutions, outperforming BigGAN-deep, and classification accuracy scores of 63.02% (top-1) and 84.06% (top-5) at 256x256, outperforming VQ-VAE-2.
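The data flow of the pipeline described in the abstract can be sketched as follows. This is only a shape-level illustration, not the authors' implementation: `base_model`, `super_resolution`, and `sample_cascade` are hypothetical names, the stages are stubs rather than diffusion samplers, and the 32 → 64 → 128 sizes are example values.

```python
import random

def base_model(class_label, size, rng):
    """Hypothetical stand-in for the lowest-resolution diffusion sampler."""
    # A real base model would run a class-conditional reverse diffusion process.
    return [[rng.random() for _ in range(size)] for _ in range(size)]

def super_resolution(low_res, class_label, factor):
    """Hypothetical stand-in for a super-resolution diffusion sampler.

    A real stage would sample a high-res image conditioned on low_res and
    class_label; nearest-neighbour upsampling here only illustrates shapes.
    """
    return [[px for px in row for _ in range(factor)]
            for row in low_res for _ in range(factor)]

def sample_cascade(class_label, base_size=32, factors=(2, 2), seed=0):
    """Run the base stage, then feed each output into the next SR stage."""
    rng = random.Random(seed)
    image = base_model(class_label, base_size, rng)
    for factor in factors:
        image = super_resolution(image, class_label, factor)
    return image

image = sample_cascade(class_label=207)
print(len(image), len(image[0]))  # prints: 128 128
```

Each stage consumes only the previous stage's output and the class label, which is why errors made early in the cascade can compound unless the later stages are trained to tolerate imperfect inputs.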
Research Motivation and Goals
- Demonstrate high-fidelity class-conditional ImageNet generation with cascaded diffusion models without auxiliary classifiers.
- Propose conditioning augmentation to improve sample quality in cascaded pipelines.
- Analyze the impact of multi-resolution cascades and augmentation on sampling quality and training efficiency.
Proposed Method
- Build a pipeline of diffusion models across resolutions (e.g., 32×32 → 64×64 → 128×128/256×256).
- Apply conditioning augmentation during training: add Gaussian noise to the low-resolution conditioning inputs, and optionally blur the conditioning inputs for the higher-resolution stages.
- Train base diffusion models at low resolution and separate super-resolution models that upsample and refine details.
- Employ architectures based on U-Nets with conditioning inputs injected at multiple points.
- Train with simple or hybrid loss formulations to optimize sample quality while maintaining tractable training.
- Amortize conditioning augmentation over the truncation time s, so that a single super-resolution model covers a range of augmentation strengths and the best strength can be searched for after training.
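The Gaussian-noise form of conditioning augmentation listed above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's code: the cosine-style schedule `alpha_bar`, the helper name `augment_low_res`, and the cap `max_s` are all introduced here for illustration. The only ideas taken from the summary are noising the low-resolution conditioning input and drawing the level s at random so that one model is trained across many strengths.

```python
import math
import random

def alpha_bar(s: float) -> float:
    """Fraction of signal kept at noise level s in [0, 1] (assumed cosine-style schedule)."""
    return math.cos(0.5 * math.pi * s) ** 2

def augment_low_res(z, max_s=0.5, rng=random):
    """Noise a low-res conditioning image z, given as a flat list of floats.

    Draws s uniformly from [0, max_s] and applies a Gaussian forward step:
        z_s = sqrt(alpha_bar(s)) * z + sqrt(1 - alpha_bar(s)) * eps
    Returns (z_s, s). In training, s would also be passed to the
    super-resolution model so it knows how noisy its conditioning input is.
    """
    s = rng.uniform(0.0, max_s)
    a = alpha_bar(s)
    z_s = [math.sqrt(a) * v + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for v in z]
    return z_s, s

rng = random.Random(0)
z = [0.1 * i for i in range(16)]  # stand-in for a flattened 4x4 low-res image
z_s, s = augment_low_res(z, max_s=0.5, rng=rng)
```

Because s is resampled for every training example, the augmentation strength used at sampling time becomes a free parameter that can be swept afterwards without retraining.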
Experimental Results
Research Questions
- RQ1: Can cascaded diffusion pipelines achieve competitive or superior sample quality on ImageNet without classifier guidance?
- RQ2: How does conditioning augmentation affect the quality and stability of cascading diffusion models?
- RQ3: What are the effects of different resolutions and truncation strategies on FID and CAS metrics?
- RQ4: Do conditioning augmentation techniques generalize beyond ImageNet to other datasets like LSUN?
Key Findings
- CDM achieves FID scores of 1.48 (64×64), 3.52 (128×128), and 4.88 (256×256) on class-conditional ImageNet, outperforming BigGAN-deep on these resolutions.
- CAS scores at 256×256 reach 63.02% (top-1) and 84.06% (top-5), surpassing VQ-VAE-2 and BigGAN-deep.
- Conditioning augmentation is crucial for high-fidelity samples in cascaded pipelines, mitigating compounding error and exposure bias.
- A cascade with a 32×32 base model followed by two super-resolution stages (e.g., 32×32 → 64×64, then 64×64 → 128×128 or 256×256), trained with proper augmentation, yields state-of-the-art ImageNet results at multiple resolutions without any auxiliary classifier.
- Non-truncated and truncated conditioning augmentation perform similarly, which makes a post-training hyperparameter search over augmentation strengths practical.
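A short way to compare the two variants, written in standard DDPM-style notation. The symbols $z_0$, $z_s$, $\bar\alpha_s$ and $p_\theta$ are assumed here for illustration and are not taken from the summary above; treat this as a sketch of the idea rather than the paper's exact formulation.

```latex
% Non-truncated: finish low-resolution sampling to z_0, then re-noise it
% with the forward process up to level s.
z_s' \sim q(z_s \mid z_0)
     = \mathcal{N}\!\left(\sqrt{\bar\alpha_s}\, z_0,\; (1 - \bar\alpha_s)\, I\right)

% Truncated: stop the low-resolution reverse process early at step s > 0
% and pass the still-noisy z_s to the super-resolution model directly.
z_s \sim p_\theta(z_s)
```

If the low-resolution model is well trained, the distribution of $z_s$ under the reverse process should be close to the forward-process marginal, so both variants hand the super-resolution model similarly distributed inputs. The non-truncated form has the practical advantage that finished low-resolution samples can be stored once and re-noised at any s during the search.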
This explainer was generated by AI and reviewed by a human editor.