[Paper Walkthrough] Semantic Image Synthesis via Diffusion Models
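This paper proposes the Semantic Diffusion Model (SDM), a DDPM-based framework that processes the semantic layout and the noisy image separately, conditions the decoder with SPADE-style normalization, and adds classifier-free guidance, improving both fidelity and diversity in semantic image synthesis.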
Denoising Diffusion Probabilistic Models (DDPMs) have achieved remarkable success in various image generation tasks compared with Generative Adversarial Nets (GANs). Recent work on semantic image synthesis mainly follows the de facto GAN-based approaches, which may lead to unsatisfactory quality or diversity of generated images. In this paper, we propose a novel framework based on DDPM for semantic image synthesis. Previous conditional diffusion models directly feed the semantic layout and the noisy image as input to a U-Net structure, which may not fully leverage the information in the input semantic mask. In contrast, our framework processes the semantic layout and the noisy image differently: it feeds the noisy image to the encoder of the U-Net structure and the semantic layout to the decoder through multi-layer spatially-adaptive normalization operators. To further improve the generation quality and semantic interpretability in semantic image synthesis, we introduce the classifier-free guidance sampling strategy, which acknowledges the scores of an unconditional model in the sampling process. Extensive experiments on four benchmark datasets demonstrate the effectiveness of our proposed method, achieving state-of-the-art performance in terms of fidelity (FID) and diversity (LPIPS). Our code and pretrained models are available at https://github.com/WeilunWang/semantic-diffusion-model.
Research Motivation and Objectives
- Develop a diffusion-model framework for semantic image synthesis that surpasses GAN-based methods in fidelity and diversity.
- Leverage separate processing of semantic masks and noisy inputs to better utilize semantic information.
- Improve sampling quality and semantic correspondence via classifier-free guidance.
- Demonstrate strong performance on Cityscapes, ADE20K, CelebAMask-HQ, and COCO-Stuff datasets.
Proposed Method
- Use a conditional denoising diffusion network (SDM) where the noisy image goes through the encoder while the semantic layout is injected into the decoder via multi-layer spatially-adaptive normalization (SPADE).
- Adopt SDEResblocks in the encoder with attention and timestep-aware scaling for denoising.
- Inject semantic information into the decoder with SPADE-like conditioning (SDDResblock) to guide denoising.
- Train with the simple denoising loss plus a variational-lower-bound term that supervises the predicted variances (L_simple + lambda * L_vlb).
- Apply classifier-free guidance by mixing conditional and unconditional predictions during sampling to boost fidelity and semantic alignment: epsilon_theta(y_t|x) + s * (epsilon_theta(y_t|x) - epsilon_theta(y_t|empty)), where s is the guidance scale and "empty" denotes the null condition.
- Optionally perform multimodal, diverse generation by leveraging the stochastic diffusion process.
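The two conditioning mechanisms in the list above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `spade_modulate` and `guided_noise`, the single 1x1-convolution weights `w_gamma` and `w_beta`, and the per-channel normalization are all assumptions made for illustration. The sketch only shows the general idea of spatially-adaptive modulation (normalize, then scale and shift per pixel using the layout) and of the guidance formula.

```python
import numpy as np


def spade_modulate(features, layout_onehot, w_gamma, w_beta, eps=1e-5):
    """SPADE-style modulation (illustrative sketch, not the SDM code).

    features:        (C, H, W) decoder activations
    layout_onehot:   (K, H, W) one-hot semantic layout, already resized to H x W
    w_gamma, w_beta: (C, K) weights of a hypothetical 1x1 convolution that maps
                     the layout to per-pixel scale and shift (an assumption;
                     a real model would likely use deeper conv layers)
    """
    # Parameter-free per-channel normalization (stand-in for the real norm layer).
    mean = features.mean(axis=(1, 2), keepdims=True)
    std = features.std(axis=(1, 2), keepdims=True)
    normalized = (features - mean) / (std + eps)

    # A 1x1 convolution is a matrix product over the class axis.
    gamma = np.einsum("ck,khw->chw", w_gamma, layout_onehot)
    beta = np.einsum("ck,khw->chw", w_beta, layout_onehot)

    # Scale and shift each pixel according to its semantic class.
    return gamma * normalized + beta


def guided_noise(eps_cond, eps_uncond, scale):
    """Classifier-free guidance: push the conditional noise estimate further
    away from the unconditional one by the guidance scale."""
    return eps_cond + scale * (eps_cond - eps_uncond)


# Usage: a 2-class layout whose right half belongs to class 1.
rng = np.random.default_rng(0)
feats = rng.normal(size=(3, 4, 4))
labels = np.zeros((4, 4), dtype=int)
labels[:, 2:] = 1
onehot = np.stack([(labels == k).astype(float) for k in range(2)])
w_gamma = np.ones((3, 2))                           # same scale for both classes
w_beta = np.tile(np.array([0.0, 5.0]), (3, 1))      # class 1 gets a +5 shift
out = spade_modulate(feats, onehot, w_gamma, w_beta)
```

With a guidance scale of 0, `guided_noise` returns the conditional prediction unchanged; larger scales weight the layout more heavily at the cost of sample diversity.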
Experimental Results
Research Questions
- RQ1: Can a diffusion-based framework outperform GAN-based methods in fidelity and diversity for semantic image synthesis?
- RQ2: Does separating the conditioning information (semantic mask) from the noisy image improve semantic relevance and visual quality?
- RQ3: What is the impact of classifier-free guidance on fidelity and semantic alignment in conditional diffusion models?
- RQ4: How does SDM perform on four benchmark datasets in terms of FID, LPIPS, and mIoU-based semantic interpretability?
Key Findings
| Method | CelebAMask-HQ FID | CelebAMask-HQ LPIPS | Cityscapes FID | Cityscapes LPIPS | ADE20K FID | ADE20K LPIPS | COCO-Stuff FID | COCO-Stuff LPIPS |
|---|---|---|---|---|---|---|---|---|
| Pix2PixHD [48] | 38.5 | 0 | 95.0 | 0 | 81.8 | 0 | 111.5 | 0 |
| SPADE [31] | 29.2 | 0 | 71.8 | 0 | 22.6 | 0 | 33.9 | 0 |
| DAGAN [44] | 29.1 | 0 | 60.3 | 0 | 31.9 | 0 | n/a | 0 |
| SCGAN [50] | 20.8 | 0 | 49.5 | 0 | 29.3 | 0 | 18.1 | 0 |
| CLADE [43] | 30.6 | 0 | 57.2 | 0 | 35.4 | 0 | 29.2 | 0 |
| CC-FPSE [24] | n/a | n/a | 54.3 | 0.026 | 31.7 | 0.078 | 19.2 | 0.098 |
| GroupDNet [57] | 25.9 | 0.365 | 47.3 | 0.101 | 41.7 | 0.230 | n/a | n/a |
| INADE [42] | 21.5 | 0.415 | 44.3 | 0.295 | 35.2 | 0.459 | n/a | n/a |
| OASIS [41] | n/a | n/a | 47.7 | 0.327 | 28.3 | 0.286 | 17.0 | 0.328 |
| SDM (Ours) | 18.8 | 0.422 | 42.1 | 0.362 | 27.5 | 0.524 | 15.9 | 0.518 |
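- In the table above, SDM has the highest LPIPS on all four benchmarks and the lowest FID on CelebAMask-HQ, Cityscapes, and COCO-Stuff; on ADE20K its FID (27.5) is second to the 22.6 listed for SPADE.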
- Embedding semantic layouts via multi-layer SPADE-like conditioning in the decoder markedly improves fidelity and semantic relevance over simple concatenation.
- Classifier-free guidance substantially improves mIoU and FID with a modest change in LPIPS, yielding better semantic alignment.
- SDM provides high-quality, diverse semantic image synthesis, including multimodal generation and semantic editing of real images.
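The mIoU metric mentioned in the findings compares a label map predicted from a generated image against the input semantic layout. A minimal sketch of the metric itself follows, assuming both label maps are already available as integer arrays; how the predicted map is obtained (for example, which segmentation network is run on the generated image) is not specified in this summary and is not shown. The name `mean_iou` is hypothetical.

```python
import numpy as np


def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union between two integer label maps.

    Classes that appear in neither map are skipped so they do not
    distort the average.
    """
    pred = np.asarray(pred).ravel()
    target = np.asarray(target).ravel()
    ious = []
    for c in range(num_classes):
        p, t = pred == c, target == c
        union = np.logical_or(p, t).sum()
        if union == 0:
            continue  # class absent from both maps
        ious.append(np.logical_and(p, t).sum() / union)
    return float(np.mean(ious)) if ious else float("nan")


# Usage: one of four pixels is mislabeled.
score = mean_iou([[0, 0], [1, 1]], [[0, 1], [1, 1]], num_classes=2)
```

In this toy case class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12.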
This walkthrough was generated by AI and reviewed by a human editor.