QUICK REVIEW

[Paper Review] Improved Vector Quantized Diffusion Models

Zhicong Tang, Shuyang Gu|arXiv (Cornell University)|May 31, 2022

Generative Adversarial Networks and Image Synthesis|Cited by 23

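One-Sentence Summary

This paper improves VQ-Diffusion for text-to-image generation by introducing discrete classifier-free guidance and a high-quality inference strategy, which address the posterior and joint distribution issues, and achieves state-of-the-art FID on multiple datasets.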

ABSTRACT

Vector quantized diffusion (VQ-Diffusion) is a powerful generative model for text-to-image synthesis, but sometimes can still generate low-quality samples or weakly correlated images with text input. We find these issues are mainly due to the flawed sampling strategy. In this paper, we propose two important techniques to further improve the sample quality of VQ-Diffusion. 1) We explore classifier-free guidance sampling for discrete denoising diffusion model and propose a more general and effective implementation of classifier-free guidance. 2) We present a high-quality inference strategy to alleviate the joint distribution issue in VQ-Diffusion. Finally, we conduct experiments on various datasets to validate their effectiveness and show that the improved VQ-Diffusion surpasses the vanilla version by large margins. We achieve an 8.44 FID score on MSCOCO, surpassing VQ-Diffusion by 5.42 FID score. When trained on ImageNet, we dramatically improve the FID score from 11.89 to 4.83, demonstrating the superiority of our proposed techniques.

Research Motivation and Objectives

Identify and address the quality gaps in VQ-Diffusion for text-to-image synthesis.
Develop a discrete classifier-free guidance mechanism to enforce alignment with input conditions.
Identify and mitigate the joint distribution issue during sampling with a high-quality inference strategy.
Validate improvements across multiple datasets including MSCOCO, CC, CUB-200, and ImageNet.
Provide open-source code to enable replication and further research.

Proposed Methods

Propose discrete classifier-free guidance to incorporate posterior constraint and improve conditional generation without sacrificing tractability.
Derive and implement a target combining p(x|y) and p(y|x) to better align outputs with input conditions, including a learnable conditional prior.
Introduce a high-quality inference strategy that reduces the number of tokens sampled per step and uses a purity prior to bias sampling toward high-confidence tokens.
Show that sampling fewer tokens per step mitigates joint-distribution issues and improves sample fidelity.
Leverage a reparameterization approach to estimate discrete token distributions directly during inference.
Evaluate on standard text-to-image benchmarks with ablations to demonstrate the impact of posterior constraint and joint-distribution mitigation.
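
As a rough illustration of the guidance idea listed above, the sketch below combines conditional and unconditional token predictions in log space. It assumes the target has the form log p(x|y) + s log p(y|x); by Bayes' rule, p(y|x) is proportional to p(x|y) / p(x), so this equals log p(x) + (s + 1)(log p(x|y) - log p(x)) up to a constant. The function names, the `scale` parameter, and the exact weighting are assumptions made for illustration, not the authors' released implementation. The unconditional logits are assumed to come from the same network fed a null or learnable condition vector.

```python
import numpy as np

def log_softmax(logits, axis=-1):
    """Numerically stable log-softmax."""
    shifted = logits - np.max(logits, axis=axis, keepdims=True)
    return shifted - np.log(np.sum(np.exp(shifted), axis=axis, keepdims=True))

def guided_log_probs(cond_logits, uncond_logits, scale):
    """Combine conditional and unconditional token predictions.

    cond_logits, uncond_logits: arrays of shape (num_tokens, codebook_size).
    scale: guidance strength s (hypothetical name); s = 0 recovers the
           purely conditional distribution.
    """
    log_p_cond = log_softmax(cond_logits)
    log_p_uncond = log_softmax(uncond_logits)
    # log p(x) + (s + 1) * (log p(x|y) - log p(x)), then renormalize
    # so that each row is again a valid distribution over codebook entries.
    combined = log_p_uncond + (scale + 1.0) * (log_p_cond - log_p_uncond)
    return log_softmax(combined)
```

With `scale = 0` the function returns the conditional distribution unchanged; larger values push probability mass toward tokens that the condition makes more likely than the unconditional prediction does.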

Experimental Results

Research Questions

RQ1: Does incorporating posterior constraint through discrete classifier-free guidance improve the text-image alignment and image quality of VQ-Diffusion?
RQ2: Can a high-quality inference strategy that reduces token-level independence and uses a purity prior alleviate joint distribution issues in discrete diffusion models?
RQ3: How do the proposed methods affect FID, QS, and CLIP scores across MSCOCO, CC, CUB-200, and ImageNet settings?
RQ4: Is the learnable classifier-free guidance more effective than fixed null-vector conditioning in discrete diffusion?
RQ5: Do the improvements generalize to large-scale internet-sourced datasets (ITHQ-200M) and balanced CC subsets?

Key Findings

Method	MSCOCO FID	CUB-200 FID	CC FID	ITHQ-200M FID
VQ-Diffusion	13.86	10.32	33.65	25.87
VQ-Diffusion + prior	13.79	10.21	33.09	25.15

Discrete classifier-free guidance improves FID and quality scores compared to the baseline VQ-Diffusion on MSCOCO and CC.
A learnable classifier-free guidance vector yields better performance than null conditioning, indicating stronger posterior constraint.
High-quality inference improves sampling when inference steps exceed training steps, with clearer gains as the number of steps increases.
Purity-prior sampling yields FID gains across MSCOCO, CUB-200, CC, and ITHQ-200M without extra training or inference cost.
On MSCOCO, Improved VQ-Diffusion achieves 8.44 FID, surpassing vanilla VQ-Diffusion by 5.42; on ImageNet, FID improves from 11.89 to 4.83 with the proposed methods.
The approach attains results at or near the state of the art across several datasets and remains compatible with zero-shot or lightly fine-tuned setups.
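
The purity-prior finding above can be illustrated with a small sketch. It assumes that "purity" means the highest predicted probability at a masked position, and that each step decodes only the few masked positions with the highest purity. This is a simplified reading for illustration; the helper names, the mask id of -1, and the selection rule are hypothetical and are not taken from the authors' code.

```python
import numpy as np

def select_by_purity(probs, is_masked, num_to_reveal):
    """Pick which masked positions to decode at this step.

    probs: (num_tokens, codebook_size) predicted token distributions.
    is_masked: (num_tokens,) boolean, True where the token is still masked.
    num_to_reveal: how many masked positions to decode now.
    Returns indices of the chosen positions, most confident first.
    """
    purity = probs.max(axis=-1)                     # confidence per position
    purity = np.where(is_masked, purity, -np.inf)   # ignore decoded positions
    k = min(num_to_reveal, int(is_masked.sum()))
    order = np.argsort(-purity, kind="stable")      # descending purity
    return order[:k]

def reveal_step(probs, tokens, is_masked, num_to_reveal, rng):
    """Sample tokens only at the selected high-purity positions."""
    chosen = select_by_purity(probs, is_masked, num_to_reveal)
    new_tokens = tokens.copy()
    new_masked = is_masked.copy()
    for i in chosen:
        new_tokens[i] = rng.choice(probs.shape[1], p=probs[i])
        new_masked[i] = False
    return new_tokens, new_masked
```

Decoding fewer positions per step means fewer tokens are sampled independently of one another at once, which is the intuition behind the joint-distribution finding; the trade-off is a larger number of inference steps.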

This review was generated by AI and reviewed by human editors.