QUICK REVIEW

[Paper Review] Analog Bits: Generating Discrete Data using Diffusion Models with Self-Conditioning

Ting Chen, Ruixiang Zhang|arXiv (Cornell University)|Aug 8, 2022

Multimodal Machine Learning Applications | Cited by 79

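One-Sentence Summary

This paper proposes Bit Diffusion, a method that generates discrete data by modeling binary bits as real-valued analog bits in a continuous diffusion model, together with two techniques, Self-Conditioning and Asymmetric Time Intervals, that improve sample quality. It achieves state-of-the-art results on discrete image generation and is competitive on image captioning.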

ABSTRACT

We present Bit Diffusion: a simple and generic approach for generating discrete data with continuous state and continuous time diffusion models. The main idea behind our approach is to first represent the discrete data as binary bits, and then train a continuous diffusion model to model these bits as real numbers which we call analog bits. To generate samples, the model first generates the analog bits, which are then thresholded to obtain the bits that represent the discrete variables. We further propose two simple techniques, namely Self-Conditioning and Asymmetric Time Intervals, which lead to a significant improvement in sample quality. Despite its simplicity, the proposed approach can achieve strong performance in both discrete image generation and image captioning tasks. For discrete image generation, we significantly improve previous state-of-the-art on both CIFAR-10 (which has 3K discrete 8-bit tokens) and ImageNet-64x64 (which has 12K discrete 8-bit tokens), outperforming the best autoregressive model in both sample quality (measured by FID) and efficiency. For image captioning on MS-COCO dataset, our approach achieves competitive results compared to autoregressive models.
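The encode-then-threshold idea in the abstract can be illustrated with a minimal sketch. The function names, the most-significant-bit-first ordering, and the `scale` parameter are illustrative assumptions rather than the paper's code; the sketch only shows that an integer survives a round trip through real-valued bits as long as no bit crosses the threshold at 0.

```python
def int_to_analog_bits(value, num_bits=8, scale=1.0):
    """Map an integer in [0, 2**num_bits) to analog bits in {-scale, +scale}."""
    if not 0 <= value < 2 ** num_bits:
        raise ValueError("value out of range for the given number of bits")
    # Binary expansion, most significant bit first (an assumed convention).
    bits = [(value >> i) & 1 for i in reversed(range(num_bits))]
    # Shift and scale {0, 1} to {-scale, +scale}.
    return [(2 * b - 1) * scale for b in bits]


def analog_bits_to_int(analog_bits):
    """Threshold real-valued bits at 0 and reassemble the integer."""
    value = 0
    for a in analog_bits:
        value = (value << 1) | (1 if a > 0 else 0)
    return value


pixel = 200
analog = int_to_analog_bits(pixel)
# Hypothetical perturbation standing in for imperfect model output.
noisy = [a + 0.3 * (-1) ** i for i, a in enumerate(analog)]
assert analog_bits_to_int(noisy) == pixel
```

Under this encoding a 32 × 32 × 3 CIFAR-10 image is 3072 tokens of 8 bits each and a 64 × 64 × 3 ImageNet image is 12288 tokens, which matches the "3K" and "12K" token counts quoted in the abstract.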

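Research Motivation and Objectives

Overcome the limitations of autoregressive models on discrete data (scalability and generation speed).
Propose a simple, generic method that applies continuous diffusion models to discrete data via analog bits.
Improve diffusion-based discrete data generation through Self-Conditioning and Asymmetric Time Intervals.
Demonstrate strong performance on discrete image generation (CIFAR-10, ImageNet 64×64) and competitive image captioning results on MS-COCO.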

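Proposed Method

Represent discrete data as binary bits and map them to real-valued analog bits for continuous diffusion modeling.
Train a diffusion model to denoise the analog bits, using an L2 loss on the bit representation.
Decode samples by thresholding the analog bits to recover the discrete variables.
Introduce Self-Conditioning by conditioning the denoiser on its previously generated estimate of x0, which improves sample quality.
Apply Asymmetric Time Intervals during sampling, using unequal time steps (controlled by the td parameter) to improve denoising, especially with few sampling steps.
Use a U-Net architecture with binary encoding schemes for discrete pixels (uint8, Gray code, uint8 rand), and SentencePiece tokenization with 15 analog bits per token for captions.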
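The training side of the method (L2 loss on analog bits, with Self-Conditioning) can be sketched roughly as follows. This is a toy in plain Python, not the authors' implementation: `toy_net` is a hypothetical stand-in for the U-Net denoiser, the cosine schedule `gamma` is an assumed noise schedule, and the 0.5 probability of computing a self-conditioning estimate is an assumed setting.

```python
import math
import random


def gamma(t):
    """Assumed cosine noise schedule: gamma(0) = 1 (clean), gamma(1) = 0 (noise)."""
    return math.cos(0.5 * math.pi * t) ** 2


def toy_net(x_noisy, x_prev_estimate, t):
    """Hypothetical placeholder for the denoiser; returns an estimate of x0.

    A real model would be a learned network that receives the noisy input,
    the previous x0 estimate, and the time t."""
    return [math.tanh(2.0 * xn + xp) for xn, xp in zip(x_noisy, x_prev_estimate)]


def train_loss(x_bits, net=toy_net, self_cond=True, rng=random):
    """Evaluate the L2 loss once for a vector of analog bits in {-1, +1}."""
    # Corrupt the clean analog bits at a random time t in [0, 1).
    t = rng.random()
    g = gamma(t)
    x_noisy = [math.sqrt(g) * x + math.sqrt(1.0 - g) * rng.gauss(0.0, 1.0)
               for x in x_bits]

    # Self-Conditioning: with some probability, first compute an x0 estimate
    # with a zero placeholder, then feed that estimate back as an extra input.
    x_est = [0.0] * len(x_bits)
    if self_cond and rng.random() > 0.5:
        # In an autodiff framework this estimate would be wrapped in a
        # stop-gradient so that only the second pass is trained.
        x_est = net(x_noisy, x_est, t)

    x_pred = net(x_noisy, x_est, t)
    return sum((p - x) ** 2 for p, x in zip(x_pred, x_bits)) / len(x_bits)


loss = train_loss([1.0, -1.0, 1.0, -1.0], rng=random.Random(0))
```

At sampling time the same extra input would carry the x0 estimate from the previous step, so the cost of Self-Conditioning falls mainly on training, where some examples need a second forward pass.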

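Experimental Results

Research Questions

RQ1: Can continuous-state diffusion models reliably generate discrete data when the discrete variables are encoded as analog bits?
RQ2: Do Self-Conditioning and Asymmetric Time Intervals improve the sample quality of Bit Diffusion on image and text tasks?
RQ3: How does Bit Diffusion compare with autoregressive models on discrete image generation and image-conditioned caption generation?
RQ4: Which encoding scheme for discrete data (uint8, Gray code, uint8 rand) offers the best trade-off between performance and complexity?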
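To make the three encodings in RQ4 concrete, here is a small sketch of how they differ. The Gray code conversion is the standard reflected binary code; the reading of "uint8 rand" as a fixed random permutation of the integer-to-code mapping, as well as all function names and the seed, are assumptions for illustration.

```python
import random


def to_gray(n):
    """Standard reflected binary Gray code of a non-negative integer."""
    return n ^ (n >> 1)


def from_gray(g):
    """Invert to_gray by XOR-ing together all right shifts of g."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n


def make_rand_codebook(num_values=256, seed=0):
    """Assumed reading of 'uint8 rand': a fixed random permutation of codes."""
    codes = list(range(num_values))
    random.Random(seed).shuffle(codes)
    return codes


def hamming(a, b):
    """Number of bit positions in which two integer codes differ."""
    return bin(a ^ b).count("1")


# Neighboring intensities 127 and 128 differ in all 8 bits under plain uint8
# but in exactly 1 bit under Gray code.
uint8_distance = hamming(127, 128)
gray_distance = hamming(to_gray(127), to_gray(128))
assert (uint8_distance, gray_distance) == (8, 1)
```

The comparison shows why the choice matters: Gray code keeps adjacent integers one bit apart, while a random codebook removes any relationship between numeric closeness and bit patterns.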

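Key Findings

Using analog bits and 100–1000 sampling steps, Bit Diffusion achieves state-of-the-art FID on discrete CIFAR-10 generation and strong results on ImageNet 64×64.
On CIFAR-10, Bit Diffusion with uint8 encoding reaches an FID of 6.93 (categorical pixels), outperforming autoregressive models.
On ImageNet 64×64, the continuous-pixel diffusion model remains best, while the discrete variants (uint8, Gray code, uint8 rand) reach competitive FIDs; for example, in the class-conditional setting, uint8 scores 4.84 versus 3.43 for continuous pixels.
On MS-COCO image captioning, Bit Diffusion with a randomly initialized decoder achieves BLEU/ROUGE/CIDEr scores competitive with the autoregressive baseline, especially as the number of sampling steps increases (10–40 steps).
Self-Conditioning consistently improves performance on both discrete and continuous diffusion tasks, and Asymmetric Time Intervals bring gains especially when few sampling steps are used.
The generated analog bits converge to a bimodal distribution, which makes recovering the discrete variables by thresholding robust.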
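The effect of Asymmetric Time Intervals on the sampling schedule can be sketched as below. The formula reflects one reading of the paper's sampling pseudocode and should be treated as an assumption; the function name `time_pairs` is hypothetical, and the denoiser call and the DDIM/DDPM update that would consume each pair are omitted.

```python
def time_pairs(steps, td=0.0):
    """Return the (t_now, t_next) pair used at each sampling step.

    t_now is the time passed to the denoiser and t_next is the time the
    state is moved to. With td = 0 the grid is uniform; with td > 0 each
    update moves further toward t = 0 than the next step's t_now."""
    pairs = []
    for step in range(steps):
        t_now = 1.0 - step / steps
        t_next = max(1.0 - (step + 1 + td) / steps, 0.0)
        pairs.append((t_now, t_next))
    return pairs


uniform = time_pairs(4)
asymmetric = time_pairs(4, td=1.0)
assert uniform == [(1.0, 0.75), (0.75, 0.5), (0.5, 0.25), (0.25, 0.0)]
assert asymmetric == [(1.0, 0.5), (0.75, 0.25), (0.5, 0.0), (0.25, 0.0)]
```

With td > 0, each step's t_next is lower than the t_now of the following step, so the denoiser is told a slightly noisier time than the state's nominal one. After the last step, the final x0 estimate would be thresholded at 0 to obtain bits.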

This review was generated by AI and reviewed by a human editor.