[论文解读] Denoising Diffusion Probabilistic Models
这篇论文引入扩散概率模型用于高质量图像合成,将它们与去噪分数匹配和朗之万动力学联系起来,并展示在 CIFAR-10 上的 FID 达到最新水平,以及在 CelebA-HQ 与 LSUN 上的高质量样本。
DiffuCpG 1. Introduction In this study, we used a generative AI diffusion model to address missing methylation data. We trained the model with Whole-Genome Bisulfite Sequencing data from 26 acute myeloid leukemia samples and validated it with Reduced Representation Bisulfite Sequencing data from 93 myelodysplastic syndrome and 13 normal samples. Additional testing included data from the Illumina 450k methylation array and Single-Cell Reduced Representation Bisulfite Sequencing on HepG2 cells. Our model, DiffuCpG, outperformed previous methods by integrating a broader range of genomic features, utilizing both short- and long-range interactions without increasing input complexity. It demonstrated superior accuracy, scalability, and versatility across various tissues, diseases, and technologies, providing predictions in both binary and continuous methylation states. In this repository, we deposit the code used to build the diffusion models along with necessary example datasets to train and test a diffusion model for methylation imputation purposes. Docker Usage Install Docker Install Docker using the following link:https://docs.docker.com/engine/install/Recommended system specs: Debian 12 bookworm with 16GB RAM or more.Make sure you have the latest Nvidia GPU driver installed and docker can access your Nvidia GPU. Run Docker images with Tissue-specific Models docker pull yay135/diffucpg_tssUse our example to generate input samples with Hi-C matrix and CIS (Confidence Interval Cross Sample) data.docker run -it yay135/diffucpg_tssthenpython generate_train_test_samples.py The tissue-specific models (pytorch) are for CD34+ cells, GBM and BRCA, they are stored in folders named "model*" in the image. Run the Tissue specific modelsdocker run -it yay135/diffucpg_tssthenpython batch_run.py Run Docker images Example Models docker pull yay135/diffucpgIf you do not have a GPU enabled system, pull a CPU-only imagedocker pull yay135/diffucpg_cpuprepare your input data directory, use the following command to print a example input data directorydocker run --rm yay135/diffucpg -e trueassume your data directory name is "input_data"in windowsdocker run --gpus all -v .\input_data\:/data --rm yay135/diffucpgin unix or linuxdocker run --gpus all -v ./input_data:/data --rm yay135/diffucpg Other docker options -d or --device : select which cuda device to run with, default is 0-m or --mingcpg : scan your methyl array, limit only imputing windows with at least m non-missing methyl values, default is m=10-o or --overlap : set number of impute epochs, shift window locations between epochs, get mean imputed values for each CpG location, default is 2example:docker run --gpus all -v ./input_data:/data --rm yay135/diffucpg -d 1 -m 5 -o 3use cuda device 1, min number of non-missing methyl values in a window is 5, overlap epochs 3 The following tutorials are for non-docker usages. 2. Data and Models Example datasets are available for download using "gdown.sh". The example datasets only contain WGBS methylation data. The model is the DDPM diffusion model, the repository contains a complete implementation for 1-dimensional input. Please refer to https://arxiv.org/abs/2006.11239 and https://huggingface.co/blog/annotated-diffusion for more details. 3. How to use 3.1 System Requirements The number of steps in the diffusion process is set to 2000. Imputing a sample requires 2000 steps. Gpu acceleration is preferred. 16GB of RAM is required. The code is fully tested and operational on the following platform: Distributor ID: DebianDescription: Debian GNU/Linux 12 (bookworm)Release: 12Codename: bookworm 3.2 Clone the Current Project Run the following command to clone the project.git clone https://github.com/yay135/DiffuCpG.git 3.4 Configure Environment Make sure you have the following software installed in your system:Python 3.9+Pytorch 2.0.1+ 3.4 Run Training and Testing python run.pyThe script will download necessary data and install dependencies automatically. 4 Data and Script Details 4.1 RAW Data The methylation arrays downloaded are in the folder "raw", each file is a methylation array. The first 2 columns are "chromosome" and "location". The assembly used for mapping in our project is the "GRCH37 primary assembly". It is also downloaded automatically. The rest of the columns in each file are methylation levels(required) and other biological data (optional) you wish to incorporate to enhance the model. These files in the raw folder are the initial inputs for pipeline,if you wish to use your own data, it must be configured as such before running the pipeline. 4.2 Generate Sample Use script "generate_samples.py" to generate samples for training and testing.The model can not directly read and impute a methylation array file. Instead, each methylation array is divided into windows, each window is 1kb (1000 base pairs) in length, and each training testing sample is generated from a window. Each sample contains at least 5 channels. the first 4 is the sequence one-hot encoding, the 5th is the methylation data. If a base pair location is not a CpG location, the methylation data value for it is "-1". If a CpG's methylation data is missing or waiting for imputaion, its value is also "-1". Other biological data can be added as extra channels. Check out example raw files in the folder "raw" to form your own datasets for training and testing sample generation.For each raw file in the "raw" folder, the first 3 columns are chr, loc, and methylation.The rest of the columns are treated as additional channels and will be added to each sample during generation. '-d' or '--folder': specify raw data folder'-i' or '--index' : which column in a raw file is the methylation array'-t' or '--tol' : how many missing methylation value is tolerated(we recommend 0 for generating training samples and -1 for generating testing samples, 0 will force the script to only select from windows with no missings, -1 will tolerate missing as much as possible.)'-c' or '--chr' : limit which chromosome to use, default is "chr#" to use all chromosomes'-w' or '--winsize' : what window size to use, default is 1000 '-m' or '--mincpg': force generate from window to have a minimum number of CpGs, default is 10 '-n' or '--nsample': number of samples to generate per chromosome '-p' or '--output': samples output folder, default is "out" Use script "generate_samples_concat.py" to generate samples from long-range interacting windows such as Hi-C interactions or computed correlation.Check out the example long range file in the folder "data" to form your own long-range interacting windows for sample generation and concatenation. 4.3 Training Script Use diffusion.py to train and test a DDPM model using the generated samples'-t' or '--train_folder' : the folder containing the training samples'-f' or '--model_folder' : the model folder, will be created if it does not exist'-w' or '--win_size' : window size of each sample, default is 1000'-c' or '--channel': channel size of each sample'-d' or '--cuda_device' : if you have multiple cuda gpus, select which gpu to use, default is 0"-e" or "--epoch" : how many epochs for training, default is 2000"-s" or "--earlystop" : whether to use "early stopping" during training, default is False"-p" or "--patience" : patience for early stopping, default is 10 4.4 Imputation Use diffusion_inpainting.py to perform imputation on generated samples.'-t' or '--test_folder' : the folder containing samples for imputation'-o' or '--out_folder': imputed output folder name, default="inpainting_out"'-w' or '--win_size' : window size of each sample, default is 1000'-c' or '--channel': channel size of each sample'-d' or '--cuda_device' : if you have multiple cuda gpus, select which gpu to use, default is 0 Team If you have any questions or concerns about the project, please contact the following team member: Fengyao Yan fxy134@miami.edu
研究动机与目标
- 以简单、可训练的潜变量模型类别来激励扩散模型用于图像合成。
- 展示去噪分数匹配视角能够产生实用的训练目标和采样过程。
- 在 CIFAR-10、CelebA-HQ 和 LSUN 数据集上演示高质量样本生成。
- 分析扩散模型的速率失真特性和渐进解码能力。
提出的方法
- 定义一个前向马尔可夫链固定的扩散模型,逐步加入高斯噪声。
- 以高斯条件分布参数化反向过程,并将 Sigma 选为固定常数;研究 mu 的参数化。
- 证明 epsilon-预测器反向参数化与带退火 Langevin 动力学的去噪分数匹配等价。
- 引入简化训练目标 L_simple,使其在不同噪声水平上与去噪分数匹配对齐。
- 将最终解码器离散化以实现离散数据似然,并在 CIFAR-10、CelebA-HQ、LSUN 上进行评估。
实验结果
研究问题
- RQ1扩散概率模型是否能在标准基准上产生与现有生成模型相当或更高质量的图像?
- RQ2不同反向过程参数化和训练目标对样本质量与似然性的实际影响?
- RQ3扩散框架如何与去噪分数匹配和 Langevin 动力学相关,这一联系带来哪些好处?
- RQ4扩散模型在有损压缩中的速率-失真特性和渐进解码能力?
- RQ5扩散模型是否通过渐进采样与解码实现自回归式解码的好处?
主要发现
| 模型 | IS | FID | NLL 测试(训练) |
|---|---|---|---|
| Ours (L_simple) | 9.46±0.11 | 3.17 | ≤3.75 (3.72) |
| Ours (L, fixed isotropic Σ) | 7.67±0.13 | 13.51 | ≤3.70 (3.69) |
- 在 CIFAR-10 无条件生成上,使用简化训练目标实现了 Inception Score 9.46 和 FID 3.17。
- 我们最佳模型在无条件 CIFAR-10 结果在样本质量方面超越了许多现有方法,即使与类别条件模型相比也如此。
- 当用变分界训练时,epsilon-预测的反向参数化与 mu-预测相当;使用简化目标时显著更好。
- 该方法在 LSUN 256×256 和高视觉质量的 CelebA-HQ 样本上提供有竞争力或更优的 FID 分数。
- 模型的速率-失真分析显示大多数无损位描述了难以察觉的图像细节,表明具备强烈的渐进有损编码特性。
- 渐进解码实验表明在采样过程中大尺度图像特征较早出现,细节在后续出现。
更好的研究,从现在开始
从论文设计到论文写作,大幅缩短您的研究时间。
无需绑定信用卡
本解读由 AI 生成,并经人工编辑审核。