QUICK REVIEW

[Paper Review] Denoising Diffusion Probabilistic Models

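Yan, Steven | arXiv (Cornell University) | June 19, 2020
Generative Adversarial Networks and Image Synthesis | References: 70 | Citations: 5,577
One-line summary

This paper introduces diffusion probabilistic models for high-quality image synthesis, connects them to denoising score matching and Langevin dynamics, and reports state-of-the-art FID on CIFAR-10 along with strong sample quality on CelebA-HQ and LSUN.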

ABSTRACT

DiffuCpG

1. Introduction

In this study, we used a generative AI diffusion model to address missing methylation data. We trained the model with Whole-Genome Bisulfite Sequencing (WGBS) data from 26 acute myeloid leukemia samples and validated it with Reduced Representation Bisulfite Sequencing data from 93 myelodysplastic syndrome and 13 normal samples. Additional testing included data from the Illumina 450k methylation array and Single-Cell Reduced Representation Bisulfite Sequencing on HepG2 cells. Our model, DiffuCpG, outperformed previous methods by integrating a broader range of genomic features, utilizing both short- and long-range interactions without increasing input complexity. It demonstrated superior accuracy, scalability, and versatility across various tissues, diseases, and technologies, providing predictions in both binary and continuous methylation states.

In this repository, we deposit the code used to build the diffusion models, along with the example datasets needed to train and test a diffusion model for methylation imputation.

Docker Usage

Install Docker

Install Docker using the following link: https://docs.docker.com/engine/install/

Recommended system specs: Debian 12 bookworm with 16GB RAM or more. Make sure you have the latest Nvidia GPU driver installed and that Docker can access your Nvidia GPU.

Run Docker images with Tissue-specific Models

    docker pull yay135/diffucpg_tss

Use our example to generate input samples with Hi-C matrix and CIS (Confidence Interval Cross Sample) data:

    docker run -it yay135/diffucpg_tss

then

    python generate_train_test_samples.py

The tissue-specific models (PyTorch) are for CD34+ cells, GBM, and BRCA. They are stored in folders named "model*" in the image.
Run the tissue-specific models:

    docker run -it yay135/diffucpg_tss

then

    python batch_run.py

Run Docker images Example Models

    docker pull yay135/diffucpg

If you do not have a GPU-enabled system, pull a CPU-only image:

    docker pull yay135/diffucpg_cpu

Prepare your input data directory. Use the following command to print an example input data directory:

    docker run --rm yay135/diffucpg -e true

Assume your data directory is named "input_data".

In Windows:

    docker run --gpus all -v .\input_data\:/data --rm yay135/diffucpg

In Unix or Linux:

    docker run --gpus all -v ./input_data:/data --rm yay135/diffucpg

Other docker options

  • -d or --device: select which CUDA device to run with; default is 0
  • -m or --mingcpg: scan your methylation array and impute only windows with at least m non-missing methylation values; default is m=10
  • -o or --overlap: set the number of imputation epochs; window locations shift between epochs, and the mean imputed value is taken for each CpG location; default is 2

Example:

    docker run --gpus all -v ./input_data:/data --rm yay135/diffucpg -d 1 -m 5 -o 3

This uses CUDA device 1, requires at least 5 non-missing methylation values per window, and runs 3 overlap epochs.

The following tutorials are for non-Docker usage.

2. Data and Models

Example datasets are available for download using "gdown.sh". The example datasets contain only WGBS methylation data. The model is the DDPM diffusion model; the repository contains a complete implementation for 1-dimensional input. Please refer to https://arxiv.org/abs/2006.11239 and https://huggingface.co/blog/annotated-diffusion for more details.

3. How to use

3.1 System Requirements

The number of steps in the diffusion process is set to 2000, so imputing a sample requires 2000 steps. GPU acceleration is preferred. 16GB of RAM is required.
The code is fully tested and operational on the following platform:

    Distributor ID: Debian
    Description: Debian GNU/Linux 12 (bookworm)
    Release: 12
    Codename: bookworm

3.2 Clone the Current Project

Run the following command to clone the project:

    git clone https://github.com/yay135/DiffuCpG.git

3.3 Configure Environment

Make sure you have the following software installed on your system:

  • Python 3.9+
  • PyTorch 2.0.1+

3.4 Run Training and Testing

    python run.py

The script will download the necessary data and install dependencies automatically.

4 Data and Script Details

4.1 RAW Data

The downloaded methylation arrays are in the folder "raw"; each file is one methylation array. The first 2 columns are "chromosome" and "location". The assembly used for mapping in our project is the "GRCh37 primary assembly", which is also downloaded automatically. The remaining columns in each file are methylation levels (required) and any other biological data (optional) you wish to incorporate to enhance the model. The files in the raw folder are the initial inputs for the pipeline. If you wish to use your own data, it must be formatted this way before running the pipeline.

4.2 Generate Sample

Use the script "generate_samples.py" to generate samples for training and testing. The model cannot directly read and impute a methylation array file. Instead, each methylation array is divided into windows, each window is 1kb (1000 base pairs) in length, and each training or testing sample is generated from one window. Each sample contains at least 5 channels. The first 4 are the one-hot encoding of the sequence, and the 5th is the methylation data. If a base pair location is not a CpG location, its methylation value is "-1". If a CpG's methylation data is missing or awaiting imputation, its value is also "-1". Other biological data can be added as extra channels.
Check out the example raw files in the folder "raw" to form your own datasets for training and testing sample generation. For each raw file in the "raw" folder, the first 3 columns are chr, loc, and methylation. The remaining columns are treated as additional channels and are added to each sample during generation.

  • '-d' or '--folder': specify the raw data folder
  • '-i' or '--index': which column in a raw file is the methylation array
  • '-t' or '--tol': how many missing methylation values are tolerated. We recommend 0 for generating training samples and -1 for generating testing samples: 0 forces the script to select only windows with no missing values, and -1 tolerates as many missing values as possible.
  • '-c' or '--chr': limit which chromosome to use; default is "chr#", which uses all chromosomes
  • '-w' or '--winsize': window size to use; default is 1000
  • '-m' or '--mincpg': require each generated window to have a minimum number of CpGs; default is 10
  • '-n' or '--nsample': number of samples to generate per chromosome
  • '-p' or '--output': samples output folder; default is "out"

Use the script "generate_samples_concat.py" to generate samples from long-range interacting windows, such as Hi-C interactions or computed correlations. Check out the example long-range file in the folder "data" to form your own long-range interacting windows for sample generation and concatenation.
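The sample layout described in section 4.2 can be sketched as follows. This is a minimal illustration, not the repository's "generate_samples.py": the function name `encode_window`, the channel order A, C, G, T, and the input format (a sequence string plus a dict from 0-based offset to methylation level) are all assumptions made for the example.

```python
# Hypothetical sketch of the 5-channel window layout described above.
# The channel order (A, C, G, T) and the input format are assumptions.
BASES = "ACGT"
MISSING = -1.0  # used for non-CpG positions and for CpGs awaiting imputation


def encode_window(sequence, methylation):
    """Return 5 channels (lists of floats), each as long as `sequence`.

    sequence:    DNA string for one window (e.g. 1000 bases).
    methylation: dict mapping 0-based offsets to a methylation level in
                 [0, 1]; positions with no known value are simply absent.
    """
    seq = sequence.upper()
    # Channels 0-3: one-hot encoding of the sequence (all zeros for e.g. "N").
    channels = [[1.0 if base == b else 0.0 for base in seq] for b in BASES]
    # Channel 4: methylation level where known, -1 everywhere else.
    channels.append([float(methylation.get(i, MISSING)) for i in range(len(seq))])
    return channels


window = encode_window("ACGTCG", {1: 0.8})  # the CpG at offset 4 is missing
print(window[4])  # [-1.0, 0.8, -1.0, -1.0, -1.0, -1.0]
```

Extra biological tracks would be appended as further channels in the same way, one list per track.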
4.3 Training Script

Use diffusion.py to train and test a DDPM model using the generated samples.

  • '-t' or '--train_folder': the folder containing the training samples
  • '-f' or '--model_folder': the model folder; it is created if it does not exist
  • '-w' or '--win_size': window size of each sample; default is 1000
  • '-c' or '--channel': channel size of each sample
  • '-d' or '--cuda_device': if you have multiple CUDA GPUs, select which GPU to use; default is 0
  • '-e' or '--epoch': number of training epochs; default is 2000
  • '-s' or '--earlystop': whether to use early stopping during training; default is False
  • '-p' or '--patience': patience for early stopping; default is 10

4.4 Imputation

Use diffusion_inpainting.py to perform imputation on generated samples.

  • '-t' or '--test_folder': the folder containing samples for imputation
  • '-o' or '--out_folder': imputed output folder name; default is "inpainting_out"
  • '-w' or '--win_size': window size of each sample; default is 1000
  • '-c' or '--channel': channel size of each sample
  • '-d' or '--cuda_device': if you have multiple CUDA GPUs, select which GPU to use; default is 0

Team

If you have any questions or concerns about the project, please contact the following team member: Fengyao Yan, fxy134@miami.edu
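The "--overlap" behavior mentioned under the Docker options (shift window locations between imputation epochs, then average the imputed values per CpG location) might look roughly like the sketch below. Everything here is an assumption for illustration: the shifting rule, the function names, and in particular `mean_fill`, which is a trivial stand-in for the diffusion model and not the method used by diffusion_inpainting.py.

```python
MISSING = -1.0


def mean_fill(window):
    """Stand-in imputer (NOT the diffusion model): fill gaps with the window mean."""
    observed = [v for v in window if v != MISSING]
    fill = sum(observed) / len(observed) if observed else 0.5
    return [fill if v == MISSING else v for v in window]


def impute_with_overlap(values, impute_window, win_size, n_epochs):
    """Average imputed values over several shifted window tilings.

    values:        list of methylation levels, -1 where missing.
    impute_window: function taking one window (list) and returning it imputed.
    """
    n = len(values)
    sums = [0.0] * n
    counts = [0] * n
    for epoch in range(n_epochs):
        # Assumed shifting rule: offset the tiling by a fraction of a window.
        shift = (epoch * win_size) // n_epochs
        for start in range(-shift, n, win_size):
            lo, hi = max(start, 0), min(start + win_size, n)
            imputed = impute_window(values[lo:hi])
            for i, v in zip(range(lo, hi), imputed):
                if values[i] == MISSING:
                    sums[i] += v
                    counts[i] += 1
    # Observed values are kept as-is; missing ones get the mean over epochs.
    return [sums[i] / counts[i] if values[i] == MISSING else values[i]
            for i in range(n)]


print(impute_with_overlap([0.0, -1.0, 1.0, 1.0], mean_fill, win_size=2, n_epochs=2))
# [0.0, 0.5, 1.0, 1.0]
```

In the example, the missing value is imputed as 0.0 in the first tiling and as 1.0 in the shifted tiling, so the averaged result is 0.5.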

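Research motivation and goals

  • Motivates diffusion models as a simple, trainable family of latent variable models for image synthesis.
  • Shows that the denoising score matching perspective yields a practical training objective and sampling procedure.
  • Demonstrates high-quality sample generation on the CIFAR-10, CelebA-HQ, and LSUN datasets.
  • Analyzes the rate-distortion behavior and progressive decoding capability of diffusion models.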

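Proposed method

  • Defines a fixed forward Markov chain that gradually adds Gaussian noise.
  • Parameterizes the reverse process with Gaussian conditionals, fixes Sigma to constants, and explores parameterizations of mu.
  • Shows that the epsilon-prediction parameterization of the reverse process is equivalent to denoising score matching combined with annealed Langevin dynamics.
  • Introduces a simplified training objective, L_simple, that matches denoising score matching across all noise levels.
  • Makes the final decoder step yield discrete data likelihoods, and evaluates on CIFAR-10, CelebA-HQ, and LSUN.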
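The forward process and the simplified objective listed above can be sketched in a few lines. This is a minimal, stdlib-only illustration on a toy 1-D vector, not the paper's implementation: the helper names are hypothetical, `eps_model` stands in for the trained network, and the loss is averaged over dimensions for readability. The linear schedule endpoints (1e-4 to 0.02 with T = 1000) follow the values reported in the paper.

```python
import math
import random


def linear_betas(T, beta_start=1e-4, beta_end=0.02):
    """Linear variance schedule beta_1..beta_T."""
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]


def alpha_bars(betas):
    """Cumulative products alpha_bar_t = prod_{s<=t} (1 - beta_s)."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out


def q_sample(x0, t, abar, eps):
    """Closed-form forward sample: x_t = sqrt(abar_t) x0 + sqrt(1 - abar_t) eps."""
    a = math.sqrt(abar[t])
    s = math.sqrt(1.0 - abar[t])
    return [a * x + s * e for x, e in zip(x0, eps)]


def l_simple(x0, abar, eps_model, rng):
    """One Monte Carlo term of L_simple: squared error between the true
    noise and the predicted noise at a uniformly sampled timestep."""
    t = rng.randrange(len(abar))
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = q_sample(x0, t, abar, eps)
    pred = eps_model(x_t, t)
    return sum((e - p) ** 2 for e, p in zip(eps, pred)) / len(x0)


abar = alpha_bars(linear_betas(1000))
rng = random.Random(0)
# Placeholder "network" that always predicts zero noise.
zero_model = lambda x_t, t: [0.0] * len(x_t)
print(l_simple([0.5, -0.5, 0.0], abar, zero_model, rng))
```

In practice the placeholder would be replaced by a neural network conditioned on the timestep, and the loss would be averaged over minibatches and minimized with gradient descent.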

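Experimental results

Research questions

  • RQ1: Can diffusion probabilistic models generate high-quality images on standard benchmarks compared with existing generative models?
  • RQ2: What practical effect do different reverse-process parameterizations and training objectives have on sample quality and likelihood?
  • RQ3: How does the diffusion framework relate to denoising score matching and Langevin dynamics, and what benefits follow from that connection?
  • RQ4: What are the rate-distortion characteristics and progressive decoding capabilities of diffusion models for lossy compression?
  • RQ5: Do diffusion models support autoregressive-like decoding benefits through progressive sampling and decoding?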

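Key results

Model | IS | FID | NLL Test (Train)
Ours (L_simple) | 9.46±0.11 | 3.17 | ≤3.75 (3.72)
Ours (L, fixed isotropic Σ) | 7.67±0.13 | 13.51 | ≤3.70 (3.69)

  • Unconditional CIFAR-10 generation with the simplified training objective achieves an Inception Score of 9.46 and an FID of 3.17.
  • With the best model, the unconditional CIFAR-10 results surpass many existing methods in sample quality and remain strong even when compared with class-conditional models.
  • The epsilon-prediction reverse parameterization performs about as well as mu-prediction when trained with the variational bound, and much better when trained with the simplified objective.
  • The method produces visually high-quality samples with competitive FID scores on LSUN 256×256 and CelebA-HQ.
  • The rate-distortion analysis shows that most of the lossless codelength bits describe imperceptible image details, which suggests strong progressive lossy coding behavior.
  • Progressive decoding experiments show that large-scale image features appear early in the sampling process and fine details appear later.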
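Progressive decoding relies on estimating the clean sample from an intermediate noisy state. Inverting the closed-form forward step x_t = sqrt(abar_t) x_0 + sqrt(1 - abar_t) eps gives the estimate sketched below. The function name `predict_x0` is hypothetical, the inputs are toy 1-D lists, and the "predicted" noise is simply the true noise, so the example only checks the algebra rather than any trained model.

```python
import math


def predict_x0(x_t, eps_pred, abar_t):
    """Estimate x_0 from x_t and a noise prediction:
    x0_hat = (x_t - sqrt(1 - abar_t) * eps_pred) / sqrt(abar_t)."""
    a = math.sqrt(abar_t)
    s = math.sqrt(1.0 - abar_t)
    return [(x - s * e) / a for x, e in zip(x_t, eps_pred)]


# Toy check: noise a known x_0, then recover it using the exact noise.
x0 = [0.5, -0.25]
eps = [1.0, -2.0]
abar_t = 0.3  # arbitrary illustrative value of alpha_bar at some timestep
x_t = [math.sqrt(abar_t) * x + math.sqrt(1.0 - abar_t) * e for x, e in zip(x0, eps)]
print(predict_x0(x_t, eps, abar_t))  # approximately [0.5, -0.25]
```

With a learned noise predictor the estimate is imperfect at high noise levels, which is consistent with coarse structure appearing first and fine detail emerging only as the noise level drops.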

This review was generated by AI and reviewed by a human editor.