QUICK REVIEW

[Paper Review] stMCDI: Masked Conditional Diffusion Model with Graph Neural Network for Spatial Transcriptomics Data Imputation

Xiaoyu Li, Wenwen Min | arXiv (Cornell University) | Mar 16, 2024

Gene expression and cancer classification | Cited by 6

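One-Sentence Summary

stMCDI combines a graph neural network encoder with a conditional score-based diffusion model and a masked self-supervised training strategy to impute missing values in spatial transcriptomics data while preserving the overall data distribution as far as possible; the authors report state-of-the-art results on multiple datasets.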

ABSTRACT

Spatially resolved transcriptomics represents a significant advancement in single-cell analysis by offering both gene expression data and their corresponding physical locations. However, this high degree of spatial resolution entails a drawback, as the resulting spatial transcriptomic data at the cellular level is notably plagued by a high incidence of missing values. Furthermore, most existing imputation methods either overlook the spatial information between spots or compromise the overall gene expression data distribution. To address these challenges, our primary focus is on effectively utilizing the spatial location information within spatial transcriptomic data to impute missing values, while preserving the overall data distribution. We introduce stMCDI, a novel conditional diffusion model for spatial transcriptomics data imputation, which employs a denoising network trained using randomly masked data portions as guidance, with the unmasked data serving as conditions. Additionally, it utilizes a GNN encoder to integrate the spatial position information, thereby enhancing model performance. The results obtained from spatial transcriptomics datasets elucidate the performance of our methods relative to existing approaches.

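Motivation and Objectives

Use spatial location information to improve imputation of spatial transcriptomics data without distorting the gene expression distribution.
Develop a self-supervised training strategy that masks part of the data to create pseudo-labels that guide learning.
Integrate a graph neural network encoder with a conditional score-based diffusion model for robust imputation.
Demonstrate state-of-the-art performance on multiple real-world spatial transcriptomics datasets.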

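Proposed Method

Build a spot graph from the spatial coordinates by connecting each spot to its five nearest neighbors, which yields the adjacency matrix.
Encode spatial and expression information with a graph convolutional network (GCN) to obtain latent spot representations.
Apply a masked self-supervised scheme: randomly mask part of the data, then mask the latent representation again to guide diffusion-based denoising.
Use a conditional score-based diffusion model in which the known (unmasked) data serve as the prior condition for imputing the missing values.
Augment the UNet backbone with cross-attention to incorporate the conditioning information and learn the gradient of the data distribution.
Optimize a variational lower bound tailored to conditional diffusion; the loss encourages accurate reconstruction of the masked values.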
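
The steps above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`knn_adjacency`, `gcn_layer`, `random_mask`, `noise_masked_targets`, `masked_noise_loss`), the toy data, the 30% mask ratio, the graph symmetrization, and the DDPM-style noising formula are all assumptions made for illustration. The sketch also masks the expression matrix directly rather than re-masking a latent representation, and it replaces the cross-attention UNet with a placeholder prediction.

```python
import numpy as np

def knn_adjacency(coords, k=5):
    """Symmetric k-nearest-neighbor adjacency built from spot coordinates."""
    n = coords.shape[0]
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))   # pairwise Euclidean distances
    np.fill_diagonal(dist, np.inf)             # a spot is not its own neighbor
    nbrs = np.argsort(dist, axis=1)[:, :k]     # indices of the k closest spots
    adj = np.zeros((n, n))
    adj[np.repeat(np.arange(n), k), nbrs.ravel()] = 1.0
    return np.maximum(adj, adj.T)              # symmetrize (an assumption)

def gcn_layer(adj, x, weight):
    """One GCN propagation step: ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    a_hat = adj + np.eye(adj.shape[0])         # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    a_norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(a_norm @ x @ weight, 0.0)

def random_mask(shape, mask_ratio, rng):
    """Boolean mask; True marks entries hidden from the model (pseudo-labels)."""
    return rng.random(shape) < mask_ratio

def noise_masked_targets(x0, mask, alpha_bar_t, rng):
    """DDPM-style forward noising applied to masked entries only (assumed form)."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps
    return np.where(mask, x_t, 0.0), eps

def masked_noise_loss(eps_pred, eps, mask):
    """Mean squared noise-prediction error restricted to masked entries."""
    return float(np.mean((eps_pred - eps)[mask] ** 2))

rng = np.random.default_rng(0)
coords = rng.uniform(0, 10, size=(50, 2))              # toy spot coordinates
expr = rng.poisson(2.0, size=(50, 20)).astype(float)   # toy expression matrix

adj = knn_adjacency(coords, k=5)
weight = rng.normal(scale=0.1, size=(20, 8))           # untrained GCN weights
latent = gcn_layer(adj, expr, weight)

mask = random_mask(expr.shape, 0.3, rng)
observed = np.where(mask, 0.0, expr)                   # condition: unmasked values
noisy, eps = noise_masked_targets(expr, mask, alpha_bar_t=0.5, rng=rng)

eps_pred = np.zeros_like(eps)   # placeholder for a denoiser conditioned on `observed`
loss = masked_noise_loss(eps_pred, eps, mask)
```

In a real model, `eps_pred` would come from a trained denoising network that receives `noisy`, the diffusion step, and the conditioning information; restricting the loss to masked entries is what turns random masking into a self-supervised signal.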

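Experimental Results

Research Questions

RQ1: Does introducing spatial location information through the GNN encoder yield better imputation quality than methods that ignore spatial context?
RQ2: Does combining the masked self-supervised training strategy with the conditional diffusion model reliably impute missing values without distorting the overall data distribution?
RQ3: How does the choice of graph neural network architecture affect imputation performance on spatial transcriptomics data?
RQ4: How do different masking strategies and masking stages affect imputation accuracy?
RQ5: On real spatial transcriptomics datasets from diverse tissues and species, does stMCDI impute missing values better than existing baselines?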

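Key Findings

stMCDI outperforms fourteen baseline methods on six real-world spatial transcriptomics datasets across four evaluation metrics (PCC, Cosine, RMSE, MAE).
The method ranks best consistently, with notable gains on several datasets (for example MOB, HBC, HP, HO, ML, and MK).
Ablation studies show the effectiveness of the double-masking strategy and the importance of using a GCN encoder rather than other GNN variants for this task.
Among the graph encoders tested, GCN proved the most effective at integrating spatial and expression information.
The conditional diffusion framework, which uses known data as the prior, improves alignment with the data distribution and raises imputation accuracy.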
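
For readers unfamiliar with the four metrics named above, the following sketch shows one common way to compute them on held-out (masked) entries. The function name `imputation_metrics` and the toy vectors are hypothetical, and treating "Cosine" as cosine similarity (rather than cosine distance) is an assumption; this summary does not state which form the paper reports.

```python
import numpy as np

def imputation_metrics(truth, imputed):
    """PCC, cosine similarity, RMSE, and MAE between true and imputed values."""
    t = np.asarray(truth, dtype=float).ravel()
    p = np.asarray(imputed, dtype=float).ravel()
    err = p - t
    return {
        "PCC": float(np.corrcoef(t, p)[0, 1]),   # Pearson correlation
        "Cosine": float(t @ p / (np.linalg.norm(t) * np.linalg.norm(p))),
        "RMSE": float(np.sqrt(np.mean(err ** 2))),
        "MAE": float(np.mean(np.abs(err))),
    }

truth = np.array([1.0, 2.0, 3.0, 4.0])     # toy ground-truth values
imputed = np.array([1.0, 2.0, 3.0, 5.0])   # toy imputed values
scores = imputation_metrics(truth, imputed)
```

With these toy vectors only the last entry is off by one, so the RMSE is 0.5 and the MAE is 0.25. Higher PCC and cosine similarity indicate better imputation, while lower RMSE and MAE do.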

This review was generated by AI and reviewed by a human editor.