QUICK REVIEW

[Paper Review] DiffMS: Diffusion Generation of Molecules Conditioned on Mass Spectra

Montgomery Bohde, Mrunali Manjrekar|ArXiv.org|Feb 13, 2025

Analytical Chemistry and Chromatography | Cited by 7

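One-sentence summary

DiffMS is a formula-restricted diffusion generator of molecules conditioned on mass spectra: it pairs a transformer spectrum encoder with a discrete graph diffusion decoder pretrained on fingerprint-molecule data, and achieves state-of-the-art performance on de novo molecule generation.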

ABSTRACT

Mass spectrometry plays a fundamental role in elucidating the structures of unknown molecules and subsequent scientific discoveries. One formulation of the structure elucidation task is the conditional de novo generation of molecular structure given a mass spectrum. Toward a more accurate and efficient scientific discovery pipeline for small molecules, we present DiffMS, a formula-restricted encoder-decoder generative network that achieves state-of-the-art performance on this task. The encoder utilizes a transformer architecture and models mass spectra domain knowledge such as peak formulae and neutral losses, and the decoder is a discrete graph diffusion model restricted by the heavy-atom composition of a known chemical formula. To develop a robust decoder that bridges latent embeddings and molecular structures, we pretrain the diffusion decoder with fingerprint-structure pairs, which are available in virtually infinite quantities, compared to structure-spectrum pairs that number in the tens of thousands. Extensive experiments on established benchmarks show that DiffMS outperforms existing models on de novo molecule generation. We provide several ablations to demonstrate the effectiveness of our diffusion and pretraining approaches and show consistent performance scaling with increasing pretraining dataset size. DiffMS code is publicly available at https://github.com/coleygroup/DiffMS.

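Research motivation and goals

Advance LC-MS/MS structure elucidation by generating candidate molecules conditioned on spectra.
Introduce a chemical-formula constraint to sharply reduce the search space of plausible structures.
Develop a pretrain-then-finetune framework that exploits abundant fingerprint-structure data to improve end-to-end performance.
Show that end-to-end DiffMS with the formula constraint outperforms baselines on standard benchmarks.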

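Proposed method

Encoder: a transformer-based spectrum encoder that assigns chemical formulae to peaks and models neutral losses; it outputs a spectrum-conditioned embedding.
Decoder: discrete graph diffusion (DiGress-style) that generates the heavy-atom graph under the chemical-formula constraint by denoising a randomly initialized adjacency matrix.
Pretraining: the decoder is trained on 2.8 million fingerprint-molecule pairs to learn the fingerprint-to-structure mapping; the encoder is pretrained to predict fingerprints from spectra.
End-to-end finetuning: the encoder and the diffusion decoder are combined and finetuned on molecule-spectrum pairs.
Training objective: cross-entropy loss on the denoised adjacency matrix; sampling proceeds by marginalization over the diffusion steps.
Evaluation: Top-k accuracy, MCES (maximum common edge subgraph distance; lower is better), and Tanimoto similarity on the NPLIB1 and MassSpecGym benchmarks.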
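
To make the formula constraint and the noise starting point more concrete, here is a minimal sketch. It expands a flat chemical formula into its heavy-atom node list (hydrogens are dropped) and draws a random symmetric bond-type matrix over those nodes. The function names, the four-value bond alphabet, and the uniform sampling are assumptions made for illustration, not the DiffMS implementation. The parser only handles flat formulas without parentheses or charges.

```python
import random
import re

# Hypothetical bond alphabet: 0 = no bond, 1 = single, 2 = double, 3 = triple.
BOND_TYPES = (0, 1, 2, 3)


def heavy_atoms(formula):
    """Expand a flat formula such as 'C6H12O6' into a list of heavy-atom symbols."""
    atoms = []
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        if symbol == "H":
            continue  # hydrogens stay implicit; only heavy atoms become graph nodes
        atoms.extend([symbol] * (int(count) if count else 1))
    return atoms


def random_adjacency(n, rng):
    """Sample a symmetric n x n bond-type matrix with an all-zero diagonal."""
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Uniform sampling is a simplification chosen for this sketch.
            adj[i][j] = adj[j][i] = rng.choice(BOND_TYPES)
    return adj


nodes = heavy_atoms("C2H6O")  # ethanol: two carbons and one oxygen
noisy = random_adjacency(len(nodes), random.Random(0))
```

Because the node list is fixed by the formula, a denoiser in this setting only has to decide the entries of the bond matrix, which is where the search-space reduction comes from.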

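Experimental results

Research questions

RQ1: Can a diffusion-based, formula-constrained generator produce plausible de novo molecules from mass spectra?
RQ2: How much does pretraining on fingerprint-structure data improve end-to-end performance?
RQ3: Does incorporating spectrum-derived chemical-formula constraints improve structural accuracy and similarity over baseline methods?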

Key findings

Dataset	Model	Top-1 accuracy	MCES (Top-1)	Tanimoto (Top-1)	Top-10 accuracy	MCES (Top-10)	Tanimoto (Top-10)
NPLIB1	Spec2Mol ∗	0.00%	27.82	0.12	0.00%	23.13	0.16
NPLIB1	MADGEN	1.0%	70.45	-	1.0%	45.64	-
NPLIB1	MIST + Neuraldecipher ∗	2.32%	12.11	0.35	6.11%	9.91	0.43
NPLIB1	MIST + MSNovelist ∗	5.40%	14.52	0.34	11.04%	10.23	0.44
NPLIB1	DiffMS	8.34%	11.95	0.35	15.44%	9.23	0.47
MassSpecGym	SMILES Transformer ‡	0.00%	79.39	0.03	0.00%	52.13	0.10
MassSpecGym	MIST + MSNovelist ∗	0.00%	45.55	0.06	0.00%	30.13	0.15
MassSpecGym	SELFIES Transformer ‡	0.00%	38.88	0.08	0.00%	26.87	0.13
MassSpecGym	Spec2Mol ∗	0.00%	37.76	0.12	0.00%	29.40	0.16
MassSpecGym	MIST + Neuraldecipher ∗	0.00%	33.19	0.14	0.00%	31.89	0.16
MassSpecGym	Random Generation ‡	0.00%	21.11	0.08	0.00%	18.26	0.11
MassSpecGym	MADGEN	0.8%	74.19	-	1.6%	53.50	-
MassSpecGym	DiffMS	2.30%	18.45	0.28	4.25%	14.73	0.39

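DiffMS achieves state-of-the-art performance on the de novo structure elucidation benchmarks, matching or outperforming every baseline on each reported metric (its only tie is the NPLIB1 Top-1 Tanimoto of 0.35, shared with MIST + Neuraldecipher).
On NPLIB1, DiffMS reaches 8.34% Top-1 and 15.44% Top-10 accuracy, with MCES of 11.95 (Top-1) and 9.23 (Top-10) and Tanimoto similarity of 0.35 (Top-1) and 0.47 (Top-10).
On MassSpecGym, DiffMS reaches 2.30% Top-1 and 4.25% Top-10 accuracy, with MCES of 18.45 (Top-1) and 14.73 (Top-10) and Tanimoto similarity of 0.28 (Top-1) and 0.39 (Top-10).
Both encoder pretraining and larger decoder pretraining datasets yield substantial gains, and performance scales consistently with the size of the decoder pretraining dataset.
Even when it does not recover the exact structure, DiffMS consistently proposes close matches, which supports its practical value as guidance for domain experts.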
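
For readers unfamiliar with the metrics in the table, the following sketch shows how Top-k accuracy and Tanimoto similarity can be computed. It assumes that each structure is represented by a canonical identifier string (for example an InChIKey) and that each fingerprint is a set of on-bit indices. Both representations and the function names are illustrative assumptions, not the benchmarks' evaluation code. MCES is omitted because it needs a graph-matching solver.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    union = len(a | b)
    # Returning 0.0 for two empty fingerprints is a convention chosen for this sketch.
    return len(a & b) / union if union else 0.0


def top_k_accuracy(ranked_predictions, targets, k):
    """Fraction of spectra whose true structure is among the first k candidates."""
    hits = sum(
        1
        for ranked, truth in zip(ranked_predictions, targets)
        if truth in ranked[:k]
    )
    return hits / len(targets)


# Toy data: two spectra, each with three ranked candidate identifiers.
predictions = [["A", "B", "C"], ["D", "E", "F"]]
truths = ["B", "X"]

top1 = top_k_accuracy(predictions, truths, k=1)  # no exact hit at rank 1
top2 = top_k_accuracy(predictions, truths, k=2)  # first spectrum hits at rank 2
similarity = tanimoto({1, 2, 3}, {2, 3, 4})      # 2 shared bits out of 4 total
```

Top-k accuracy only rewards exact matches, which is why the similarity metrics matter for the rows where accuracy is 0.00% but the candidates still differ in quality.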

This review was generated by AI and checked by a human editor.