QUICK REVIEW

[Paper Review] DiffMS: Diffusion Generation of Molecules Conditioned on Mass Spectra

Montgomery Bohde, Mrunali Manjrekar|ArXiv.org|Feb 13, 2025

Analytical Chemistry and Chromatography | Citations: 7

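One-Line Summary

DiffMS is a formula-restricted, diffusion-based molecular generation model conditioned on mass spectra. It pairs a transformer spectrum encoder with a discrete graph diffusion decoder pretrained on fingerprint-molecule data, and achieves state-of-the-art de novo generation performance.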

ABSTRACT

Mass spectrometry plays a fundamental role in elucidating the structures of unknown molecules and subsequent scientific discoveries. One formulation of the structure elucidation task is the conditional de novo generation of molecular structure given a mass spectrum. Toward a more accurate and efficient scientific discovery pipeline for small molecules, we present DiffMS, a formula-restricted encoder-decoder generative network that achieves state-of-the-art performance on this task. The encoder utilizes a transformer architecture and models mass spectra domain knowledge such as peak formulae and neutral losses, and the decoder is a discrete graph diffusion model restricted by the heavy-atom composition of a known chemical formula. To develop a robust decoder that bridges latent embeddings and molecular structures, we pretrain the diffusion decoder with fingerprint-structure pairs, which are available in virtually infinite quantities, compared to structure-spectrum pairs that number in the tens of thousands. Extensive experiments on established benchmarks show that DiffMS outperforms existing models on de novo molecule generation. We provide several ablations to demonstrate the effectiveness of our diffusion and pretraining approaches and show consistent performance scaling with increasing pretraining dataset size. DiffMS code is publicly available at https://github.com/coleygroup/DiffMS.

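Motivation and Objectives

Frame structure elucidation from LC-MS/MS as spectrum-conditioned generation of candidate molecules.
Incorporate chemical formula constraints to drastically reduce the search space of plausible structures.
Develop a pretraining and fine-tuning framework that exploits large-scale fingerprint-structure data to improve end-to-end performance.
Show that the formula-restricted, end-to-end DiffMS outperforms standard baselines on established benchmarks.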

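Proposed Method

Encoder: a transformer-based spectrum encoder that assigns chemical formulae to peaks and models neutral losses. It outputs an embedding that conditions generation on the spectrum.
Decoder: a discrete graph diffusion model (DiGress-style) that generates the heavy-atom graph under the chemical formula constraint by denoising a randomly initialized adjacency matrix.
Pretraining: the decoder learns the fingerprint-to-structure mapping on 2.8 million fingerprint-molecule pairs; the encoder is pretrained to predict fingerprints from spectra.
End-to-end fine-tuning: the encoder and the diffusion decoder are combined and fine-tuned on molecule-spectrum pairs.
Training objective: cross-entropy loss on adjacency-matrix denoising; sampling via marginalization over the diffusion steps.
Evaluation: top-k accuracy, MCES, and Tanimoto similarity on the NPLIB1 and MassSpecGym benchmarks.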
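
The decoder setup and training objective described above can be illustrated with a minimal sketch. It is not the DiffMS implementation. It assumes a flat formula string without parentheses or charges, a hypothetical set of five bond classes, and uniform random initialization of the adjacency matrix (the actual noise distribution may differ). The names `heavy_atoms`, `random_adjacency`, and `edge_cross_entropy` are illustrative only.

```python
import math
import random
import re

# Matches an element symbol followed by an optional count, e.g. "C6", "Cl", "O".
ELEMENT_PATTERN = re.compile(r"([A-Z][a-z]?)(\d*)")

# Assumed edge classes for this sketch; index 0 means "no bond".
BOND_TYPES = ["none", "single", "double", "triple", "aromatic"]


def heavy_atoms(formula: str) -> list[str]:
    """Expand a flat formula such as 'C2H6O' into its heavy atoms ['C', 'C', 'O']."""
    atoms = []
    for element, count in ELEMENT_PATTERN.findall(formula):
        if element != "H":  # hydrogens are not nodes of the heavy-atom graph
            atoms.extend([element] * int(count or 1))
    return atoms


def random_adjacency(n: int, rng: random.Random) -> list[list[int]]:
    """Sample a symmetric matrix of bond-type indices as the noisy starting graph."""
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            adj[i][j] = adj[j][i] = rng.randrange(len(BOND_TYPES))
    return adj


def edge_cross_entropy(pred_probs, target_adj) -> float:
    """Mean cross-entropy over the upper-triangular atom pairs.

    pred_probs[i][j] is a probability vector over BOND_TYPES for pair (i, j);
    target_adj[i][j] is the index of the true bond type.
    """
    n = len(target_adj)
    losses = []
    for i in range(n):
        for j in range(i + 1, n):
            p_true = pred_probs[i][j][target_adj[i][j]]
            losses.append(-math.log(max(p_true, 1e-12)))
    return sum(losses) / len(losses) if losses else 0.0


# Ethanol: the formula fixes the node set (two carbons, one oxygen),
# so only the edges remain to be generated.
atoms = heavy_atoms("C2H6O")
noisy = random_adjacency(len(atoms), random.Random(0))
target = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]  # C-C and C-O single bonds
uniform = [[[1 / len(BOND_TYPES)] * len(BOND_TYPES) for _ in atoms] for _ in atoms]
loss = edge_cross_entropy(uniform, target)  # equals log(5) for a uniform predictor
```

Fixing the nodes from the formula means the model only has to predict edges, which is what makes the formula restriction shrink the search space.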

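Experimental Results

Research Questions

RQ1: Can a diffusion-based, formula-restricted generator produce plausible novel molecules from mass spectra?
RQ2: How much does pretraining on fingerprint-structure data improve end-to-end performance?
RQ3: Does incorporating spectrum-derived formula constraints improve structural accuracy and similarity relative to baseline methods?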

Key Findings

Dataset	Model	Top-1 Accuracy	MCES (Top-1)	Tanimoto (Top-1)	Top-10 Accuracy	MCES (Top-10)	Tanimoto (Top-10)
NPLIB1	Spec2Mol ∗	0.00%	27.82	0.12	0.00%	23.13	0.16
NPLIB1	MADGEN	1.0%	70.45	-	1.0%	45.64	-
NPLIB1	MIST + Neuraldecipher ∗	2.32%	12.11	0.35	6.11%	9.91	0.43
NPLIB1	MIST + MSNovelist ∗	5.40%	14.52	0.34	11.04%	10.23	0.44
NPLIB1	DiffMS	8.34%	11.95	0.35	15.44%	9.23	0.47
MassSpecGym	SMILES Transformer ‡	0.00%	79.39	0.03	0.00%	52.13	0.10
MassSpecGym	MIST + MSNovelist ∗	0.00%	45.55	0.06	0.00%	30.13	0.15
MassSpecGym	SELFIES Transformer ‡	0.00%	38.88	0.08	0.00%	26.87	0.13
MassSpecGym	Spec2Mol ∗	0.00%	37.76	0.12	0.00%	29.40	0.16
MassSpecGym	MIST + Neuraldecipher ∗	0.00%	33.19	0.14	0.00%	31.89	0.16
MassSpecGym	Random Generation ‡	0.00%	21.11	0.08	0.00%	18.26	0.11
MassSpecGym	MADGEN	0.8%	74.19	-	1.6%	53.50	-
MassSpecGym	DiffMS	2.30%	18.45	0.28	4.25%	14.73	0.39

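DiffMS achieves state-of-the-art performance on de novo structure elucidation benchmarks, matching or exceeding the baselines on every reported metric (it ties MIST + Neuraldecipher on NPLIB1 top-1 Tanimoto at 0.35). Lower MCES is better; higher accuracy and Tanimoto are better.
On NPLIB1, DiffMS reaches 8.34% top-1 and 15.44% top-10 accuracy, with MCES of 11.95 (top-1) and 9.23 (top-10) and Tanimoto similarity of 0.35 (top-1) and 0.47 (top-10).
On MassSpecGym, DiffMS reaches 2.30% top-1 and 4.25% top-10 accuracy, with MCES of 18.45 (top-1) and 14.73 (top-10) and Tanimoto similarity of 0.28 (top-1) and 0.39 (top-10).
Both encoder pretraining and larger decoder pretraining datasets yield substantial gains, and decoder performance scales consistently with pretraining dataset size.
Even when it fails to recover the exact structure, DiffMS consistently generates close matches, which supports its practical use as guidance for domain experts.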
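
As a rough illustration of how the accuracy and Tanimoto columns in the table can be computed, the sketch below uses plain Python. It assumes that structures are compared through canonical string identifiers, that fingerprints are given as sets of on-bit indices, and that top-k similarity is the best value among the first k candidates. These are assumptions of the sketch, not confirmed details of the benchmark code, and all function names are hypothetical. MCES is omitted because it requires a graph edit distance solver.

```python
def tanimoto(bits_a: set[int], bits_b: set[int]) -> float:
    """Tanimoto similarity |A & B| / |A | B| between two sets of on-bit indices."""
    union = len(bits_a | bits_b)
    # Returning 0.0 for two empty fingerprints is a convention chosen for this sketch.
    return len(bits_a & bits_b) / union if union else 0.0


def top_k_accuracy(ranked: list[list[str]], truths: list[str], k: int) -> float:
    """Fraction of spectra whose true structure is among the first k candidates."""
    hits = sum(truth in candidates[:k] for candidates, truth in zip(ranked, truths))
    return hits / len(truths)


def best_tanimoto_at_k(candidate_bits: list[set[int]], true_bits: set[int], k: int) -> float:
    """Highest Tanimoto similarity to the true fingerprint among the first k candidates."""
    return max((tanimoto(bits, true_bits) for bits in candidate_bits[:k]), default=0.0)


# Two spectra with three ranked candidates each (identifiers are placeholders).
ranked = [["A", "B", "C"], ["D", "E", "F"]]
truths = ["B", "X"]
top1 = top_k_accuracy(ranked, truths, k=1)  # no exact match at rank 1
top3 = top_k_accuracy(ranked, truths, k=3)  # "B" is recovered within the top 3

similarity = tanimoto({1, 2, 3}, {2, 3, 4})  # 2 shared bits out of 4 distinct bits
```

A best-of-k similarity explains why a model can score 0% exact-match accuracy yet still show a nonzero Tanimoto value: near misses count toward similarity but not toward accuracy.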

This review was generated by AI and checked by a human editor.