QUICK REVIEW

[Paper Review] DiffMS: Diffusion Generation of Molecules Conditioned on Mass Spectra

Montgomery Bohde, Mrunali Manjrekar | arXiv.org | 2025-02-13

Analytical Chemistry and Chromatography | Citations: 7

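One-Line Summary

DiffMS is a formula-constrained, diffusion-based molecule generator conditioned on mass spectra. It uses a transformer as the spectrum encoder and a discrete graph diffusion decoder pretrained on fingerprint-molecule data, and achieves state-of-the-art de novo generation performance.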

ABSTRACT

Mass spectrometry plays a fundamental role in elucidating the structures of unknown molecules and subsequent scientific discoveries. One formulation of the structure elucidation task is the conditional de novo generation of molecular structure given a mass spectrum. Toward a more accurate and efficient scientific discovery pipeline for small molecules, we present DiffMS, a formula-restricted encoder-decoder generative network that achieves state-of-the-art performance on this task. The encoder utilizes a transformer architecture and models mass spectra domain knowledge such as peak formulae and neutral losses, and the decoder is a discrete graph diffusion model restricted by the heavy-atom composition of a known chemical formula. To develop a robust decoder that bridges latent embeddings and molecular structures, we pretrain the diffusion decoder with fingerprint-structure pairs, which are available in virtually infinite quantities, compared to structure-spectrum pairs that number in the tens of thousands. Extensive experiments on established benchmarks show that DiffMS outperforms existing models on de novo molecule generation. We provide several ablations to demonstrate the effectiveness of our diffusion and pretraining approaches and show consistent performance scaling with increasing pretraining dataset size. DiffMS code is publicly available at https://github.com/coleygroup/DiffMS.

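Motivation and Objectives

Frame structure elucidation as the generation of candidate molecules from LC-MS/MS spectra.
Introduce a chemical-formula constraint that sharply reduces the search space of plausible structures.
Develop a pretraining-finetuning framework that exploits abundant fingerprint-structure data to improve end-to-end performance.
Show that the formula-constrained, end-to-end DiffMS outperforms baselines on standard benchmarks.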
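
To make the formula constraint concrete, here is a minimal sketch of how a known chemical formula can fix the node set of the graph to be generated. It is an illustration only and is not taken from the DiffMS code: the helper name `heavy_atom_nodes` is hypothetical, and the parser assumes a simple formula without parentheses, charges, or isotope labels.

```python
import re
from collections import Counter


def heavy_atom_nodes(formula: str) -> list[str]:
    """Expand a simple formula such as 'C9H8O4' into a list of heavy-atom labels.

    Assumption: no parentheses, charges, or isotopes. Hydrogens are dropped
    because only the heavy-atom composition constrains the generated graph.
    """
    counts = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    counts.pop("H", None)  # hydrogens are not graph nodes in this sketch
    return [el for el in sorted(counts) for _ in range(counts[el])]


nodes = heavy_atom_nodes("C9H8O4")
n = len(nodes)
# With the atoms fixed, only the n*(n-1)/2 possible bonds remain to be decided.
num_edge_slots = n * (n - 1) // 2
print(nodes, num_edge_slots)
```

With the node labels fixed in advance, a generator only has to decide the bond type for each atom pair, which is the sense in which the formula shrinks the search space.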

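Proposed Method

Encoder: a transformer-based spectrum encoder that models peak formula assignments and neutral losses; it outputs a spectrum-conditioned embedding.
Decoder: a discrete graph diffusion (DiGress-style) decoder that generates the heavy-atom graph under the chemical-formula constraint; it denoises a randomly initialized adjacency matrix.
Pretraining: the decoder is trained on 2.8 million fingerprint-molecule pairs to learn the fingerprint-to-structure mapping; the encoder is pretrained to predict fingerprints from spectra.
End-to-end finetuning: the encoder and diffusion decoder are combined and finetuned on molecule-spectrum pairs.
Training objective: cross-entropy loss on adjacency-matrix denoising; sampling by marginalizing over diffusion steps.
Evaluation: top-k accuracy, MCES, and Tanimoto similarity on the NPLIB1 and MassSpecGym benchmarks.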
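
The decoder and training-objective items above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the DiffMS implementation: the bond-type label set, the `marginal` frequencies, the keep-or-resample noising rule, and the stand-in "denoiser" are all made up for the example. Atom labels are left untouched because the formula fixes them; only the edges are noised and scored.

```python
import math
import random

# Assumed label set for edges; index 0 means "no bond".
BOND_TYPES = ["none", "single", "double", "triple", "aromatic"]


def noise_edges(edges, alpha_bar, marginal, rng):
    """Keep each edge label with probability alpha_bar, else resample from `marginal`."""
    noisy = {}
    for pair, label in edges.items():
        if rng.random() < alpha_bar:
            noisy[pair] = label
        else:
            noisy[pair] = rng.choices(range(len(marginal)), weights=marginal)[0]
    return noisy


def edge_cross_entropy(pred_probs, clean_edges):
    """Mean cross-entropy of predicted edge distributions against the clean labels."""
    total = 0.0
    for pair, label in clean_edges.items():
        total -= math.log(pred_probs[pair][label])
    return total / len(clean_edges)


rng = random.Random(0)
n_atoms = 4
# Clean upper-triangular adjacency for a 4-atom chain: single bonds 0-1, 1-2, 2-3.
clean = {(i, j): (1 if j == i + 1 else 0)
         for i in range(n_atoms) for j in range(i + 1, n_atoms)}
marginal = [0.90, 0.07, 0.02, 0.005, 0.005]  # made-up bond-type frequencies

noisy = noise_edges(clean, alpha_bar=0.5, marginal=marginal, rng=rng)

# Stand-in for a trained denoiser: 0.8 on the clean label, 0.05 on each other label.
pred = {pair: [0.8 if k == label else 0.05 for k in range(len(BOND_TYPES))]
        for pair, label in clean.items()}
loss = edge_cross_entropy(pred, clean)
print(noisy, round(loss, 4))
```

In a real model the predicted distributions would come from a network that sees the noisy graph and the spectrum embedding; here they are hard-coded so that the loss computation can be checked by hand.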

Experimental Results

Research Questions

RQ1: Can a diffusion-based, formula-constrained generator produce plausible de novo molecules from mass spectra?
RQ2,

Key Results

Dataset	Model	Top-1 Accuracy	MCES (Top-1)	Tanimoto (Top-1)	Top-10 Accuracy	MCES (Top-10)	Tanimoto (Top-10)
NPLIB1	Spec2Mol ∗	0.00%	27.82	0.12	0.00%	23.13	0.16
NPLIB1	MADGEN	1.0%	70.45	-	1.0%	45.64	-
NPLIB1	MIST + Neuraldecipher ∗	2.32%	12.11	0.35	6.11%	9.91	0.43
NPLIB1	MIST + MSNovelist ∗	5.40%	14.52	0.34	11.04%	10.23	0.44
NPLIB1	DiffMS	8.34%	11.95	0.35	15.44%	9.23	0.47
MassSpecGym	SMILES Transformer ‡	0.00%	79.39	0.03	0.00%	52.13	0.10
MassSpecGym	MIST + MSNovelist ∗	0.00%	45.55	0.06	0.00%	30.13	0.15
MassSpecGym	SELFIES Transformer ‡	0.00%	38.88	0.08	0.00%	26.87	0.13
MassSpecGym	Spec2Mol ∗	0.00%	37.76	0.12	0.00%	29.40	0.16
MassSpecGym	MIST + Neuraldecipher ∗	0.00%	33.19	0.14	0.00%	31.89	0.16
MassSpecGym	Random Generation ‡	0.00%	21.11	0.08	0.00%	18.26	0.11
MassSpecGym	MADGEN	0.8%	74.19	-	1.6%	53.50	-
MassSpecGym	DiffMS	2.30%	18.45	0.28	4.25%	14.73	0.39

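DiffMS achieves state-of-the-art performance on de novo structure elucidation benchmarks, outperforming the baselines on a range of metrics.
On NPLIB1, DiffMS reaches 8.34% top-1 accuracy and 15.44% top-10 accuracy, with a top-1 MCES of 11.95 and a Tanimoto similarity of 0.35 (top-1) to 0.47 (top-10).
On MassSpecGym, DiffMS reaches 2.30% top-1 accuracy and 4.25% top-10 accuracy, with a top-1 MCES of 18.45 and a Tanimoto similarity of 0.28 (top-1) to 0.39 (top-10).
Both encoder pretraining and larger decoder pretraining datasets yield substantial gains, and decoder pretraining shows clear performance scaling with dataset size.
Even when it fails to recover the exact structure, DiffMS consistently generates close matches, which supports its use as practical guidance for domain experts.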
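
For readers unfamiliar with the metrics, the sketch below shows one way top-k accuracy and top-k Tanimoto similarity can be computed for a single spectrum. It rests on assumptions: fingerprints are represented as sets of on-bit indices, structures are compared by a hypothetical canonical identifier string, and the top-k Tanimoto is taken as the best similarity among the first k candidates. The benchmark's exact conventions may differ.

```python
def tanimoto(a: set[int], b: set[int]) -> float:
    """Tanimoto (Jaccard) similarity between two fingerprints given as on-bit sets."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def top_k_hit(candidate_ids: list[str], true_id: str, k: int) -> bool:
    """True if the correct structure appears among the first k ranked candidates."""
    return true_id in candidate_ids[:k]


def top_k_tanimoto(candidate_fps: list[set[int]], true_fp: set[int], k: int) -> float:
    """Best Tanimoto similarity to the true structure among the first k candidates."""
    return max(tanimoto(fp, true_fp) for fp in candidate_fps[:k])


# Toy ranked candidates for one spectrum (identifiers and bits are made up).
true_id, true_fp = "mol_a", {1, 2, 3, 4}
candidate_ids = ["mol_b", "mol_a", "mol_c"]
candidate_fps = [{1, 2, 5}, {1, 2, 3, 4}, {7}]

print(top_k_hit(candidate_ids, true_id, 1))        # exact match at rank 1?
print(top_k_hit(candidate_ids, true_id, 10))       # exact match within top 10?
print(top_k_tanimoto(candidate_fps, true_fp, 1))   # closeness of the top candidate
print(top_k_tanimoto(candidate_fps, true_fp, 10))  # closeness of the best of 10
```

Averaging these per-spectrum values over a test set gives numbers of the kind reported in the table; a high Tanimoto with a missed exact match corresponds to the "close match" case described above.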

This review was generated by AI and reviewed by a human editor.