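[Paper Walkthrough] From Molecules to Materials: Pre-training Large Generalizable Models for Atomic Property Prediction
This work proposes Joint Multi-domain Pre-training (JMP), which trains a single model across multiple chemical domains. It achieves an average 59% improvement over training from scratch and matches or sets the state of the art on 34 of 40 tasks.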
Foundation models have been transformational in machine learning fields such as natural language processing and computer vision. Similar success in atomic property prediction has been limited due to the challenges of training effective models across multiple chemical domains. To address this, we introduce Joint Multi-domain Pre-training (JMP), a supervised pre-training strategy that simultaneously trains on multiple datasets from different chemical domains, treating each dataset as a unique pre-training task within a multi-task framework. Our combined training dataset consists of $\sim$120M systems from OC20, OC22, ANI-1x, and Transition-1x. We evaluate performance and generalization by fine-tuning over a diverse set of downstream tasks and datasets including: QM9, rMD17, MatBench, QMOF, SPICE, and MD22. JMP demonstrates an average improvement of 59% over training from scratch, and matches or sets state-of-the-art on 34 out of 40 tasks. Our work highlights the potential of pre-training strategies that utilize diverse data to advance property prediction across chemical domains, especially for low-data tasks. Please visit https://nima.sh/jmp for further information.
Motivation and Objectives
- Address the challenge of generalizing atomic property prediction across diverse chemical domains.
- Develop a scalable pre-training strategy that leverages large, heterogeneous datasets.
- Enable fine-tuning on downstream tasks with limited data while maintaining strong performance.
- Demonstrate transferability to unseen domains (large molecules and materials) beyond the pre-training domains.
Proposed Method
- Propose Joint Multi-domain Pre-training (JMP) as a multi-task supervised pre-training framework.
- Use a single backbone model (GemNet-OC) with per-dataset prediction heads for energies and forces.
- Normalize targets with linear energy referencing and force normalization to unit Gaussian per dataset.
- Apply temperature-based sampling to balance dataset sizes during batch construction.
- Introduce structure-wise loss reduction to balance contributions from datasets with different system sizes.
- Adopt unitary scalarization for the multi-task loss, combined with regularization (weight decay, edge dropout, EMA).
- Fine-tune by replacing the pre-training heads with task-specific heads, optionally computing forces as energy gradients.
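
Two of the balancing steps above can be illustrated with a short sketch. It is not the authors' implementation: the function names, the temperature values, and the per-dataset sizes are assumptions made for illustration (the sizes are rounded placeholders chosen to sum to roughly the 120M systems quoted in the abstract). The sketch assumes temperature-based sampling draws dataset i with probability proportional to (n_i / N)^(1/T), and that structure-wise reduction averages the per-atom force error inside each structure before averaging across structures.

```python
import math


def temperature_sampling_probs(sizes, temperature=2.0):
    """Per-dataset sampling probabilities, p_i proportional to (n_i / N) ** (1 / T).

    T = 1 reproduces size-proportional sampling; larger T flattens the
    distribution so small datasets are visited more often.
    """
    total = sum(sizes.values())
    weights = {name: (n / total) ** (1.0 / temperature) for name, n in sizes.items()}
    norm = sum(weights.values())
    return {name: w / norm for name, w in weights.items()}


def atom_wise_force_loss(pred, target):
    """Mean L2 force error over every atom in the batch (large structures dominate)."""
    errors = [
        math.dist(p, t)
        for struct_pred, struct_target in zip(pred, target)
        for p, t in zip(struct_pred, struct_target)
    ]
    return sum(errors) / len(errors)


def structure_wise_force_loss(pred, target):
    """Average the L2 force error per structure first, then across structures."""
    per_structure = []
    for struct_pred, struct_target in zip(pred, target):
        errors = [math.dist(p, t) for p, t in zip(struct_pred, struct_target)]
        per_structure.append(sum(errors) / len(errors))
    return sum(per_structure) / len(per_structure)


# Illustrative sizes only (assumption): rounded so that they sum to ~120M systems.
SIZES = {
    "OC20": 100_000_000,
    "OC22": 8_000_000,
    "ANI-1x": 2_000_000,
    "Transition-1x": 10_000_000,
}
probs_t1 = temperature_sampling_probs(SIZES, temperature=1.0)  # size-proportional
probs_t2 = temperature_sampling_probs(SIZES, temperature=2.0)  # flattened

# Toy batch (hypothetical): a 1-atom structure with error 1.0 and a
# 3-atom structure with zero error.
pred = [[(1.0, 0.0, 0.0)], [(0.0, 0.0, 0.0)] * 3]
target = [[(0.0, 0.0, 0.0)], [(0.0, 0.0, 0.0)] * 3]
atom_loss = atom_wise_force_loss(pred, target)         # 1 erroneous atom out of 4
struct_loss = structure_wise_force_loss(pred, target)  # 1 erroneous structure out of 2
```

With these placeholder sizes, raising the temperature moves probability mass from the largest dataset toward the smallest one. In the toy batch, the atom-wise loss gives the small structure one quarter of the weight, while the structure-wise loss gives it one half.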
Experimental Results
Research Questions
- RQ1: How well does a single pre-trained model trained on multiple chemical domains generalize to downstream tasks across small molecules, large molecules, and materials?
- RQ2: Does joint multi-domain pre-training outperform training from scratch and previous single-domain or self-supervised approaches on diverse benchmarks?
- RQ3: What are the effects of data balance, loss formulation, and regularization strategies on multi-task pre-training performance?
- RQ4: Can JMP enable large-model fine-tuning with limited downstream data and improve transfer to unseen domains?
Key Findings
- JMP yields an average 59% improvement over training from scratch on fine-tuning tasks.
- JMP matches or sets state-of-the-art on 34 out of 40 fine-tuning tasks across QM9, rMD17, MD22, SPICE, MatBench, and QMOF.
- A 235M-parameter JMP model achieves state-of-the-art performance on multiple low-data benchmarks.
- Fine-tuning JMP-L reaches the performance of GN-OC-L in about 1/12 the training time, indicating faster adaptation.
- Pre-training on diverse chemical data provides transferable representations that generalize to non-equilibrium configurations and out-of-domain targets (e.g., materials properties in MatBench and QMOF).
- The full JMP pre-training cost is offset by over 12x faster downstream fine-tuning compared to training from scratch.
This walkthrough was generated by AI and reviewed by a human editor.