QUICK REVIEW

[Paper Review] Towards Foundational Models for Molecular Learning on Large-Scale Multi-Task Datasets

Dominique Beaini, Shenyang Huang|arXiv (Cornell University)|Jan 1, 2023

Machine Learning in Materials Science|Cited by 6

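One-sentence summary

The paper introduces seven new large-scale multi-task molecular datasets, grouped by size into three categories (ToyMix, LargeMix, and UltraLarge), that cover nearly 100 million molecules and more than 13 billion quantum and biological property labels across over 3,000 tasks. The authors also present Graphium, a deep learning library designed for efficient multi-task, multi-level graph learning, and report that training on diverse supervised data improves performance on low-resource biological tasks, which supports the feasibility of foundation models for molecular AI.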

ABSTRACT

Recently, pre-trained foundation models have enabled significant advancements in multiple fields. In molecular machine learning, however, where datasets are often hand-curated, and hence typically small, the lack of datasets with labeled features, and codebases to manage those datasets, has hindered the development of foundation models. In this work, we present seven novel datasets categorized by size into three distinct categories: ToyMix, LargeMix and UltraLarge. These datasets push the boundaries in both the scale and the diversity of supervised labels for molecular learning. They cover nearly 100 million molecules and over 3000 sparsely defined tasks, totaling more than 13 billion individual labels of both quantum and biological nature. In comparison, our datasets contain 300 times more data points than the widely used OGB-LSC PCQM4Mv2 dataset, and 13 times more than the quantum-only QM1B dataset. In addition, to support the development of foundational models based on our proposed datasets, we present the Graphium graph machine learning library which simplifies the process of building and training molecular machine learning models for multi-task and multi-level molecular datasets. Finally, we present a range of baseline results as a starting point of multi-task and multi-level training on these datasets. Empirically, we observe that performance on low-resource biological datasets show improvement by also training on large amounts of quantum data. This indicates that there may be potential in multi-task and multi-level training of a foundation model and fine-tuning it to resource-constrained downstream tasks.

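Research motivation and goals

Address the lack of large-scale, labeled, multi-task molecular datasets, which are a prerequisite for training foundation models in molecular machine learning.
Overcome the limitations of self-supervised pre-training by introducing supervised, multi-modal labels drawn from quantum mechanics and wet-lab biological assays.
Enable efficient training on large, heterogeneous molecular datasets by developing the Graphium deep learning library.
Establish strong baseline models and empirical evidence that multi-task, multi-level pre-training on diverse molecular properties improves performance on low-resource downstream tasks.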

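Proposed method

Curate and augment existing molecular datasets with quantum mechanical (QM) properties computed with DFT (e.g., B3LYP) and semi-empirical methods (e.g., PM6).
Collect bioactivity labels from high-throughput assays, including dose-response profiles, gene expression, and toxicology data, and combine them with the QM properties to obtain multi-level (node-level and graph-level) labels.
Design three dataset categories (ToyMix, LargeMix, and UltraLarge) that range from small scale to near-complete coverage of PubChem, for a total of 13.04 billion labels.
Develop Graphium, a PyTorch-based library optimized for multi-task, multi-level graph learning, with support for mixed-precision training, model pipelining, and distributed inference.
Implement baseline models with message-passing neural networks and transformers, and train them on each dataset category to evaluate transfer learning performance.
Combine regression and classification objectives across quantum and biological tasks to maximize the information available for pre-training.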
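
The abstract describes the tasks as "sparsely defined", meaning most molecules carry labels for only some tasks, and the last step above mixes regression and classification objectives. One common way to train under those conditions is to mask out missing labels before averaging the loss. The block below is a minimal pure-Python sketch of that idea, not Graphium's implementation: the function name masked_multitask_loss, the use of NaN to mark a missing label, and the equal weighting of tasks are all assumptions made for illustration.

```python
import math


def masked_multitask_loss(preds, labels, task_types):
    """Average per-task losses while skipping missing (NaN) labels.

    preds, labels: lists of rows, one row per molecule, one column per task.
    task_types: one entry per task, either "regression" or "classification".
    For classification tasks, predictions are raw logits and labels are 0/1.
    (Hypothetical helper for illustration; equal task weights are assumed.)
    """
    task_losses = []
    for t, kind in enumerate(task_types):
        total, count = 0.0, 0
        for pred_row, label_row in zip(preds, labels):
            y = label_row[t]
            if math.isnan(y):
                continue  # this molecule has no label for task t
            p = pred_row[t]
            if kind == "regression":
                total += (p - y) ** 2  # squared error
            else:
                # Binary cross-entropy on a sigmoid-transformed logit.
                prob = 1.0 / (1.0 + math.exp(-p))
                prob = min(max(prob, 1e-12), 1.0 - 1e-12)
                total += -(y * math.log(prob) + (1.0 - y) * math.log(1.0 - prob))
            count += 1
        if count > 0:
            task_losses.append(total / count)  # mean over labeled molecules only
    # Tasks with no labels in this batch contribute nothing.
    return sum(task_losses) / len(task_losses) if task_losses else 0.0


nan = float("nan")
example_loss = masked_multitask_loss(
    preds=[[1.0, 0.0], [2.0, 0.0]],
    labels=[[1.5, 1.0], [nan, nan]],  # second molecule is unlabeled for both tasks
    task_types=["regression", "classification"],
)
```

In this example only the first molecule contributes: the regression task yields a squared error of 0.25 and the classification task yields log 2, so the result is their mean. Averaging per task first, then across tasks, keeps a densely labeled quantum task from drowning out a sparsely labeled biological one; other weighting schemes are equally plausible.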

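Experimental results

Research questions

RQ1: Can large-scale, multi-task, multi-level molecular datasets that contain both quantum and biological labels effectively support the pre-training of foundation models for molecular machine learning?
RQ2: Does pre-training on diverse supervised data improve performance on low-resource biological property prediction tasks compared with self-supervised or single-task pre-training?
RQ3: To what extent do multi-task and multi-level training objectives improve generalization and transferability in molecular modeling tasks?
RQ4: How do the proposed datasets compare with existing benchmarks such as OGB-LSC and QM1B in data volume and label diversity?
RQ5: Can a unified deep learning library such as Graphium effectively support training and inference on such large, heterogeneous molecular datasets across multiple hardware platforms?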

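Key findings

The proposed datasets contain 300 times more data points than the widely used OGB-LSC PCQM4Mv2 dataset and 13 times more than the quantum-only QM1B dataset.
The datasets cover nearly 100 million molecules and over 3,000 sparsely defined tasks, totaling more than 13 billion individual labels of quantum and biological properties.
Baseline results show that performance on low-resource biological datasets improves when training also includes large amounts of quantum data, which suggests there may be potential for transfer learning.
The Graphium library supports efficient training on large-scale multi-task datasets, including mixed precision and distributed training across multiple accelerators.
The number of labels in the proposed datasets (13.04 billion) approaches the scale of pre-training data used by foundational NLP models such as GPT-2, suggesting a comparable capacity for pre-training molecular representations.
Jointly modeling quantum and biological properties can improve model generalization, which supports the hypothesis that diverse supervised pre-training matters for building effective molecular foundation models.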
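
As a rough sanity check on what "sparsely defined" means at this scale, the headline figures can be combined into an average label count per molecule. The sketch below uses only numbers quoted in this review, with "nearly 100 million" rounded to 100 million and "over 3,000" rounded to 3,000; the variable names are illustrative. Treat the result as intuition only: node-level tasks contribute one label per atom, and the datasets do not all share the same tasks, so the true fill rate of any single task is not given by this average.

```python
# Figures quoted in this review (rounded; see the caveats in the text above).
TOTAL_LABELS = 13.04e9   # 13.04 billion labels
N_MOLECULES = 100e6      # "nearly 100 million" molecules, rounded up
N_TASKS = 3000           # "over 3,000" tasks, rounded down

# Average number of labels attached to each molecule.
labels_per_molecule = TOTAL_LABELS / N_MOLECULES

# Share of a hypothetical dense molecule-by-task grid that would be filled
# if every label were graph-level (an assumption, not a claim from the paper).
avg_fill = labels_per_molecule / N_TASKS

print(f"labels per molecule: {labels_per_molecule:.1f}")
print(f"average fill of a dense grid: {avg_fill:.1%}")
```

Under these assumptions each molecule carries on the order of 130 labels on average, a small fraction of the 3,000-plus tasks, which is consistent with the need to handle missing labels during training.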

This review was generated by AI and reviewed by human editors.