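[Paper Walkthrough] Leffingwell Odor Dataset
This paper trains graph neural networks on a curated, expert-labeled QSOR dataset to predict odor descriptors from molecular graphs, builds a learned odor space, and demonstrates transfer to related tasks.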
<strong>NOTE: It's easier to download this dataset from pyrfume. Here's how:</strong>
<pre><code># First install pyrfume in your Python environment, for example with pip:
#   pip install pyrfume
import pyrfume

molecules = pyrfume.load_data('leffingwell/molecules.csv', remote=True)
behavior = pyrfume.load_data('leffingwell/behavior.csv', remote=True)

# Example: count the number of molecules with each descriptor
behavior.sum().sort_values(ascending=False).astype(int)
</code></pre>
Predicting properties of molecules is an area of growing research in machine learning, particularly as models for learning from graph-valued inputs improve in sophistication and robustness. A molecular property prediction problem that has received comparatively little attention during this surge in research activity is building Structure-Odor Relationships (SOR) models (as opposed to Quantitative Structure-Activity Relationships, a term from medicinal chemistry). This is a 70+ year-old problem straddling chemistry, physics, neuroscience, and machine learning.
To spur development on the SOR problem, we curated and cleaned a dataset of 3523 molecules associated with expert-labeled odor descriptors from the <em>Leffingwell PMP 2001</em> database. We provide featurizations of all molecules in the dataset using bit-based and count-based fingerprints, Mordred molecular descriptors, and the embeddings from our trained GNN model (Sanchez-Lengeling et al., 2019).
This dataset comprises the following files:
<strong>leffingwell_data.csv</strong>: contains the molecular structures and what they smell like, along with train, test, and cross-validation splits. More detail on the file structure is found in leffingwell_readme.pdf.
<strong>leffingwell_embeddings.npz</strong>: contains several featurizations of the molecules in the dataset.
<strong>leffingwell_readme.pdf</strong>: a more detailed description of the data and its provenance, including expected performance metrics.
<strong>LICENSE</strong>: a copy of the CC-BY-NC license language.
The dataset, along with all associated features, is freely available for research use under the CC-BY-NC license. If you use the data in a publication, please cite:
<pre>@article{sanchez2019machine,
  title={Machine learning for scent: Learning generalizable perceptual representations of small molecules},
  author={Sanchez-Lengeling, Benjamin and Wei, Jennifer N and Lee, Brian K and Gerkin, Richard C and Aspuru-Guzik, Al{\'a}n and Wiltschko, Alexander B},
  journal={arXiv preprint arXiv:1910.10685},
  year={2019}
}</pre>
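Motivation and Goals
- Motivate QSOR as a long-standing challenge that spans chemistry and neuroscience.
- Create a large, expert-labeled odor dataset by standardizing the descriptions in perfumery databases.
- Show that graph neural networks predict odor descriptors from molecular graphs more effectively than traditional baselines.
- Show that the learned odor embeddings capture perceptual structure and support transfer learning to new odor descriptors.
Proposed Method
- Represent each molecule as a graph with atoms as nodes and bonds as edges.
- Train a graph neural network to predict 138 odor descriptors simultaneously (multi-label classification).
- Compare the GNN against baselines (random forest and k-NN) built on RDKit bit fingerprints, Morgan fingerprints, and Mordred features.
- Use the output of the penultimate layer as a fixed-dimensional odor embedding for analyzing global and local structure.
- Evaluate with AUROC, precision, and F1, and report bootstrap confidence intervals.
- Provide an appendix with hyperparameter-tuning details and architecture variants (GCN and MPNN).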
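To make the graph representation concrete, here is a minimal sketch of a molecule stored as atoms (nodes) and bonds (edges), followed by one round of neighbor aggregation and a sum readout. It is an illustration only, not the authors' architecture: the `MolGraph` class, the three-element one-hot featurization, and the mean-aggregation rule are all assumptions made for this example, and a trained GNN would apply learned weights and nonlinearities at each step.

```python
from dataclasses import dataclass


@dataclass
class MolGraph:
    """Hypothetical container: atoms are nodes, bonds are undirected edges."""
    atoms: list[str]               # one element symbol per atom
    bonds: list[tuple[int, int]]   # pairs of atom indices

    def neighbors(self) -> list[list[int]]:
        """Build an adjacency list from the undirected bond list."""
        adj = [[] for _ in self.atoms]
        for i, j in self.bonds:
            adj[i].append(j)
            adj[j].append(i)
        return adj


# Toy featurization (an assumption): one-hot over a tiny element vocabulary.
ELEMENTS = ["C", "O", "N"]


def one_hot(symbol: str) -> list[float]:
    return [1.0 if symbol == e else 0.0 for e in ELEMENTS]


def message_passing_step(graph: MolGraph, features: list[list[float]]) -> list[list[float]]:
    """New atom feature = own feature + mean of the neighbors' features."""
    adj = graph.neighbors()
    updated = []
    for i, own in enumerate(features):
        nbrs = adj[i]
        if nbrs:
            mean_msg = [sum(features[j][k] for j in nbrs) / len(nbrs)
                        for k in range(len(own))]
        else:
            mean_msg = [0.0] * len(own)
        updated.append([a + b for a, b in zip(own, mean_msg)])
    return updated


def readout(features: list[list[float]]) -> list[float]:
    """Sum-pool atom features into one fixed-size molecule vector."""
    return [sum(column) for column in zip(*features)]


# Ethanol, heavy atoms only: C-C-O
ethanol = MolGraph(atoms=["C", "C", "O"], bonds=[(0, 1), (1, 2)])
h0 = [one_hot(a) for a in ethanol.atoms]
h1 = message_passing_step(ethanol, h0)
embedding = readout(h1)
print(embedding)  # [4.5, 1.5, 0.0]
```

The readout produces a vector whose length does not depend on the number of atoms, which is the property that lets a fixed-dimensional embedding be compared across molecules of different sizes.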
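Experimental Results
Research Questions
- RQ1: Can a GNN learn, from molecular graphs, a general-purpose odor representation that generalizes across many odor descriptors?
- RQ2: Do the learned odor embeddings reflect perceptual relationships globally (clusters by odor group) and locally (perceptually similar neighbors)?
- RQ3: Do GNN embeddings transfer to predicting unseen or newly defined odor descriptors?
- RQ4: Do the odor embeddings transfer to related olfaction-prediction tasks beyond the training dataset?
- RQ5: In the multi-descriptor setting, how does GNN-based QSOR performance compare with traditional feature-based baselines?
Key Findings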
| Model | AUROC (mean [CI]) | Precision (mean [CI]) | F1 (mean [CI]) |
|---|---|---|---|
| GNN | 0.894 [0.888, 0.902] | 0.379 [0.351, 0.398] | 0.360 [0.337, 0.372] |
| RF-Mordred | 0.850 [0.838, 0.860] | 0.311 [0.288, 0.333] | 0.306 [0.283, 0.319] |
| RF-bFP | 0.832 [0.821, 0.842] | 0.321 [0.293, 0.339] | 0.295 [0.272, 0.308] |
| RF-cFP | 0.845 [0.835, 0.854] | 0.315 [0.280, 0.332] | 0.295 [0.272, 0.311] |
| KNN-bFP | 0.791 [0.778, 0.803] | 0.328 [0.305, 0.347] | 0.323 [0.299, 0.335] |
| KNN-cFP | 0.796 [0.785, 0.809] | 0.333 [0.307, 0.351] | 0.316 [0.292, 0.327] |
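The bracketed intervals in the table are bootstrap confidence intervals. As a rough sketch of how such an interval can be computed for a single descriptor, the code below resamples the test examples with replacement and takes percentiles of the resampled AUROC values. The number of resamples, the percentile method, the seed, and the toy labels and scores are all assumptions for illustration; this walkthrough does not state the exact protocol the paper used.

```python
import random


def auroc(labels, scores):
    """AUROC as the probability that a random positive outranks a random
    negative, with ties counted as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(labels, scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for AUROC (assumed protocol)."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_resamples:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # resample contains a single class; AUROC is undefined
        stats.append(auroc(ys, [scores[i] for i in idx]))
    stats.sort()
    lo = stats[round((alpha / 2) * (n_resamples - 1))]
    hi = stats[round((1 - alpha / 2) * (n_resamples - 1))]
    return lo, hi


# Invented toy predictions for one descriptor (1 = descriptor applies).
labels = [0, 0, 1, 1, 0, 1, 0, 1, 1, 0]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.5, 0.9, 0.6, 0.3]

point = auroc(labels, scores)      # 23 of 25 positive/negative pairs ranked correctly
lo, hi = bootstrap_ci(labels, scores)
print(f"AUROC = {point:.3f} [{lo:.3f}, {hi:.3f}]")
```

For a multi-label model, one plausible extension is to compute the per-descriptor AUROC inside each resample and average across descriptors before taking percentiles, which would yield a single interval for the mean as in the table.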
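- The GNN reaches a mean AUROC of 0.894, ahead of the Mordred-based RF (0.850) and the Morgan-based RF (0.845).
- The GNN outperforms bit-based (bFP) and count-based (cFP) fingerprints in AUROC on most descriptors.
- Globally, the GNN embeddings organize odor space by perceptual similarity, grouping descriptors such as musk, cabbage, lily, and grape into meaningful regions.
- Locally, k-NN on GNN embeddings retrieves perceptually similar molecules better than k-NN on fingerprints, with an AUROC of 0.818 versus 0.782.
- The embeddings enable transfer learning to unseen descriptors, outperforming Morgan fingerprints and Mordred features in ablation tests.
- On the DREAM Olfaction Prediction Challenge, the GNN embeddings are competitive with the state of the art in mean Pearson correlation r (0.55 versus 0.54).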
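The local-structure finding rests on nearest-neighbor retrieval in embedding space. A minimal sketch of that idea follows, assuming cosine similarity as the metric (this walkthrough does not say which distance the paper used). The molecule names and 3-dimensional vectors are invented for illustration; real embeddings would come from the trained GNN's penultimate layer.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbors(query, embeddings, k=2):
    """Return the k molecules most similar to `query`, excluding itself."""
    sims = [(cosine(embeddings[query], vec), name)
            for name, vec in embeddings.items() if name != query]
    sims.sort(reverse=True)
    return [name for _, name in sims[:k]]


# Hypothetical toy embeddings: mol_a and mol_b point one way, mol_c and mol_d another.
toy_embeddings = {
    "mol_a": [0.9, 0.1, 0.0],
    "mol_b": [0.8, 0.2, 0.1],
    "mol_c": [0.0, 0.1, 0.9],
    "mol_d": [0.1, 0.0, 1.0],
}

print(nearest_neighbors("mol_a", toy_embeddings, k=1))  # ['mol_b']
print(nearest_neighbors("mol_c", toy_embeddings, k=1))  # ['mol_d']
```

A k-NN classifier built on this retrieval would predict a query molecule's descriptors from the labels of its retrieved neighbors, so retrieval quality directly bounds how well the local comparison can score.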
This walkthrough was generated by AI and reviewed by a human editor.