QUICK REVIEW

[Paper Review] Leffingwell Odor Dataset

Benjamín Sánchez-Lengeling, Jennifer N. Wei | arXiv (Cornell University) | October 23, 2019

Olfactory and Sensory Function Studies | References: 39 | Citations: 95

One-Line Summary

The paper trains graph neural networks on a curated, expert-labeled QSOR dataset to predict odor descriptors from molecular graphs, establishing a learned odor space and showing transferability to related tasks.

ABSTRACT

NOTE: It's easier to download this dataset from pyrfume. Here's how:
<pre><code># First install pyrfume in your Python environment. This can be done easily with pip.
# pip install pyrfume
import pyrfume

molecules = pyrfume.load_data('leffingwell/molecules.csv', remote=True)
behavior = pyrfume.load_data('leffingwell/behavior.csv', remote=True)

# e.g. to count the number of molecules with each descriptor
behavior.sum().sort_values(ascending=False).astype(int)
</code></pre>
Predicting properties of molecules is an area of growing research in machine learning, particularly as models for learning from graph-valued inputs improve in sophistication and robustness. A molecular property prediction problem that has received comparatively little attention during this surge in research activity is building Structure-Odor Relationships (SOR) models (as opposed to Quantitative Structure-Activity Relationships, a term from medicinal chemistry). This is a 70+ year-old problem straddling chemistry, physics, neuroscience, and machine learning.

To spur development on the SOR problem, we curated and cleaned a dataset of 3523 molecules associated with expert-labeled odor descriptors from the Leffingwell PMP 2001 database. We provide featurizations of all molecules in the dataset using bit-based and count-based fingerprints, Mordred molecular descriptors, and the embeddings from our trained GNN model (Sanchez-Lengeling et al., 2019).

This dataset comprises the following files:
leffingwell_data.csv: molecular structures and what they smell like, along with train, test, and cross-validation splits. More detail on the file structure is found in leffingwell_readme.pdf.
leffingwell_embeddings.npz: several featurizations of the molecules in the dataset.
leffingwell_readme.pdf: a more detailed description of the data and its provenance, including expected performance metrics.
LICENSE: a copy of the CC-BY-NC license language.

The dataset, and all associated features, is freely available for research use under the CC-BY-NC license. If you use the data in a publication, please cite:
<pre>@article{sanchez2019machine,
  title={Machine learning for scent: Learning generalizable perceptual representations of small molecules},
  author={Sanchez-Lengeling, Benjamin and Wei, Jennifer N and Lee, Brian K and Gerkin, Richard C and Aspuru-Guzik, Al{\'a}n and Wiltschko, Alexander B},
  journal={arXiv preprint arXiv:1910.10685},
  year={2019}
}</pre>

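Motivation and Objectives

Frames QSOR as a challenging, long-standing problem that spans chemistry and neuroscience.
Builds a large, expert-labeled odor dataset by standardizing the descriptors used in perfumery databases.
Demonstrates that graph neural networks predict odor descriptors from molecular graphs more effectively than traditional baselines.
Shows that the learned odor embeddings capture perceptual structure and support transfer learning to new odor descriptors.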

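Proposed Method

Represents each molecule as a graph with atoms as nodes and bonds as edges.
Trains a graph neural network to predict 138 odor descriptors simultaneously (multi-label classification).
Compares the GNN against baselines (RF and k-NN) built on RDKit bit fingerprints, Morgan fingerprints, and Mordred features.
Uses the output of the GNN's penultimate layer as a fixed-dimensional odor embedding to analyze global and local structure.
Evaluates with AUROC, precision, and F1, and reports bootstrap-based confidence intervals.
Provides an appendix with hyperparameter tuning details and architectural variants (GCN vs. MPNN).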
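
To make the graph representation concrete, here is a minimal sketch of a molecule stored as nodes and edges, followed by one simplified message-passing step and a sum readout. The ethanol graph is written by hand, the two-element vocabulary is hypothetical, and the update rule (own features plus the sum of neighbor features, with no learned weights) is an illustrative assumption rather than the architecture used in the paper.

```python
# Toy illustration only: ethanol (CH3-CH2-OH), heavy atoms only.
# The feature choice and update rule are assumptions, not the paper's model.

atoms = ["C", "C", "O"]      # nodes
bonds = [(0, 1), (1, 2)]     # undirected edges (single bonds)

# One-hot node features over a tiny, hypothetical element vocabulary.
vocab = ["C", "O"]
features = [[1.0 if a == v else 0.0 for v in vocab] for a in atoms]


def neighbors(n_atoms, bonds):
    """Build an adjacency list from the bond list."""
    adj = {i: [] for i in range(n_atoms)}
    for i, j in bonds:
        adj[i].append(j)
        adj[j].append(i)
    return adj


def message_passing_step(features, adj):
    """Each atom adds the sum of its neighbors' features to its own."""
    updated = []
    for i, own in enumerate(features):
        agg = [sum(features[j][k] for j in adj[i]) for k in range(len(own))]
        updated.append([o + a for o, a in zip(own, agg)])
    return updated


def readout(features):
    """Sum-pool node features into one fixed-size molecule vector."""
    return [sum(col) for col in zip(*features)]


adj = neighbors(len(atoms), bonds)
hidden = message_passing_step(features, adj)
molecule_vector = readout(hidden)
print(hidden)            # per-atom features after one step
print(molecule_vector)   # fixed-size vector for the whole molecule
```

In a trained GNN the aggregation would be followed by learned transformations and nonlinearities, and a final layer would map the pooled vector to one score per odor descriptor.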

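Experimental Results

Research Questions

RQ1: Can a GNN learn odor representations from molecular graphs that generalize across many odor descriptors?
RQ2: Do the learned odor embeddings reflect perceptual relationships both globally (clusters by odor group) and locally (perceptually similar neighbors)?
RQ3: Do GNN embeddings transfer to predicting unseen or newly defined odor descriptors?
RQ4: Do the odor embeddings transfer beyond the training dataset to related olfactory prediction tasks?
RQ5: How does GNN-based QSOR performance compare with traditional feature-based baselines across many descriptors?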

Key Results

Model	AUROC (mean [CI])	Precision (mean [CI])	F1 (mean [CI])
GNN	0.894 [0.888, 0.902]	0.379 [0.351, 0.398]	0.360 [0.337, 0.372]
RF-Mordred	0.850 [0.838, 0.860]	0.311 [0.288, 0.333]	0.306 [0.283, 0.319]
RF-bFP	0.832 [0.821, 0.842]	0.321 [0.293, 0.339]	0.295 [0.272, 0.308]
RF-cFP	0.845 [0.835, 0.854]	0.315 [0.280, 0.332]	0.295 [0.272, 0.311]
KNN-bFP	0.791 [0.778, 0.803]	0.328 [0.305, 0.347]	0.323 [0.299, 0.335]
KNN-cFP	0.796 [0.785, 0.809]	0.333 [0.307, 0.351]	0.316 [0.292, 0.327]
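
The bracketed intervals in the table are bootstrap confidence intervals. As a rough sketch of how such an interval can be computed for a single descriptor, the code below implements AUROC from its pairwise-ranking definition and a percentile bootstrap over molecules. The labels and scores are made-up toy values, and the resample count, percentile method, and function names are assumptions; the paper's exact procedure (including how metrics are averaged across descriptors) may differ.

```python
import random


def auroc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(labels, scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for AUROC, resampling molecules with replacement."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_resamples:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # resample lacks one class, so AUROC is undefined; draw again
        stats.append(auroc(ys, [scores[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Toy data: 1 = molecule carries the descriptor, 0 = it does not.
labels = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.1, 0.7]

point = auroc(labels, scores)
low, high = bootstrap_ci(labels, scores)
print(f"AUROC {point:.3f} [{low:.3f}, {high:.3f}]")
```

With only eight molecules the interval is very wide; the narrow intervals in the table reflect a much larger test set.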

The GNN achieves a higher mean AUROC (0.894) than baselines such as the Mordred RF (0.850) and the Morgan-based RF (RF-cFP, 0.845).
The GNN has a higher AUROC than bit-based (bFP) and count-based (cFP) fingerprints on most descriptors.
The GNN embeddings organize odor space globally by perceptual similarity and cluster descriptors into meaningful regions (e.g., musk, cabbage, lily, grape).
Locally, k-NN on GNN embeddings retrieves perceptually similar molecules better than k-NN on fingerprints (AUROC 0.818 vs. 0.782).
The embeddings enable transfer learning to unseen descriptors and outperform Morgan fingerprints and Mordred features in ablation experiments.
In the DREAM Olfaction Prediction Challenge context, GNN embeddings perform competitively with the state of the art on mean Pearson's r (0.55 vs. 0.54).
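
The local-structure result rests on nearest-neighbor lookup in the embedding space. The sketch below shows the idea with hypothetical 3-dimensional embeddings and cosine similarity; the molecule names, the vectors, and the choice of cosine similarity are all assumptions made for illustration, since the review does not state which distance metric the paper uses.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def nearest_neighbors(query, embeddings, k=2):
    """Return the k molecules most similar to `query`, excluding the query itself."""
    ranked = sorted(
        (name for name in embeddings if name != query),
        key=lambda name: cosine_similarity(embeddings[query], embeddings[name]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical embeddings; real ones would come from the GNN's penultimate layer.
embeddings = {
    "mol_a": [0.9, 0.1, 0.0],
    "mol_b": [0.8, 0.2, 0.1],
    "mol_c": [0.0, 0.1, 0.9],
    "mol_d": [0.1, 0.9, 0.2],
}

print(nearest_neighbors("mol_a", embeddings, k=2))  # ['mol_b', 'mol_d']
```

A k-NN classifier built this way would predict a molecule's descriptors from those of its retrieved neighbors, which is how embeddings and fingerprints can be compared on equal footing.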


This review was generated by AI and reviewed by a human editor.