QUICK REVIEW

[Paper Review] Synergy Effect between Convolutional Neural Networks and the Multiplicity of SMILES for Improvement of Molecular Prediction

Talia B. Kimber, Sebastian Engelke|arXiv (Cornell University)|2018. 12. 11.

Machine Learning in Materials Science|References: 6|Citations: 42

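One-Line Summary

This paper introduces the Convolutional Neural Fingerprint (CNF) model, which applies CNNs to SMILES representations and exploits the multiplicity of SMILES for data augmentation. It achieves accuracy competitive with traditional descriptors and often improves results on small data sets.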

ABSTRACT

In our study, we demonstrate the synergy effect between convolutional neural networks and the multiplicity of SMILES. The model we propose, the so-called Convolutional Neural Fingerprint (CNF) model, reaches the accuracy of traditional descriptors such as Dragon (Mauri et al. [22]), RDKit (Landrum [18]), CDK2 (Willighagen et al. [43]) and PyDescriptor (Masand and Rastija [20]). Moreover the CNF model generally performs better than highly fine-tuned traditional descriptors, especially on small data sets, which is of great interest for the chemical field where data sets are generally small due to experimental costs, the availability of molecules or accessibility to private databases. We evaluate the CNF model along with SMILES augmentation during both training and testing. To the best of our knowledge, this is the first time that such a methodology is presented. We show that using the multiplicity of SMILES during training acts as a regulariser and therefore avoids overfitting and can be seen as ensemble learning when considered for testing.

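Research Motivation and Goals

Demonstrate how the synergy between convolutional neural networks and multiple SMILES representations of the same molecule works in molecular prediction.
Show that the multiplicity of SMILES acts as data augmentation and as a regulariser for the CNF model.
Compare the performance of CNF with traditional descriptors and other neural network models across regression and classification tasks.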

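Proposed Method

Represents each SMILES as a one-hot encoded string that is processed by CNN layers to produce a neural fingerprint.
Includes flat and hierarchical CNN architectures inspired by ResNet and the neural fingerprint concept.
Applies hashing after the convolutions to map locality-sensitive embeddings into dense features.
Applies SMILES augmentation during both training and testing to obtain data augmentation and ensemble effects.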
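
The first step above, one-hot encoding followed by a convolution, can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the character-level vocabulary, the zero padding, the single hand-written filter, and the helper names (build_vocab, one_hot, conv1d_maxpool) are all assumptions made for illustration. The review does not specify the paper's tokenisation or architecture details.

```python
from typing import List, Sequence


def build_vocab(smiles_list: Sequence[str]) -> List[str]:
    """Collect the sorted set of characters seen in the SMILES strings.

    Assumption: one token per character (so 'Cl' would be split in two).
    """
    return sorted({ch for s in smiles_list for ch in s})


def one_hot(smiles: str, vocab: Sequence[str], max_len: int) -> List[List[int]]:
    """Encode a SMILES string as a max_len x len(vocab) matrix of 0/1.

    Positions past the end of the string stay all-zero (padding).
    A character missing from vocab raises KeyError.
    """
    index = {ch: i for i, ch in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in range(max_len)]
    for pos, ch in enumerate(smiles[:max_len]):
        matrix[pos][index[ch]] = 1
    return matrix


def conv1d_maxpool(matrix: List[List[int]], kernel: List[List[float]]) -> float:
    """Slide one filter along the sequence and keep the maximum response."""
    width = len(kernel)
    responses = []
    for start in range(len(matrix) - width + 1):
        total = 0.0
        for k in range(width):
            total += sum(w * x for w, x in zip(kernel[k], matrix[start + k]))
        responses.append(total)
    return max(responses)


# Three valid SMILES strings for the same molecule (ethanol).
ethanol = ["CCO", "OCC", "C(O)C"]
vocab = build_vocab(ethanol)  # ['(', ')', 'C', 'O']

# Hand-written width-2 filter that responds most strongly to "C" followed by "O".
co_filter = [[0.0, 0.0, 1.0, 0.0],
             [0.0, 0.0, 0.0, 1.0]]

features = {s: conv1d_maxpool(one_hot(s, vocab, max_len=5), co_filter)
            for s in ethanol}
print(features)
```

In this toy setting the same molecule produces different filter responses depending on which SMILES string is used, which is the property that makes SMILES multiplicity usable as data augmentation.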

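Experimental Results

Research Questions

RQ1: Can CNN-based feature extraction from SMILES compete with traditional molecular descriptors on QSAR/QSPR tasks?
RQ2: Does exploiting the multiplicity of SMILES during training and testing improve predictive performance over using canonical SMILES alone?
RQ3: How does CNF performance on regression and classification targets vary with data set size?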

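Key Results

With SMILES augmentation, CNF often matches or outperforms traditional descriptors such as Dragon, RDKit, CDK2 and PyDescriptor.
Augmenting SMILES during training substantially improves predictive performance, consistent with the usual benefits of data augmentation.
Augmentation at test time alone generally degrades performance, suggesting that the model maps non-canonical SMILES poorly without prior exposure to them.
Using SMILES augmentation in both training and testing achieves the best performance, combining the data augmentation and ensemble effects.
On several regression and classification targets, CNF often matches or outperforms state-of-the-art models from DeepChem.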
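
The train-time and test-time use of SMILES multiplicity described above can be sketched as follows. This is a hedged illustration, not the paper's code: expand_training_set, predict_with_augmentation, and toy_predict are hypothetical names, the labels are made-up placeholders, the SMILES variants are written by hand rather than enumerated by a toolkit, and simple averaging is assumed as the aggregation rule.

```python
from statistics import mean
from typing import Callable, List, Sequence, Tuple


def expand_training_set(
    records: Sequence[Tuple[Sequence[str], float]]
) -> List[Tuple[str, float]]:
    """Train-time augmentation: pair every SMILES variant with its molecule's label."""
    return [(s, label) for variants, label in records for s in variants]


def predict_with_augmentation(
    predict: Callable[[str], float], variants: Sequence[str]
) -> float:
    """Test-time augmentation: average predictions over SMILES of one molecule."""
    if not variants:
        raise ValueError("need at least one SMILES variant")
    return mean(predict(s) for s in variants)


def toy_predict(smiles: str) -> float:
    """Hypothetical stand-in model whose output depends on the string form."""
    return float(len(smiles)) + (0.5 if smiles.startswith("O") else 0.0)


# Hand-written SMILES variants; the labels 1.0 and 0.0 are placeholders, not data.
records = [
    (["CCO", "OCC", "C(O)C"], 1.0),  # ethanol
    (["CC", "C(C)"], 0.0),           # ethane
]
train_pairs = expand_training_set(records)  # 5 (SMILES, label) pairs

canonical_only = toy_predict("CCO")
ensembled = predict_with_augmentation(toy_predict, records[0][0])
print(len(train_pairs), canonical_only, ensembled)
```

Averaging over variants is what gives test-time augmentation its ensemble character: each SMILES string acts like a separate view of the molecule, and the final prediction no longer depends on a single string form.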

This review was generated by AI and reviewed by a human editor.