QUICK REVIEW

[Paper Review] WideDTA: prediction of drug-target binding affinity

Hakime Öztürk, Elif Özkırımlı | arXiv (Cornell University) | 2019-02-04

Computational Drug Discovery Methods | References: 9 | Citations: 152

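One-line summary

WideDTA predicts drug-target binding affinity from word-based representations of protein sequences, ligand SMILES, PROSITE domains/motifs, and ligand maximum common substructures (MCS), and it outperforms DeepDTA on benchmark datasets. Domain/motif and MCS information provides only limited benefit on kinase-centric data.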

ABSTRACT

Motivation: Prediction of the interaction affinity between proteins and compounds is a major challenge in the drug discovery process. WideDTA is a deep-learning based prediction model that employs chemical and biological textual sequence information to predict binding affinity. Results: WideDTA uses four text-based information sources, namely the protein sequence, ligand SMILES, protein domains and motifs, and maximum common substructure words to predict binding affinity. WideDTA outperformed one of the state of the art deep learning methods for drug-target binding affinity prediction, DeepDTA on the KIBA dataset with a statistical significance. This indicates that the word-based sequence representation adapted by WideDTA is a promising alternative to the character-based sequence representation approach in deep learning models for binding affinity prediction, such as the one used in DeepDTA. In addition, the results showed that, given the protein sequence and ligand SMILES, the inclusion of protein domain and motif information as well as ligand maximum common substructure words do not provide additional useful information for the deep learning model. Interestingly, however, using only domain and motif information to represent proteins achieved similar performance to using the full protein sequence, suggesting that important binding relevant information is contained within the protein motifs and domains.

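Motivation and objectives

Predict protein-ligand binding affinity using text-based representations of proteins and ligands.
Evaluate whether protein domain/motif (PDM) information and ligand maximum common substructure (LMCS) information add value to the prediction.
Compare the word-based WideDTA with earlier character-based models and with traditional methods on the Davis and KIBA datasets.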

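Proposed method

Represent each protein sequence as 3-character (three-residue) words (PS).
Represent each ligand SMILES string as 8-character words produced with a sliding window (LS).
Extract protein domains/motifs from PROSITE and represent them as 3-character words (PDM).
Extract ligand maximum common substructures and represent them as words (LMCS).
Pass each information source through two 1D CNN layers followed by max pooling, concatenate the resulting features, and feed them through three dense layers with dropout.
Train and evaluate on the Davis and KIBA datasets using the Concordance Index (CI), MSE, and the Pearson correlation coefficient, and compare against KronRLS, SimBoost, and DeepDTA.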
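The word-based representations in the PS and LS steps can be illustrated with a minimal sketch. The helper `to_words`, its `stride` parameter, and the toy inputs are hypothetical and not taken from the paper's code. This summary does not state whether protein words overlap, so the stride is left configurable; stride 1 is used here purely for illustration.

```python
from typing import List


def to_words(sequence: str, size: int, stride: int = 1) -> List[str]:
    """Split a string into fixed-length words, one every `stride` characters.

    stride=1 gives an overlapping sliding window; stride=size gives
    non-overlapping words. Trailing characters that do not fill a whole
    word are dropped.
    """
    if size <= 0 or stride <= 0:
        raise ValueError("size and stride must be positive")
    return [sequence[i:i + size]
            for i in range(0, len(sequence) - size + 1, stride)]


# Toy inputs for illustration only (not drawn from Davis or KIBA).
protein = "MKVLAAGIV"
smiles = "CC(=O)OC1=CC=CC=C1C(=O)O"  # aspirin

# PS: 3-character protein words (the stride is an assumption, see above).
protein_words = to_words(protein, size=3, stride=1)

# LS: 8-character SMILES words from a sliding window.
ligand_words = to_words(smiles, size=8, stride=1)

print(protein_words)
print(ligand_words[:3])
```

In a full pipeline each distinct word would then be mapped to an integer index before being fed to an embedding layer and the CNN blocks; that step is omitted here.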

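Experimental results

Research questions

RQ1: Do word-based representations of proteins and ligands improve binding affinity prediction compared with character-based methods?
RQ2: Do PDM and LMCS words provide additional predictive value beyond the full protein sequence and ligand SMILES alone?
RQ3: How does WideDTA compare with state-of-the-art methods on the Davis and KIBA benchmarks?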

Key results

WideDTA with all four modules performs best on Davis: CI 0.886 and MSE 0.262.
On KIBA, the best WideDTA configuration achieves CI 0.875 and MSE 0.179.
PS + LS alone already outperforms DeepDTA on both datasets (Davis CI 0.874; KIBA CI 0.874).
Adding PDM did not substantially improve performance on Davis or KIBA, and in some cases PDM alone performed similarly to the full sequence.
LMCS gave a marginal benefit on Davis but was less advantageous than LS on KIBA.
WideDTA variants built on LS/PS (and their combinations with PDM/LMCS) consistently outperform KronRLS and SimBoost. On Davis, the best WideDTA CI is 0.886 versus 0.871 for KronRLS and 0.872 for SimBoost; on KIBA, the best WideDTA CI is 0.875 versus 0.863 for DeepDTA.
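For readers unfamiliar with the metrics behind these figures, the following is a minimal sketch of how CI and MSE are commonly computed. It assumes the usual pairwise definition of the Concordance Index: over all pairs with different true affinities, a pair counts 1 if the predictions are ordered the same way and 0.5 if the predictions are tied. The function names and toy values are hypothetical and are not the paper's results or code.

```python
from itertools import combinations
from typing import Sequence


def concordance_index(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Fraction of comparable pairs whose predicted order matches the true order."""
    concordant = 0.0
    comparable = 0
    for (t_i, p_i), (t_j, p_j) in combinations(zip(y_true, y_pred), 2):
        if t_i == t_j:
            continue  # pairs with equal true affinity are not comparable
        comparable += 1
        if p_i == p_j:
            concordant += 0.5  # tied predictions count as half
        elif (p_i > p_j) == (t_i > t_j):
            concordant += 1.0
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


def mean_squared_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average squared difference between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy affinities for illustration only.
y_true = [5.0, 6.2, 7.1, 8.4]
y_pred = [5.3, 6.0, 7.5, 7.4]

ci = concordance_index(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
print(f"CI={ci:.3f}, MSE={mse:.4f}")
```

A CI of 1.0 means every comparable pair is ranked correctly and 0.5 corresponds to random ordering, so higher CI and lower MSE are better. The pairwise loop is quadratic in the number of samples, which is acceptable for a sketch but slow on datasets as large as KIBA.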


This review was generated by AI and checked by a human editor.