QUICK REVIEW

[Paper Review] Analyzing Learned Molecular Representations for Property Prediction

Kevin Yang, Kyle L. Swanson|arXiv (Cornell University)|Apr 2, 2019

Computational Drug Discovery Methods|Cited by 204

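One-Sentence Summary

The paper benchmarks learned molecular representations against fixed descriptors, introduces a new Directed MPNN (D-MPNN) built on bond-centered message passing, reports strong performance on both public and proprietary datasets, and emphasizes scaffold-based splits for assessing robust generalization.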

ABSTRACT

Advancements in neural machinery have led to a wide range of algorithmic solutions for molecular property prediction. Two classes of models in particular have yielded promising results: neural networks applied to computed molecular fingerprints or expert-crafted descriptors, and graph convolutional neural networks that construct a learned molecular representation by operating on the graph structure of the molecule. However, recent literature has yet to clearly determine which of these two methods is superior when generalizing to new chemical space. Furthermore, prior research has rarely examined these new models in industry research settings in comparison to existing employed models. In this paper, we benchmark models extensively on 19 public and 16 proprietary industrial datasets spanning a wide variety of chemical endpoints. In addition, we introduce a graph convolutional model that consistently matches or outperforms models using fixed molecular descriptors as well as previous graph neural architectures on both public and proprietary datasets. Our empirical findings indicate that while approaches based on these representations have yet to reach the level of experimental reproducibility, our proposed model nevertheless offers significant improvements over models currently used in industrial workflows.

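Motivation and Objectives

Evaluate how learned molecular representations compare with conventional fingerprints and descriptors for property prediction.
Develop and evaluate a graph-based model (D-MPNN) that generalizes to new chemical space.
Combine learned representations with fixed descriptor features to improve accuracy and robustness.
Evaluate models on both public data and large proprietary industrial datasets to gauge real-world applicability.
Investigate how the split strategy (scaffold vs. random) and hyperparameter optimization affect performance.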

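Proposed Method

Introduce a Directed MPNN (D-MPNN) whose bond-centered message passing reduces totters, that is, messages that immediately travel back across the bond they just crossed.
Pair the bond-centered message passing with a readout that aggregates the hidden states into a single molecular representation.
At readout, supplement the learned representation with 200 global molecular features computed with RDKit.
Tune the hyperparameters (depth, hidden size, number of layers, dropout) with Bayesian optimization.
Use ensembling to improve predictive performance, and report results for both single models and ensembles.
Train end-to-end on molecular graphs for supervised property prediction tasks.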
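
The bond-centered scheme described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: hidden states live on directed bonds, and the message into bond v -> w sums the states of bonds k -> v while skipping the reverse bond w -> v. The function names, the scalar `weight` standing in for a learned weight matrix, and the simplified sum readout are all assumptions made for this sketch; the full model presumably uses learned matrices and atom features that are omitted here. The sketch also assumes the molecule has at least one bond.

```python
from collections import defaultdict


def relu(vec):
    return [max(0.0, x) for x in vec]


def directed_message_passing(bonds, init_hidden, depth, weight=1.0):
    """Run `depth` rounds of bond-centered message passing.

    bonds: undirected bonds as (atom_u, atom_v) index pairs.
    init_hidden: dict mapping every directed bond (u, v) to a list of floats.
    weight: scalar stand-in for a learned weight matrix (assumption).
    """
    incoming = defaultdict(list)  # atom -> directed bonds pointing into it
    directed = []
    for u, v in bonds:
        for edge in ((u, v), (v, u)):
            directed.append(edge)
            incoming[edge[1]].append(edge)

    h0 = {edge: list(init_hidden[edge]) for edge in directed}
    hidden = dict(h0)
    dim = len(next(iter(h0.values())))

    for _ in range(depth):
        updated = {}
        for v, w in directed:
            message = [0.0] * dim
            for k, _ in incoming[v]:
                if k == w:
                    continue  # skip the reverse bond w -> v (no totter)
                message = [m + x for m, x in zip(message, hidden[(k, v)])]
            # Skip connection to the initial state, then a nonlinearity.
            updated[(v, w)] = relu(
                [a + weight * m for a, m in zip(h0[(v, w)], message)]
            )
        hidden = updated
    return hidden


def readout(hidden, global_features=()):
    """Sum all directed-bond states, then append global features."""
    dim = len(next(iter(hidden.values())))
    pooled = [0.0] * dim
    for state in hidden.values():
        pooled = [p + x for p, x in zip(pooled, state)]
    return pooled + list(global_features)
```

For a molecule with a single bond, every directed bond has only its own reverse as a neighbor, so the hidden states stay at their initial values no matter how many rounds are run. An atom-centered scheme would instead pass the same information back and forth between the two atoms. The `global_features` argument mimics appending precomputed descriptors (such as the 200 RDKit features) to the learned vector.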

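Experimental Results

Research Questions

RQ1: Do graph-based learned representations (D-MPNN) outperform fixed descriptors across diverse datasets?
RQ2: How do scaffold-based data splits affect generalization and model rankings compared with random splits?
RQ3: Does combining learned representations with fixed descriptor features improve predictive accuracy and robustness?
RQ4: How much do hyperparameter optimization and ensembling affect model performance across public and proprietary datasets?
RQ5: How well do learned representations generalize on industrial benchmarks relative to state-of-the-art baselines?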

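Key Findings

D-MPNN with bond-centered messages consistently matches or outperforms descriptor-based models and earlier graph models on both public and proprietary datasets.
Hybrid models that combine learned representations with fixed descriptors achieve higher performance and better generalization than either approach alone.
Scaffold-based splits give a more realistic assessment of generalization and more closely resemble the time-based splits used in industry.
Bayesian hyperparameter optimization substantially improves performance, and ensembling yields further gains.
On a substantial portion of the datasets, D-MPNN matches or outperforms the MoleculeNet baselines and the model of Mayr et al., with particularly strong results on regression tasks and on many classification datasets.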
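
To make the scaffold-versus-random distinction concrete, here is a minimal sketch of one common greedy scaffold split. It is an illustration under stated assumptions, not necessarily the exact procedure used in the paper. The function name `scaffold_split` and the input `scaffold_of` (a hypothetical, precomputed mapping from molecule ID to a scaffold string, for example a Bemis-Murcko scaffold SMILES produced by a cheminformatics toolkit) are invented for this example. Molecules that share a scaffold always land in the same subset, so the validation and test sets contain only scaffolds unseen during training.

```python
from collections import defaultdict


def scaffold_split(scaffold_of, frac_train=0.8, frac_val=0.1):
    """Greedily assign whole scaffold groups to train, val, or test.

    scaffold_of: dict mapping molecule ID -> scaffold string (assumed to be
    precomputed elsewhere; this sketch does no chemistry itself).
    """
    groups = defaultdict(list)
    for mol_id, scaffold in scaffold_of.items():
        groups[scaffold].append(mol_id)

    # Largest groups first; ties broken by scaffold string for determinism.
    ordered = sorted(groups.items(), key=lambda kv: (-len(kv[1]), kv[0]))

    n = len(scaffold_of)
    train_cap = frac_train * n
    val_cap = frac_val * n

    train, val, test = [], [], []
    for _, members in ordered:
        if len(train) + len(members) <= train_cap:
            train.extend(members)
        elif len(val) + len(members) <= val_cap:
            val.extend(members)
        else:
            test.extend(members)  # whatever fits nowhere else goes to test
    return train, val, test
```

Because big scaffold families fill the training set first, the held-out sets end up dominated by rare scaffolds, which is what makes this split harder than a random one. A random split would instead scatter members of the same scaffold family across all three subsets.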

This review was generated by AI and checked by a human editor.