QUICK REVIEW

[Paper Review] Scaffold Splits Overestimate Virtual Screening Performance

Qianrong Guo, Saiveth Hernández-Hernández|arXiv (Cornell University)|Jun 2, 2024

Molecular Biology Techniques and Applications | Cited by 6

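One-line summary

This paper shows that scaffold-based data splits overestimate virtual screening performance. The reason is that molecules with different scaffolds are often still similar to one another, so a scaffold split leaves training and test molecules unrealistically close. UMAP-based splits reveal much worse performance regardless of the model.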

ABSTRACT

Virtual Screening (VS) of vast compound libraries guided by Artificial Intelligence (AI) models is a highly productive approach to early drug discovery. Data splitting is crucial for better benchmarking of such AI models. Traditional random data splits produce similar molecules between training and test sets, conflicting with the reality of VS libraries which mostly contain structurally distinct compounds. Scaffold split, grouping molecules by shared core structure, is widely considered to reflect this real-world scenario. However, here we show that the scaffold split also overestimates VS performance. The reason is that molecules with different chemical scaffolds are often similar, which hence introduces unrealistically high similarities between training molecules and test molecules following a scaffold split. Our study examined three representative AI models on 60 NCI-60 datasets, each with approximately 30,000 to 50,000 molecules tested on a different cancer cell line. Each dataset was split with three methods: scaffold, Butina clustering and the more accurate Uniform Manifold Approximation and Projection (UMAP) clustering. Regardless of the model, model performance is much worse with UMAP splits from the results of the 2100 models trained and evaluated for each algorithm and split. These robust results demonstrate the need for more realistic data splits to tune, compare, and select models for VS. For the same reason, avoiding the scaffold split is also recommended for other molecular property prediction problems. The code to reproduce these results is available at https://github.com/ScaffoldSplitsOverestimateVS
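The abstract's central argument is that training and test molecules remain similar after a scaffold split. One common way to quantify this is the Tanimoto similarity between each test molecule and its nearest training molecule. The following is a minimal sketch of that measurement, not the authors' code: it assumes each molecule is already represented as a set of "on" fingerprint bit indices, and the names `tanimoto`, `nearest_train_similarity`, and the toy fingerprints are hypothetical.

```python
from typing import FrozenSet, List, Sequence

# Assumption: a molecule is represented by the indices of its "on" fingerprint bits.
Fingerprint = FrozenSet[int]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto (Jaccard) similarity: shared bits divided by the union of bits."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def nearest_train_similarity(
    train: Sequence[Fingerprint], test: Sequence[Fingerprint]
) -> List[float]:
    """For each test molecule, return its highest similarity to any training molecule.

    Assumes `train` is non-empty.
    """
    return [max(tanimoto(t, s) for s in train) for t in test]


# Toy fingerprints (hypothetical, not taken from the NCI-60 data).
train_fps = [frozenset({1, 2, 3, 4}), frozenset({10, 11, 12})]
test_fps = [frozenset({1, 2, 3, 5}), frozenset({20, 21})]

similarities = nearest_train_similarity(train_fps, test_fps)
print(similarities)
```

A split whose test molecules mostly have high nearest-training similarity is easier for a model than a split where these values are low, which is the effect the abstract attributes to the scaffold split.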

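Motivation and objectives

Assess whether scaffold-based data splits provide a realistic benchmark for AI models used in virtual screening (VS).
Compare the scaffold split against clustering-based splits (Butina, UMAP) across diverse VS-like datasets.
Evaluate how the choice of split method affects the model performance reported on large compound libraries.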

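Proposed method

Use three data split strategies: scaffold, Butina clustering, and UMAP clustering.
Train and evaluate three representative AI models on 60 NCI-60 datasets (approximately 30,000 to 50,000 molecules per dataset).
Compare performance using the 2,100 models trained and evaluated for each algorithm and split.
Assess how much the scaffold split inflates the similarity between training and test sets relative to more realistic splits.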
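All three strategies above share one mechanic: every molecule receives a group label (a scaffold, a Butina cluster, or a UMAP cluster), and whole groups are assigned to either the training or the test side so that no group is split across both. The sketch below illustrates only that shared mechanic. It is not the authors' pipeline: the function `group_split`, the `test_fraction` parameter, the largest-groups-first heuristic, and the toy labels are all assumptions made for illustration, and computing the labels themselves is out of scope here.

```python
from collections import defaultdict
from typing import Hashable, List, Sequence, Tuple


def group_split(
    group_ids: Sequence[Hashable], test_fraction: float = 0.2
) -> Tuple[List[int], List[int]]:
    """Split molecule indices so that no group spans both train and test.

    Heuristic (an assumption of this sketch): fill the training set with the
    largest groups first; groups that no longer fit go to the test set.
    """
    groups = defaultdict(list)
    for index, group_id in enumerate(group_ids):
        groups[group_id].append(index)

    # Largest groups first; sorted() is stable, so ties keep first-seen order.
    ordered = sorted(groups.values(), key=len, reverse=True)

    train_capacity = (1.0 - test_fraction) * len(group_ids)
    train: List[int] = []
    test: List[int] = []
    for members in ordered:
        if len(train) + len(members) <= train_capacity:
            train.extend(members)
        else:
            test.extend(members)
    return train, test


# Hypothetical group labels for eight molecules (e.g. scaffold or cluster ids).
labels = ["A", "A", "A", "B", "B", "C", "D", "E"]
train_idx, test_idx = group_split(labels, test_fraction=0.25)
print(train_idx, test_idx)
```

Only the source of the labels changes between the scaffold, Butina, and UMAP variants. The paper's point is that scaffold labels leave structurally similar molecules on both sides, whereas cluster labels derived from overall similarity separate them more strictly.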

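Experimental results

Research questions

RQ1: Does the scaffold split overestimate VS performance compared with clustering-based splits?
RQ2: How do UMAP and Butina clustering splits affect model performance on large VS-like datasets?
RQ3: Does the scaffold split produce an overly optimistic benchmark for VS models, and if so, to what extent?
RQ4: Should the scaffold split also be avoided in other molecular property prediction tasks?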

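Key findings

The scaffold split leads to higher reported performance than more realistic splits, across models and datasets.
UMAP-based splits yield markedly worse performance, challenging the view that the scaffold split is a realistic proxy for VS conditions.
The study covers 60 datasets, three split methods, and three models, with 2,100 models trained and evaluated for each algorithm and split, which supports the robustness of the trend.
The results suggest that the scaffold split should be avoided for VS benchmarking and, for the same reason, for other molecular property prediction problems.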


This review was generated by AI and checked by a human editor.