QUICK REVIEW

[Paper Review] Scaffold Splits Overestimate Virtual Screening Performance

Qianrong Guo, Saiveth Hernández-Hernández|arXiv (Cornell University)|Jun 2, 2024

Molecular Biology Techniques and Applications|Cited by 6

One-Sentence Summary

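This paper shows that scaffold-based data splits overestimate virtual screening performance, because molecules with different scaffolds are often similar, which raises the similarity between the training and test sets; UMAP-based splits yield markedly worse performance across models and datasets.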

ABSTRACT

Virtual Screening (VS) of vast compound libraries guided by Artificial Intelligence (AI) models is a highly productive approach to early drug discovery. Data splitting is crucial for better benchmarking of such AI models. Traditional random data splits produce similar molecules between training and test sets, conflicting with the reality of VS libraries which mostly contain structurally distinct compounds. Scaffold split, grouping molecules by shared core structure, is widely considered to reflect this real-world scenario. However, here we show that the scaffold split also overestimates VS performance. The reason is that molecules with different chemical scaffolds are often similar, which hence introduces unrealistically high similarities between training molecules and test molecules following a scaffold split. Our study examined three representative AI models on 60 NCI-60 datasets, each with approximately 30,000 to 50,000 molecules tested on a different cancer cell line. Each dataset was split with three methods: scaffold, Butina clustering and the more accurate Uniform Manifold Approximation and Projection (UMAP) clustering. Regardless of the model, model performance is much worse with UMAP splits from the results of the 2100 models trained and evaluated for each algorithm and split. These robust results demonstrate the need for more realistic data splits to tune, compare, and select models for VS. For the same reason, avoiding the scaffold split is also recommended for other molecular property prediction problems. The code to reproduce these results is available at https://github.com/ScaffoldSplitsOverestimateVS

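Motivation and Objectives

Evaluate whether scaffold-based data splits provide a realistic benchmark of AI model performance in virtual screening (VS).
Compare scaffold-based splits with clustering-based splits (Butina, UMAP) on diverse, VS-like datasets.
Assess how the splitting methodology affects the model performance reported on large compound libraries.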

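Proposed Method

Use three data splitting strategies: scaffold split, Butina clustering, and UMAP clustering.
Train and evaluate three representative AI models on 60 NCI-60 datasets (each with roughly 30,000 to 50,000 molecules).
Compare performance using the 2100 models trained and evaluated for each algorithm and split.
Assess how much the scaffold split raises train-test similarity compared with more realistic splits.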
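
All three strategies above share one mechanism: every molecule gets a group label (a scaffold or a cluster id), and whole groups are assigned to either the training or the test set. The sketch below illustrates only that shared step in plain Python. It is not the authors' code: the function name `group_split`, its parameters, and the toy labels are assumptions made for illustration, and computing real scaffolds or Butina/UMAP clusters would require a cheminformatics toolkit that is not shown here.

```python
import random
from collections import defaultdict


def group_split(group_ids, test_fraction=0.2, seed=0):
    """Split sample indices so that no group spans both train and test.

    group_ids holds one label per molecule (hypothetical input), e.g. a
    scaffold SMILES or a cluster id computed elsewhere.
    """
    groups = defaultdict(list)
    for index, group in enumerate(group_ids):
        groups[group].append(index)

    # Shuffle whole groups, never individual molecules.
    keys = sorted(groups, key=str)
    random.Random(seed).shuffle(keys)

    target = test_fraction * len(group_ids)
    train, test = [], []
    for key in keys:
        # Fill the test set group by group until it reaches the target size.
        if len(test) < target:
            test.extend(groups[key])
        else:
            train.extend(groups[key])
    return sorted(train), sorted(test)


# Toy labels standing in for scaffold or cluster assignments.
labels = ["A", "A", "B", "C", "C", "C", "D", "E", "E", "F"]
train_idx, test_idx = group_split(labels, test_fraction=0.3, seed=1)
```

Because groups are kept intact, the test set can end up somewhat larger than the requested fraction. The choice of grouping is what distinguishes the three splits: the coarser and more chemically meaningful the groups, the less similar the held-out molecules are to the training data.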

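Experimental Results

Research Questions

RQ1: Does the scaffold split overestimate VS performance relative to clustering-based splits?
RQ2: How do UMAP and Butina clustering splits affect model performance on large, VS-like datasets?
RQ3: Does the scaffold split give VS models an overly optimistic benchmark, and if so, by how much?
RQ4: Should the scaffold split also be avoided in other molecular property prediction tasks?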

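Key Findings

Scaffold splits report higher performance than more realistic splits, across models and datasets.
UMAP-based splits yield markedly worse performance, challenging the view that the scaffold split is a realistic proxy for virtual screening data splits.
For each algorithm and split, 2100 models were trained and evaluated across the 60 datasets, covering three splitting methods and three models, which points to a robust trend.
The results indicate that the scaffold split should be avoided in VS benchmarking, and possibly in other molecular property prediction problems as well.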
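
The similarity argument behind these findings can be made concrete by measuring, for each test molecule, its highest Tanimoto similarity to any training molecule. The following is a minimal sketch under stated assumptions: fingerprints are represented as plain Python sets of on-bit indices, the toy values are made up, and the helper names `tanimoto` and `nearest_train_similarity` are illustrative rather than taken from the paper or any library.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def nearest_train_similarity(train_fps, test_fps):
    """For each test fingerprint, the highest similarity to any training one."""
    return [max(tanimoto(t, r) for r in train_fps) for t in test_fps]


# Toy fingerprints as sets of on-bit indices (made up for illustration).
train_fps = [{1, 2, 3, 4}, {10, 11, 12}]
test_fps = [{1, 2, 3, 5}, {20, 21}]

scores = nearest_train_similarity(train_fps, test_fps)
print(scores)  # [0.6, 0.0]
```

A split whose test molecules mostly score high on this measure is testing the model on near neighbors of its training data, which is the situation the paper attributes to scaffold splits.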


This review was generated by AI and reviewed by human editors.