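[Paper Summary] A critical look at the current train/test split in machine learning
The paper criticizes fixed train/test splits and proposes Adaptive Active Learning (AAL) with deletion to better handle cold-start and data-scarce scenarios, showing higher data efficiency on drug–protein binding and CIFAR-10.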
The randomized or cross-validated split of training and testing sets has been adopted as the gold standard of machine learning for decades. These split protocols rest on two assumptions: (i) the dataset is fixed and permanently static, so that different machine learning algorithms or models can be evaluated on it; (ii) a complete set of annotated data is available to researchers or industrial practitioners. In this article, we take a closer and critical look at the split protocol itself and point out its weaknesses and limitations, especially for industrial applications. In many real-world problems, assumption (ii) does not hold. For instance, in interdisciplinary applications like drug discovery, annotating data often requires real lab experiments, which are costly in both time and money. In other words, it can be very difficult or even impossible to satisfy assumption (ii). We address this problem by revisiting the paradigm of active learning and investigating its potential for solving problems under unconventional train/test split protocols. We further propose a new adaptive active learning architecture (AAL) that involves an adaptation policy, in contrast to traditional active learning, which only unidirectionally adds data points to the training pool. We primarily justify our points by extensively investigating an interdisciplinary drug-protein binding problem. We additionally evaluate AAL on more conventional machine learning benchmark datasets like CIFAR-10 to demonstrate the generalizability and efficacy of the new framework.
Research Motivation and Goals
- Argue that static train/test splits are limiting for real-world, data-scarce problems like drug discovery.
- Revisit active learning and highlight distribution-shift issues in traditional AL setups.
- Propose Adaptive Active Learning (AAL) with a deletion adaptation policy to improve data efficiency.
- Demonstrate AAL on interdisciplinary drug–protein binding data and benchmark datasets (CIFAR-10).
Proposed Method
- Introduce the Adaptive Active Learning (AAL) framework, which alternates data addition with a deletion-based adaptation step.
- Instantiate AAL with a simple deletion policy (AAL-delete) to remove ill-behaved data points after each addition.
- Define data quality metrics for adding/deleting (entropy, cosine distance in feature space, and uncertainty via a model ensemble/Dropout).
- Use a hybrid sampling strategy combining exploitation (high predicted affinity) and exploration (uncertainty/diversity) for additions.
- Evaluate on the KIBA protein–drug binding dataset and on CIFAR-10 to test generalizability beyond biology, without hyperparameter tuning of the models.
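The addition and deletion steps listed above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names, the `explore_frac` parameter, and the choice to delete the points with the highest "ill-behaved" score are all hypothetical, since this summary does not specify how the metrics are combined or which score drives deletion.

```python
import numpy as np

def predictive_entropy(probs):
    """Shannon entropy of each row of class probabilities: (n, k) -> (n,)."""
    p = np.clip(probs, 1e-12, 1.0)  # avoid log(0)
    return -(p * np.log(p)).sum(axis=1)

def min_cosine_distance(candidates, reference):
    """Cosine distance from each candidate feature vector to its nearest reference vector."""
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    r = reference / np.linalg.norm(reference, axis=1, keepdims=True)
    return (1.0 - c @ r.T).min(axis=1)

def ensemble_uncertainty(member_preds):
    """Standard deviation across ensemble members (or Dropout passes): (m, n) -> (n,)."""
    return member_preds.std(axis=0)

def hybrid_select(pred_mean, uncertainty, n_add, explore_frac=0.5):
    """Pick n_add pool positions: some by highest prediction (exploitation),
    the rest by highest uncertainty (exploration). explore_frac is an assumed knob."""
    n_explore = int(round(n_add * explore_frac))
    n_exploit = n_add - n_explore
    exploit = np.argsort(-pred_mean)[:n_exploit]
    remaining = np.setdiff1d(np.arange(len(pred_mean)), exploit)
    explore = remaining[np.argsort(-uncertainty[remaining])[:n_explore]]
    return np.concatenate([exploit, explore])

def delete_step(train_idx, train_scores, n_delete):
    """Drop the n_delete training points with the highest 'ill-behaved' score
    (assumed convention: larger score means worse)."""
    keep = np.argsort(train_scores)[: len(train_idx) - n_delete]
    return np.sort(train_idx[keep])

# One illustrative round on made-up numbers.
pred_mean = np.array([0.9, 0.1, 0.5, 0.2])    # predicted affinity per pool point
uncertainty = np.array([0.0, 0.8, 0.1, 0.3])  # e.g. from ensemble_uncertainty
added = hybrid_select(pred_mean, uncertainty, n_add=2)

train_idx = np.array([10, 11, 12, 13])
train_scores = np.array([0.1, 0.9, 0.2, 0.3])  # e.g. from predictive_entropy
kept = delete_step(train_idx, train_scores, n_delete=1)
```

In a full loop, the model would be retrained after each addition, the training-set scores recomputed, and the deletion step applied before the next round of selection.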
Experimental Results
Research Questions
- RQ1: Can adaptive data selection with addition and deletion improve data efficiency under non-static, distribution-shifting conditions?
- RQ2: Do conventional fixed train/test splits undermine deployment in drug discovery and similar domains?
- RQ3: How does AAL perform compared to traditional active learning and random sampling on real-world and standard benchmarks?
- RQ4: What are effective policies for adding and deleting data points in an iterative learning process?
- RQ5: Is AAL generalizable across domains like drug discovery and computer vision?
Key Findings
- On KIBA, AAL-Hybrid reaches a 0.3 coverage score faster and with less data than baselines.
- AAL-Hybrid uses fewer labeled samples to achieve similar coverage compared with Random and AL-Greedy, indicating higher data efficiency.
- On CIFAR-10, both AL and AAL outperform random sampling, and AAL shows stronger performance as the training set grows.
- AAL with deletion consistently outperforms AL-Hybrid and AL-Greedy in data-constrained scenarios.
- Hybrid strategies (combining addition with uncertainty-guided deletion) mitigate distribution shift and avoid local minima common to pure greedy strategies.
- The study demonstrates the framework’s generalizability beyond drug discovery to standard ML benchmarks.
This summary was generated by AI and reviewed by human editors.