QUICK REVIEW

[Paper Review] A critical look at the current train/test split in machine learning

Jimin Tan, Jianan Yang|arXiv (Cornell University)|Jun 8, 2021

Machine Learning and Algorithms | References: 39 | Citations: 43

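One-line summary

This paper criticizes the fixed train/test split and proposes Adaptive Active Learning (AAL) with deletion as a better way to handle cold-start and data-scarce settings, showing improved data efficiency on drug–protein binding and on CIFAR-10.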

ABSTRACT

The randomized or cross-validated split of training and testing sets has been adopted as the gold standard of machine learning for decades. The establishment of these split protocols are based on two assumptions: (i)-fixing the dataset to be eternally static so we could evaluate different machine learning algorithms or models; (ii)-there is a complete set of annotated data available to researchers or industrial practitioners. However, in this article, we intend to take a closer and critical look at the split protocol itself and point out its weakness and limitation, especially for industrial applications. In many real-world problems, we must acknowledge that there are numerous situations where assumption (ii) does not hold. For instance, for interdisciplinary applications like drug discovery, it often requires real lab experiments to annotate data which poses huge costs in both time and financial considerations. In other words, it can be very difficult or even impossible to satisfy assumption (ii). In this article, we intend to access this problem and reiterate the paradigm of active learning, and investigate its potential on solving problems under unconventional train/test split protocols. We further propose a new adaptive active learning architecture (AAL) which involves an adaptation policy, in comparison with the traditional active learning that only unidirectionally adds data points to the training pool. We primarily justify our points by extensively investigating an interdisciplinary drug-protein binding problem. We additionally evaluate AAL on more conventional machine learning benchmarking datasets like CIFAR-10 to demonstrate the generalizability and efficacy of the new framework.

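Motivation and objectives

Argues that the static train/test split is a limitation for real-world, data-scarce problems such as drug discovery.
Revisits active learning and highlights the distribution-shift problem in conventional AL setups.
Proposes Adaptive Active Learning (AAL), which adds a deletion-based adaptation policy to improve data efficiency.
Demonstrates AAL on interdisciplinary drug–protein binding data and on a benchmark dataset (CIFAR-10).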

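Proposed method

Introduces the Adaptive Active Learning (AAL) framework, which alternates between adding data and a deletion-based adaptation step.
Instantiates AAL with a simple deletion policy (AAL-delete) that removes unsuitable data points after each addition.
Defines data-quality metrics for addition and deletion: entropy, cosine distance in feature space, and uncertainty estimated from model ensembles or dropout.
For addition, uses a hybrid sampling strategy that combines exploitation (high predicted affinity) with exploration (uncertainty and diversity).
Evaluates on KIBA protein–drug binding data and on CIFAR-10, to test generalizability beyond biology, without tuning model hyperparameters.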
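
The add-then-delete loop described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`hybrid_select`, `delete_step`, `aal_delete`), the `explore_fraction` parameter, the fixed number of deletions per round, and the caller-supplied `quality` function are all hypothetical. The review does not say how the metrics are combined or how many points are removed per round.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def cosine_distance(u, v):
    """1 - cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 if norm == 0 else 1.0 - dot / norm


def hybrid_select(candidates, score, uncertainty, k, explore_fraction=0.5):
    """Pick k items: some by highest predicted score, the rest by uncertainty."""
    n_explore = int(k * explore_fraction)  # assumed split; not from the paper
    exploit = sorted(candidates, key=score, reverse=True)[:k - n_explore]
    taken = set(exploit)
    rest = [c for c in candidates if c not in taken]
    explore = sorted(rest, key=uncertainty, reverse=True)[:n_explore]
    return exploit + explore


def delete_step(train, quality, n_delete):
    """Drop the n_delete lowest-quality labeled points from the training pool."""
    if n_delete <= 0:
        return list(train)
    removed = set(sorted(train, key=quality)[:n_delete])
    return [item for item in train if item not in removed]


def aal_delete(pool, label, fit, score, uncertainty, quality,
               seed_size, rounds, k_add, n_delete):
    """Alternate an addition step and a deletion step for a number of rounds."""
    pool = list(pool)
    train = [(x, label(x)) for x in pool[:seed_size]]  # initial labeled seed
    pool = pool[seed_size:]
    for _ in range(rounds):
        if not pool:
            break
        model = fit(train)
        picked = hybrid_select(pool, lambda x: score(model, x),
                               lambda x: uncertainty(model, x), k_add)
        picked_set = set(picked)
        pool = [x for x in pool if x not in picked_set]
        train.extend((x, label(x)) for x in picked)  # query the labeling oracle
        model = fit(train)  # refit before judging data quality
        train = delete_step(train, lambda xy: quality(model, xy), n_delete)
    return train


# Toy usage on made-up 1-D data (not from the paper); the "model" is a mean label.
train = aal_delete(
    pool=range(20),
    label=lambda x: 2 * x,
    fit=lambda data: sum(y for _, y in data) / len(data),
    score=lambda model, x: x,
    uncertainty=lambda model, x: abs(2 * x - model),
    quality=lambda model, xy: -abs(xy[1] - model),
    seed_size=4, rounds=3, k_add=4, n_delete=1,
)
print(len(train))
```

In a real setting, `quality` could be built from `entropy`, `cosine_distance`, or an ensemble/dropout variance estimate, and `fit` would train an actual model. Items must be hashable here because the sketch uses sets to track what has been picked or removed.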

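Experimental results

Research questions

RQ1: Can adaptive data selection that includes both addition and deletion improve data efficiency under non-static, distribution-shifting conditions?
RQ2: Does the conventional fixed train/test split hinder deployment in drug discovery and similar domains?
RQ3: How does AAL perform against conventional active learning and random sampling on real-world data and standard benchmarks?
RQ4: Which policies for adding and deleting data points are effective in an iterative learning process?
RQ5: Does AAL generalize across fields such as drug discovery and computer vision?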

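Key findings

On KIBA, AAL-Hybrid reaches a coverage score of 0.3 sooner, and with less data, than the baselines.
AAL-Hybrid needs fewer labeled samples to reach comparable coverage, which indicates higher data efficiency.
On CIFAR-10, both AL and AAL outperform random sampling, with AAL performing better as the training set grows.
In data-constrained settings, AAL with deletion consistently outperforms AL-Hybrid and AL-Greedy.
A hybrid strategy that combines addition with uncertainty-based deletion mitigates distribution shift and avoids the local optima common to purely greedy strategies.
The study shows that the framework generalizes beyond drug discovery to standard ML benchmarks.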
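
One way to read the data-efficiency claims above is as "labels needed to reach a target score". The helper below is a hypothetical sketch of that comparison. The function name `labels_to_reach` and the learning curves are invented for illustration, the numbers are not results from the paper, and the review does not define how the coverage score is computed.

```python
def labels_to_reach(curve, target):
    """Return the smallest labeled-set size whose score meets the target.

    curve: iterable of (n_labeled, score) pairs in increasing n_labeled order.
    Returns None if the target is never reached.
    """
    for n_labeled, score in curve:
        if score >= target:
            return n_labeled
    return None


# Made-up learning curves for illustration only; NOT results from the paper.
curve_a = [(100, 0.10), (200, 0.31), (300, 0.40)]
curve_b = [(100, 0.08), (200, 0.20), (300, 0.33)]
print(labels_to_reach(curve_a, 0.3), labels_to_reach(curve_b, 0.3))
```

Under this reading, the method whose curve crosses the target at the smaller labeled-set size is the more data-efficient one.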

This review was generated by AI and checked by a human editor.