QUICK REVIEW

[Paper Review] Model Evaluation, Model Selection, and Algorithm Selection in Machine Learning

Sebastian Raschka|arXiv (Cornell University)|Nov 13, 2018

Machine Learning and Data Classification | References: 27 | Citations: 255

One-sentence summary

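A survey of methods for estimating generalization performance, with guidance on model selection, algorithm selection, bias, cross-validation, the bootstrap, and statistical tests.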

ABSTRACT

The correct use of model evaluation, model selection, and algorithm selection techniques is vital in academic machine learning research as well as in many industrial settings. This article reviews different techniques that can be used for each of these three subtasks and discusses the main advantages and disadvantages of each technique with references to theoretical and empirical studies. Further, recommendations are given to encourage best yet feasible practices in research and applications of machine learning. Common methods such as the holdout method for model evaluation and selection are covered, which are not recommended when working with small datasets. Different flavors of the bootstrap technique are introduced for estimating the uncertainty of performance estimates, as an alternative to confidence intervals via normal approximation if bootstrapping is computationally feasible. Common cross-validation techniques such as leave-one-out cross-validation and k-fold cross-validation are reviewed, the bias-variance trade-off for choosing k is discussed, and practical tips for the optimal choice of k are given based on empirical evidence. Different statistical tests for algorithm comparisons are presented, and strategies for dealing with multiple comparisons such as omnibus tests and multiple-comparison corrections are discussed. Finally, alternative methods for algorithm selection, such as the combined F-test 5x2 cross-validation and nested cross-validation, are recommended for comparing machine learning algorithms when datasets are small.
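As an illustration of the "confidence intervals via normal approximation" mentioned in the abstract, here is a minimal Python sketch. It is not code from the paper. It assumes each test-set prediction is an independent Bernoulli trial, and the function name `accuracy_confidence_interval` and the example counts (85 correct out of 100) are hypothetical.

```python
import math


def accuracy_confidence_interval(n_correct, n_total, z=1.96):
    """Normal-approximation interval for classification accuracy.

    z=1.96 corresponds to a two-sided 95% interval. The approximation
    is only reasonable when n_total is fairly large and the accuracy
    is not too close to 0 or 1.
    """
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    if not 0 <= n_correct <= n_total:
        raise ValueError("n_correct must lie between 0 and n_total")

    acc = n_correct / n_total
    # Standard error of a binomial proportion.
    std_err = math.sqrt(acc * (1.0 - acc) / n_total)
    # Clip to [0, 1] because accuracy cannot leave that range.
    lower = max(0.0, acc - z * std_err)
    upper = min(1.0, acc + z * std_err)
    return acc, lower, upper


# Hypothetical example: 85 of 100 test examples classified correctly.
acc, lower, upper = accuracy_confidence_interval(85, 100)
print(f"accuracy={acc:.3f}, 95% CI=[{lower:.3f}, {upper:.3f}]")
```

For these hypothetical counts the interval runs from roughly 0.78 to 0.92. When the test set is small, the bootstrap variants reviewed in the paper may be preferable if they are computationally feasible.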

Motivation and objectives

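Explain how to estimate generalization performance and how that estimate relates to model selection and algorithm selection.
Contrast the holdout, cross-validation, and bootstrap approaches and explain their bias-variance trade-offs.
Discuss statistical tests and multiple-comparison corrections for comparing algorithms.
Recommend best practices for small and large datasets and give guidance on feature selection during model selection.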

Proposed method

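Reviews and consolidates the evaluation and selection techniques commonly used in machine learning.
Discusses assumptions such as i.i.d. data and the terminology that distinguishes the evaluation tasks.
Explains holdout validation, stratification, and pessimistic bias with illustrations.
Presents bootstrapping and repeated holdout as ways to quantify uncertainty.
Explains k-fold cross-validation, its bias-variance trade-off, and its implications for model selection.
Surveys statistical tests for comparing classifiers and algorithms (e.g., the F-test, McNemar's test, Dietterich's 5x2cv paired t-test, and Alpaydin's combined 5x2cv F-test) and discusses nested cross-validation.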
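Stratification and k-fold cross-validation, both covered in the list above, can be sketched together in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the function name `stratified_k_fold_indices`, the round-robin assignment of examples to folds, and the 80/20 class split in the example are all hypothetical choices.

```python
import random
from collections import defaultdict


def stratified_k_fold_indices(labels, k=5, seed=0):
    """Return k (train_idx, test_idx) pairs with similar class ratios per fold."""
    if k < 2:
        raise ValueError("k must be at least 2")
    rng = random.Random(seed)

    # Group example indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Shuffle within each class, then deal indices to folds round-robin
    # so every fold receives about the same share of each class.
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)

    # Each fold serves as the test set once; the rest form the training set.
    splits = []
    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(
            idx for j, fold in enumerate(folds) if j != i for idx in fold
        )
        splits.append((train_idx, test_idx))
    return splits


# Hypothetical imbalanced dataset: 80 examples of class 0, 20 of class 1.
labels = [0] * 80 + [1] * 20
splits = stratified_k_fold_indices(labels, k=5, seed=42)
for train_idx, test_idx in splits:
    positives = sum(labels[i] for i in test_idx)
    print(len(train_idx), len(test_idx), positives)
```

In this example every test fold holds 20 examples, 4 of them from the minority class, which matches the 20% class ratio of the full dataset. In practice a library routine such as scikit-learn's `StratifiedKFold` would likely be used instead of a hand-written split.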

Results

Research questions

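RQ1: What are the advantages and limitations of holdout validation for model evaluation and selection?
RQ2: How do the bootstrap and other resampling methods compare with cross-validation when estimating the uncertainty of performance metrics?
RQ3: Which statistical tests are appropriate for comparing classifiers and algorithms, and how should multiple comparisons be handled?
RQ4: What are the practical recommendations for algorithm and model selection on small versus large datasets?
RQ5: How does stratification affect performance estimates and the bias and variance of cross-validation?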

Key findings

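Holdout validation is simple but can be biased on small datasets; stratification can reduce both bias and variance.
Repeated holdout and bootstrap methods give more robust uncertainty estimates than a single split.
k-fold cross-validation involves a bias-variance trade-off in the choice of k, which affects its use in model selection.
Several statistical tests (e.g., McNemar's test, the F-test, Dietterich's 5x2cv paired t-test, and Alpaydin's combined 5x2cv F-test) support algorithm comparison, and multiple comparisons should be controlled for.
For comparing algorithms on small datasets, nested cross-validation and the combined 5x2cv F-test are recommended.
The paper emphasizes practical considerations such as the law of parsimony, the risk of overfitting, and the importance of keeping training and test data separate.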
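The combined 5x2cv F-test named above can be sketched as follows. This is a minimal illustration of Alpaydin's statistic, not code from the paper. It assumes the ten performance differences between two algorithms (5 repetitions of 2-fold cross-validation) have already been computed. The function name `combined_5x2cv_f_statistic` and the example differences are hypothetical.

```python
def combined_5x2cv_f_statistic(differences):
    """Combined 5x2cv F statistic from a 5x2 table of score differences.

    differences[i][j] is the difference in performance between the two
    algorithms on fold j of repetition i. Under the null hypothesis the
    statistic is approximately F-distributed with 10 and 5 degrees of
    freedom.
    """
    if len(differences) != 5 or any(len(row) != 2 for row in differences):
        raise ValueError("expected 5 repetitions with 2 folds each")

    squared_sum = 0.0   # sum of all ten squared differences
    variance_sum = 0.0  # sum of the five per-repetition variance estimates
    for d1, d2 in differences:
        mean = (d1 + d2) / 2.0
        variance_sum += (d1 - mean) ** 2 + (d2 - mean) ** 2
        squared_sum += d1 ** 2 + d2 ** 2

    if variance_sum == 0.0:
        raise ValueError("variance estimate is zero; statistic is undefined")
    return squared_sum / (2.0 * variance_sum)


# Hypothetical accuracy differences (algorithm A minus algorithm B).
example_differences = [
    (0.02, 0.04),
    (0.03, 0.01),
    (0.05, 0.03),
    (0.02, 0.02),
    (0.04, 0.06),
]
f_stat = combined_5x2cv_f_statistic(example_differences)
print(f"F = {f_stat:.3f}")
```

The statistic would be compared against an F distribution with 10 and 5 degrees of freedom, whose 5% critical value is about 4.74 in standard tables. If SciPy is available, `scipy.stats.f.sf(f_stat, 10, 5)` should give the corresponding p-value.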

This review was generated by AI and checked by a human editor.