QUICK REVIEW

[Paper Review] To tune or not to tune the number of trees in random forest?

Philipp Probst, Anne‐Laure Boulesteix|arXiv (Cornell University)|May 16, 2017

Machine Learning and Data Classification | Cited by 107

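One-sentence summary

This paper shows, both theoretically and empirically, that the expected error rate of a random forest for classification can be non-monotonic as the number of trees (T) increases, whereas the Brier score, the logarithmic loss, and the regression MSE are monotonic in T. It argues against tuning T and recommends using a large, computationally feasible value instead.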

ABSTRACT

The number of trees T in the random forest (RF) algorithm for supervised learning has to be set by the user. It is controversial whether T should simply be set to the largest computationally manageable value or whether a smaller T may in some cases be better. While the principle underlying bagging is that "more trees are better", in practice the classification error rate sometimes reaches a minimum before increasing again for increasing number of trees. The goal of this paper is four-fold: (i) providing theoretical results showing that the expected error rate may be a non-monotonous function of the number of trees and explaining under which circumstances this happens; (ii) providing theoretical results showing that such non-monotonous patterns cannot be observed for other performance measures such as the Brier score and the logarithmic loss (for classification) and the mean squared error (for regression); (iii) illustrating the extent of the problem through an application to a large number (n = 306) of datasets from the public database OpenML; (iv) finally arguing in favor of setting it to a computationally feasible large number, depending on convergence properties of the desired performance measure.

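Motivation and objectives

Examine whether the number of trees T in a random forest should be tuned or simply set to a large, computationally feasible value.
Characterize theoretically how the expected error rate behaves as T increases.
Empirically assess how often non-monotonic error-rate patterns occur across a large number of datasets.
Provide practical guidance on choosing T and introduce the OOBCurve tool for assessing convergence.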

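Proposed method

Derive theoretical expressions for the expected performance measures (error rate, Brier score, logarithmic loss) as functions of T, using a per-observation prediction difficulty ε_i.
Show that, in classification, the error rate can be non-monotonic in T, whereas the Brier score and the logarithmic loss decrease strictly monotonically as T increases; the AUC can also be non-monotonic.
Analyze the behavior of the AUC and adapt the model to the out-of-bag (OOB) error setting.
Conduct a large-scale empirical study on 193 classification tasks and 113 regression tasks from OpenML, inspecting OOB curves obtained with 2000 trees and 1000 random seeds.
Provide the R package OOBCurve, which computes OOB curves for a range of performance measures.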
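The per-observation view above can be made concrete with a small calculation. The following is a minimal sketch, not the paper's own code: it assumes, for illustration, that each tree independently misclassifies observation i with probability ε_i, that the forest predicts by majority vote, and that ties are broken by a fair coin. The function names `expected_error` and `expected_brier` are hypothetical, and the Brier-type score here is the squared fraction of trees voting for the wrong class.

```python
from math import comb


def expected_error(eps: float, T: int) -> float:
    """Expected majority-vote error for one observation.

    X ~ Binomial(T, eps) counts the trees that vote for the wrong class.
    The forest errs if X > T/2; a tie (X == T/2) is an error half the time.
    """
    total = 0.0
    for k in range(T + 1):
        p_k = comb(T, k) * eps**k * (1 - eps) ** (T - k)
        if 2 * k > T:
            total += p_k
        elif 2 * k == T:
            total += 0.5 * p_k
    return total


def expected_brier(eps: float, T: int) -> float:
    """Expected squared fraction of wrong votes, E[(X/T)^2].

    Since E[X/T] = eps and Var[X/T] = eps * (1 - eps) / T,
    this equals eps^2 + eps * (1 - eps) / T.
    """
    return eps**2 + eps * (1 - eps) / T


if __name__ == "__main__":
    odd_T = [1, 3, 11, 51, 101]
    for eps in (0.4, 0.6):
        errors = [round(expected_error(eps, T), 4) for T in odd_T]
        briers = [round(expected_brier(eps, T), 4) for T in odd_T]
        print(f"eps={eps}: error={errors} brier={briers}")
```

Under these assumptions, an observation with ε_i below 0.5 has an error that shrinks as T grows, while one with ε_i above 0.5 has an error that grows with T. The Brier-type score decreases in T in both cases, because only its variance term depends on T.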

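Experimental results

Research questions

RQ1: Is the expected classification error rate monotonic in the number of trees T, or can it be non-monotonic under particular data conditions?
RQ2: Are other performance measures (Brier score, logarithmic loss, MSE, AUC) monotonic in T, and under what circumstances?
RQ3: How common are non-monotonic error-rate patterns in real data, and can dataset characteristics predict them?
RQ4: In practice, should T be tuned, or should a large, computationally feasible T be chosen based on convergence properties?
RQ5: Does the OOBCurve tool help in assessing convergence and guiding the choice of T?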

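Key findings

For some observations the expected classification error rate does not decrease with T, and as a result the error curve averaged over a dataset can be non-monotonic.
In binary classification, the Brier score and the logarithmic loss decrease strictly in expectation as T increases, whereas the AUC can be non-monotonic in some cases.
In regression, the mean squared error decreases with T, although some median-based error measures may show non-monotonic behavior in certain regimes.
Empirically, about 10% of the OpenML datasets showed non-monotonic OOB error-rate curves, and ε_i values close to 0.5 promoted this effect.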
Non-monotonic patterns were more common in small datasets, and the OOB curves had generally converged by 2000 trees.
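The study supports using a large, computationally feasible T rather than tuning it, with convergence diagnostics of the chosen performance measure as an aid.
It introduces the R package OOBCurve, which computes OOB curves for multiple performance measures.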
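To illustrate how a dataset-level error curve can dip and then rise, the sketch below averages per-observation expected errors over a made-up dataset. It is only an illustration under stated assumptions: tree votes are treated as independent with per-observation error probability ε_i, ties are broken by a fair coin, and the mixture of ε_i values (half at 0.3, half at 0.55) and the helper names are hypothetical, not taken from the paper's experiments.

```python
from math import comb


def expected_error(eps: float, T: int) -> float:
    """Expected majority-vote error for one observation (ties count half)."""
    total = 0.0
    for k in range(T + 1):
        p_k = comb(T, k) * eps**k * (1 - eps) ** (T - k)
        if 2 * k > T:
            total += p_k
        elif 2 * k == T:
            total += 0.5 * p_k
    return total


def mean_error_curve(eps_values, tree_counts):
    """Average the per-observation expected errors for each number of trees."""
    n = len(eps_values)
    return [sum(expected_error(e, T) for e in eps_values) / n for T in tree_counts]


# Hypothetical dataset: 50 fairly easy observations and 50 that are
# slightly harder than a coin flip (eps just above 0.5).
eps_values = [0.3] * 50 + [0.55] * 50
tree_counts = [1, 5, 11, 21, 51, 101, 201]

curve = mean_error_curve(eps_values, tree_counts)
best_T = tree_counts[curve.index(min(curve))]

if __name__ == "__main__":
    for T, err in zip(tree_counts, curve):
        print(f"T={T:4d}  mean expected error={err:.4f}")
    print("lowest mean error on this grid at T =", best_T)
```

In this toy setting the easy observations reach near-zero error quickly, while the observations with ε_i slightly above 0.5 keep drifting toward certain misclassification, so the averaged curve reaches a minimum at an intermediate T and then increases. The analogous average of ε_i² + ε_i(1 − ε_i)/T, the Brier-type score under the same assumptions, decreases for every T.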

This review was generated by AI and checked by a human editor.