QUICK REVIEW

[Paper Review] To tune or not to tune the number of trees in random forest?

Philipp Probst, Anne‐Laure Boulesteix|arXiv (Cornell University)|May 16, 2017

Machine Learning and Data Classification|107 citations

TL;DR

The paper theoretically and empirically shows that the expected error rate of a random forest can be non-monotonic in the number of trees for classification, while Brier score, log loss, and regression MSE are monotonic in T; it argues against tuning T and recommends using a large, computationally feasible T.

ABSTRACT

The number of trees T in the random forest (RF) algorithm for supervised learning has to be set by the user. It is controversial whether T should simply be set to the largest computationally manageable value or whether a smaller T may in some cases be better. While the principle underlying bagging is that "more trees are better", in practice the classification error rate sometimes reaches a minimum before increasing again for increasing number of trees. The goal of this paper is four-fold: (i) providing theoretical results showing that the expected error rate may be a non-monotonous function of the number of trees and explaining under which circumstances this happens; (ii) providing theoretical results showing that such non-monotonous patterns cannot be observed for other performance measures such as the Brier score and the logarithmic loss (for classification) and the mean squared error (for regression); (iii) illustrating the extent of the problem through an application to a large number (n = 306) of datasets from the public database OpenML; (iv) finally arguing in favor of setting it to a computationally feasible large number, depending on convergence properties of the desired performance measure.

Motivation & Objective

Address whether the number of trees T in random forests should be tuned or set to a large feasible value.
Provide theoretical characterization of how the expected error rate behaves as T grows.
Empirically assess the prevalence of non-monotonic error rate patterns across many datasets.
Offer guidance on practical T selection and introduce the OOBCurve tool for assessing convergence.

Proposed method

Derive theoretical expressions for expected performance measures (error rate, Brier score, log loss) as functions of T, using per-observation prediction difficulty ε_i.
Show that, for classification, the expected error rate can be non-monotonic in T, whereas the expected Brier score and log loss are strictly decreasing in T.
Analyze the behavior of the AUC, which may also be non-monotonic in T, and adapt the theoretical results to the out-of-bag (OOB) error setting.
Conduct a large-scale empirical study on 193 classification tasks and 113 regression tasks from OpenML (306 datasets in total), growing 2000 trees per forest and repeating each run with 1000 random seeds to obtain OOB curves.
Provide an R package OOBCurve to compute OOB curves for various measures.
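
The role of ε_i in the first two steps can be written out as a short worked derivation. This is a sketch under a simplifying assumption, not a transcription of the paper's notation: each tree is taken to misclassify observation i independently with probability ε_i, so the number of wrong votes X_i among T trees is binomial. The symbols X_i, e_i(T) and b_i(T) are labels introduced here for illustration. The tie term applies only when T is even.

```latex
% Assumption: X_i ~ Binomial(T, \varepsilon_i) counts the trees that misclassify observation i.
\begin{align*}
  % Majority vote is wrong when more than half of the trees are wrong; ties are split evenly.
  \mathbb{E}\,[e_i(T)] &= P\!\left(X_i > \tfrac{T}{2}\right) + \tfrac{1}{2}\,P\!\left(X_i = \tfrac{T}{2}\right), \\
  % Binary Brier score: squared fraction of wrong votes.
  \mathbb{E}\,[b_i(T)] &= \mathbb{E}\!\left[\left(\tfrac{X_i}{T}\right)^{2}\right]
     = \operatorname{Var}\!\left(\tfrac{X_i}{T}\right) + \left(\mathbb{E}\,\tfrac{X_i}{T}\right)^{2}
     = \frac{\varepsilon_i(1-\varepsilon_i)}{T} + \varepsilon_i^{2}.
\end{align*}
```

Under this assumption the Brier term shrinks like 1/T for every ε_i strictly between 0 and 1, whereas the direction in which the error-rate term moves depends on whether ε_i lies below or above 0.5.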

Experimental results

Research questions

RQ1: Is the expected classification error rate as a function of the number of trees T monotonic or can it be non-monotonic under certain data conditions?
RQ2: Do other performance measures (Brier score, logarithmic loss, MSE, AUC) exhibit monotonic behavior in T, and under what circumstances?
RQ3: How prevalent are non-monotonic error-rate patterns in real data, and what dataset characteristics predict them?
RQ4: Should practitioners tune T or simply use a large, computationally feasible T based on convergence properties?
RQ5: Can the OOBCurve tool aid in assessing convergence and guide T selection?

Key findings

For a single observation, the expected classification error rate falls with T when ε_i < 0.5 and rises with T when ε_i > 0.5; averaged over a dataset that mixes both kinds of observations, the error-rate curve can therefore be non-monotonic in T.
For binary classification, the Brier score and logarithmic loss are strictly decreasing in T on average, while AUC can be non-monotonic in some cases.
For regression, the mean squared error decreases with T, while some median-based errors may show non-monotonicity in certain regions.
Empirically, about 10% of OpenML datasets showed non-monotonic OOB error-rate curves, often with ε_i values near 0.5 driving the effect.
Non-monotonic patterns are more common for small datasets; with 2000 trees, the OOB curves had largely converged.
The study supports using a large, computationally feasible T rather than tuning T, with the choice guided by how quickly the desired performance measure converges.
An R package OOBCurve is introduced to compute OOB curves for multiple performance measures.
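
The contrast between the error rate and the Brier score in the findings above can be checked numerically. The following is a minimal sketch, assuming each tree misclassifies an observation independently with probability ε_i, so that the number of wrong votes is binomial. The function names and the toy ε values (80 observations at 0.3, 20 at 0.55) are hypothetical and chosen only for illustration; they are not taken from the paper or from the OOBCurve package.

```python
import math


def binom_pmf(k, T, eps):
    """P(X = k) for X ~ Binomial(T, eps), computed in log space.

    Assumes 0 < eps < 1 so that both logarithms are defined.
    """
    log_p = (math.lgamma(T + 1) - math.lgamma(k + 1) - math.lgamma(T - k + 1)
             + k * math.log(eps) + (T - k) * math.log(1.0 - eps))
    return math.exp(log_p)


def expected_error(eps, T):
    """Probability that a majority vote of T trees is wrong for one observation.

    Ties (possible only for even T) are counted as wrong half of the time.
    """
    total = 0.0
    for k in range(T + 1):
        if 2 * k > T:
            total += binom_pmf(k, T, eps)
        elif 2 * k == T:
            total += 0.5 * binom_pmf(k, T, eps)
    return total


def expected_brier(eps, T):
    """Expected squared fraction of wrong votes, E[(X / T)^2]."""
    return sum(binom_pmf(k, T, eps) * (k / T) ** 2 for k in range(T + 1))


def mean_over(eps_values, T, measure):
    """Average a per-observation measure over a list of eps values."""
    return sum(measure(e, T) for e in eps_values) / len(eps_values)


# Hypothetical dataset: 80 fairly easy observations, 20 slightly harder than a coin flip.
EPS = [0.3] * 80 + [0.55] * 20

if __name__ == "__main__":
    for T in (1, 11, 51, 201, 1001):
        err = mean_over(EPS, T, expected_error)
        brier = mean_over(EPS, T, expected_brier)
        print(f"T={T:5d}  error rate={err:.4f}  Brier={brier:.4f}")
```

With these toy values the averaged error rate starts at 0.35 for T = 1, dips below 0.2 for moderate T, and then climbs back toward 0.2 (the share of observations with ε_i > 0.5), while the averaged Brier score keeps decreasing. Real forests have correlated trees, so this only illustrates the mechanism and does not predict curves on actual data.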

This review was created by AI and reviewed by human editors.