[Paper Review] To tune or not to tune the number of trees in random forest?
The paper theoretically and empirically shows that the expected error rate of a random forest can be non-monotonic in the number of trees for classification, while Brier score, log loss, and regression MSE are monotonic in T; it argues against tuning T and recommends using a large, computationally feasible T.
The number of trees T in the random forest (RF) algorithm for supervised learning has to be set by the user. It is controversial whether T should simply be set to the largest computationally manageable value or whether a smaller T may in some cases be better. While the principle underlying bagging is that "more trees are better", in practice the classification error rate sometimes reaches a minimum before increasing again for increasing number of trees. The goal of this paper is four-fold: (i) providing theoretical results showing that the expected error rate may be a non-monotonous function of the number of trees and explaining under which circumstances this happens; (ii) providing theoretical results showing that such non-monotonous patterns cannot be observed for other performance measures such as the Brier score and the logarithmic loss (for classification) and the mean squared error (for regression); (iii) illustrating the extent of the problem through an application to a large number (n = 306) of datasets from the public database OpenML; (iv) finally arguing in favor of setting it to a computationally feasible large number, depending on convergence properties of the desired performance measure.
Motivation & Objective
- Address whether the number of trees T in random forests should be tuned or set to a large feasible value.
- Provide theoretical characterization of how the expected error rate behaves as T grows.
- Empirically assess the prevalence of non-monotonic error rate patterns across many datasets.
- Offer guidance on practical T selection and introduce the OOBCurve tool for assessing convergence.
Proposed method
- Derive theoretical expressions for expected performance measures (error rate, Brier score, log loss) as functions of T, using per-observation prediction difficulty ε_i.
- Show that the error rate can be non-monotonic in T for classification, while Brier score and log loss are strictly decreasing in T and AUC may be non-monotonic.
- Analyze AUC behavior and adapt models to out-of-bag (OOB) error context.
- Conduct a large-scale empirical study on 306 OpenML tasks (193 classification, 113 regression), fitting forests of 2000 trees under 1000 random seeds and inspecting the resulting OOB curves.
- Provide an R package OOBCurve to compute OOB curves for various measures.
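The non-monotonicity argument can be illustrated with a small numerical sketch. It is not the paper's code: it assumes that ε_i can be read as the probability that a single tree misclassifies observation i, that trees vote independently, and that exact ties count as half an error. The function names and the ε values are hypothetical.

```python
from math import comb


def expected_error(eps: float, T: int) -> float:
    """Probability that a majority vote of T trees misclassifies one observation.

    Assumption for illustration: each tree errs independently with
    probability eps; exact ties (possible only for even T) count as half an error.
    """
    total = 0.0
    for k in range(T + 1):  # k = number of trees voting for the wrong class
        p_k = comb(T, k) * eps**k * (1 - eps) ** (T - k)
        if 2 * k > T:
            total += p_k
        elif 2 * k == T:
            total += 0.5 * p_k
    return total


def mean_error(eps_values, T):
    """Average expected error over observations with difficulties eps_values."""
    return sum(expected_error(e, T) for e in eps_values) / len(eps_values)


# Hypothetical dataset: half "easy" (eps = 0.3), half "hard" (eps = 0.55).
eps_values = [0.3] * 5 + [0.55] * 5
curve = {T: mean_error(eps_values, T) for T in (1, 11, 51, 501)}
for T, err in curve.items():
    print(f"T={T:4d}  expected error rate={err:.3f}")
```

In this toy setting the easy observations converge quickly towards zero error, while the hard ones drift slowly towards certain misclassification, so the averaged curve first falls and then rises again.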
Experimental results
Research questions
- RQ1: Is the expected classification error rate as a function of the number of trees T monotonic or can it be non-monotonic under certain data conditions?
- RQ2: Do other performance measures (Brier score, logarithmic loss, MSE, AUC) exhibit monotonic behavior in T, and under what circumstances?
- RQ3: How prevalent are non-monotonic error-rate patterns in real data, and what dataset characteristics predict them?
- RQ4: Should practitioners tune T or simply use a large, computationally feasible T based on convergence properties?
- RQ5: Can the OOBCurve tool aid in assessing convergence and guide T selection?
Key findings
- The expected classification error rate of an individual observation can increase with T (when ε_i > 0.5), so the error rate averaged over a dataset's observations can be non-monotonic in T.
- For binary classification, the Brier score and logarithmic loss are strictly decreasing in T on average, while AUC can be non-monotonic in some cases.
- For regression, the mean squared error decreases with T, while some median-based errors may show non-monotonicity in certain regions.
- Empirically, about 10% of OpenML datasets showed non-monotonic OOB error-rate curves, often with ε_i values near 0.5 driving the effect.
- Non-monotonic patterns are more common for small datasets; with 2000 trees, the OOB curves had largely converged.
- The study supports recommending using a computationally feasible large T rather than tuning T, aided by convergence diagnostics of the desired performance measure.
- An R package OOBCurve is introduced to compute OOB curves for multiple performance measures.
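The contrast between the error rate and the Brier score can be made concrete with a short derivation. This is a sketch under simplifying assumptions that are not taken from the review itself: binary classification, independent trees, and ε_i read as the probability that a single tree votes for the wrong class. Let K_i be the number of wrong votes among T trees, so that the predicted probability of the wrong class is K_i / T and the squared-error contribution of observation i is (K_i / T)^2.

```latex
\begin{align*}
\hat{p}_i &= \frac{K_i}{T}, \qquad K_i \sim \mathrm{Binomial}(T, \varepsilon_i), \\
\mathbb{E}\big[\hat{p}_i^{\,2}\big]
  &= \operatorname{Var}(\hat{p}_i) + \big(\mathbb{E}[\hat{p}_i]\big)^2
   = \frac{\varepsilon_i (1 - \varepsilon_i)}{T} + \varepsilon_i^2 .
\end{align*}
```

For any 0 < ε_i < 1 the first term shrinks as T grows, so under these assumptions the expected Brier contribution decreases strictly towards ε_i^2 for every observation, including those with ε_i > 0.5. This is why averaging over observations cannot produce the rise seen in the error rate.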
This review was created by AI and reviewed by human editors.