QUICK REVIEW

[Paper Review] There is no Double-Descent in Random Forests

Sebastian Buschjäger, Katharina Morik|arXiv (Cornell University)|Nov 8, 2021

Machine Learning and Data Classification | Cited by 3

One-Sentence Summary

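This paper challenges the widely cited claim that random forests (RFs) exhibit double-descent generalization behavior, showing instead that an RF's test error descends only once as model complexity grows. The authors demonstrate that RFs do not overfit in the classic sense even when trained on data generated by an overfitted decision tree, and they introduce Negative Correlation Forest (NCForest) to verify empirically that the best performance comes from a balance between bias and diversity rather than from model capacity.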

ABSTRACT

Random Forests (RFs) are among the state-of-the-art in machine learning and offer excellent performance with nearly zero parameter tuning. Remarkably, RFs seem to be impervious to overfitting even though their basic building blocks are well-known to overfit. Recently, a broadly received study argued that a RF exhibits a so-called double-descent curve: First, the model overfits the data in a u-shaped curve and then, once a certain model complexity is reached, it suddenly improves its performance again. In this paper, we challenge the notion that model capacity is the correct tool to explain the success of RF and argue that the algorithm which trains the model plays a more important role than previously thought. We show that a RF does not exhibit a double-descent curve but rather has a single descent. Hence, it does not overfit in the classic sense. We further present a RF variation that also does not overfit although its decision boundary approximates that of an overfitted DT. Similarly, we show that a DT which approximates the decision boundary of a RF will still overfit. Last, we study the diversity of an ensemble as a tool to estimate its performance. To do so, we introduce Negative Correlation Forest (NCForest) which allows for precise control over the diversity in the ensemble. We show that the diversity and the bias indeed have a crucial impact on the performance of the RF. Having too low diversity collapses the performance of the RF into a single tree, whereas having too much diversity means that most trees do not produce correct outputs anymore. However, in-between these two extremes we find a large range of different trade-offs with all roughly equal performance. Hence, the specific trade-off between bias and diversity does not matter as long as the algorithm reaches this good trade-off regime.

Research Motivation and Objectives

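Challenge the claim that random forests exhibit the double-descent generalization behavior previously reported for deep neural networks.
Investigate whether model capacity or the training algorithm is the main driver of random forests' robustness to overfitting.
Assess the role of bias and diversity in ensemble performance, in particular their relationship to generalization error.
Develop and validate a new algorithm, Negative Correlation Forest (NCForest), that allows controlled adjustment of the diversity in a tree ensemble.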

Proposed Method

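The authors use Rademacher complexity as the measure of model complexity, but argue that the average number of decision nodes per tree is a more suitable measure than the total number of nodes in the forest.
They compare the test-error curves of random forests and decision trees across tree depths and dataset sizes, showing that RFs have a single descent while DTs show the classic U-shaped overfitting curve.
They propose Negative Correlation Forest (NCForest), a modified RF algorithm that explicitly controls the diversity among trees through a correlation penalty.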
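They decompose the ensemble loss into a bias term and a diversity term: $\ell(f(x), y) = \frac{1}{M}\sum_{i=1}^{M} \ell(h_i(x), y) - \frac{1}{2M}\sum_{i=1}^{M} d_i^\top D\, d_i$, where $M$ is the number of trees, $h_i$ is tree $i$, $f$ is the ensemble, $d_i = h_i(x) - f(x)$ is the deviation of tree $i$ from the ensemble prediction, and $D$ is a loss-dependent matrix. The first term is the bias (the average loss of the individual trees); the second is the diversity, which is subtracted and therefore lowers the ensemble loss.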
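They run experiments on several datasets (Adult, Bank, EEG, Magic, Nomao), averaging results over 5-fold cross-validation, to evaluate performance at different diversity levels.
They analyze the relationship between diversity, bias, and test error, identifying a "bathtub-shaped" relationship in which too little or too much diversity degrades performance.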
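
The bias and diversity decomposition can be checked numerically for the squared-error case. The sketch below is illustrative only: it assumes scalar predictions and the loss (f - y)^2, for which the second derivative is the constant 2, so the diversity term reduces to the mean squared deviation of the trees from the ensemble. The function names (`decompose`, `ncl_objective`) and the weight `lam` are hypothetical. The weighted objective is one common form of a negative-correlation penalty and is assumed here; it is not necessarily the exact NCForest objective.

```python
def squared_error(pred, y):
    """Squared-error loss for a single scalar prediction."""
    return (pred - y) ** 2


def decompose(tree_preds, y):
    """Return (ensemble_loss, bias, diversity) for one sample.

    bias      = average loss of the individual trees
    diversity = 1/(2M) * sum_i d_i * D * d_i with D = 2 (squared error),
                i.e. the mean squared deviation from the ensemble prediction
    """
    m = len(tree_preds)
    ensemble_pred = sum(tree_preds) / m
    bias = sum(squared_error(h, y) for h in tree_preds) / m
    diversity = sum((h - ensemble_pred) ** 2 for h in tree_preds) / m
    return squared_error(ensemble_pred, y), bias, diversity


def ncl_objective(tree_preds, y, lam):
    """Hypothetical diversity-weighted objective: bias - lam * diversity.

    lam = 0 scores every tree on its own; lam = 1 recovers the ensemble loss.
    """
    _, bias, diversity = decompose(tree_preds, y)
    return bias - lam * diversity


if __name__ == "__main__":
    preds, target = [0.2, 0.9, 0.4], 1.0
    loss, bias, diversity = decompose(preds, target)
    print(f"ensemble loss    = {loss:.4f}")
    print(f"bias - diversity = {bias - diversity:.4f}")
```

If all trees predict the same value, the diversity term is zero and the ensemble loss equals the loss of a single tree, which matches the collapse into a single tree that the abstract describes for too little diversity.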

Experimental Results

Research Questions

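RQ1: As model complexity increases, do random forests show a double-descent curve in test error, as previously claimed?
RQ2: How does a random forest trained on data generated by an overfitted decision tree perform compared with the original overfitted decision tree?
RQ3: Does a decision tree trained on data generated by a well-performing random forest avoid overfitting, or does it still overfit despite its well-behaved source model?
RQ4: Is Rademacher complexity a reliable predictor of generalization performance in random forests?
RQ5: What is the optimal balance between bias and diversity in a tree ensemble, and how does it affect generalization error?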

Key Findings

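Random forests do not exhibit double descent; instead, their test error descends only once as model complexity increases, so there is no overfitting in the classic sense.
Random forests do not overfit even when trained on data generated by an overfitted decision tree, which indicates that the ensemble algorithm prevents overfitting regardless of how the base learner behaves.
A decision tree trained on data generated by a well-performing random forest still overfits, which indicates that the forest's good generalization does not transfer to a single tree trained to mimic it.
Rademacher complexity fails as a predictor of performance: a decision tree with lower complexity than a random forest can have a markedly worse test error.
In NCForest, a wide range of diversity levels gives similar performance, so the exact bias-diversity trade-off matters less than reaching the balanced regime.
The best random forest performance comes from a balanced trade-off between bias and diversity; too little or too much diversity degrades performance.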
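
A comparison in the spirit of the first finding can be sketched with scikit-learn. This is a minimal sketch under stated assumptions, not the authors' experimental setup: the data is a synthetic stand-in with 10% label noise rather than one of the paper's datasets, complexity is varied through the per-tree `max_leaf_nodes` budget, and the helper name `error_curves` is hypothetical. No particular curve shape is guaranteed on this toy data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def error_curves(leaf_budgets, n_trees=25, seed=0):
    """Return test error of a single DT and of an RF for each leaf budget."""
    # Synthetic stand-in data; flip_y=0.1 randomly reassigns about 10% of labels.
    X, y = make_classification(
        n_samples=1000, n_features=20, n_informative=5,
        flip_y=0.1, random_state=seed,
    )
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.5, random_state=seed
    )
    curves = {"dt": [], "rf": []}
    for k in leaf_budgets:
        # Same per-tree complexity budget for the single tree and the forest.
        dt = DecisionTreeClassifier(max_leaf_nodes=k, random_state=seed)
        rf = RandomForestClassifier(
            n_estimators=n_trees, max_leaf_nodes=k, random_state=seed
        )
        dt.fit(X_tr, y_tr)
        rf.fit(X_tr, y_tr)
        # score() returns accuracy, so test error is 1 - accuracy.
        curves["dt"].append(1.0 - dt.score(X_te, y_te))
        curves["rf"].append(1.0 - rf.score(X_te, y_te))
    return curves


if __name__ == "__main__":
    budgets = [2, 4, 8, 16, 32, 64, 128, 256]
    curves = error_curves(budgets)
    for k, e_dt, e_rf in zip(budgets, curves["dt"], curves["rf"]):
        print(f"max_leaf_nodes={k:4d}  DT error={e_dt:.3f}  RF error={e_rf:.3f}")
```

Using the per-tree leaf budget as the x-axis corresponds to measuring complexity per tree rather than by the total number of nodes in the forest.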


This review was generated by AI and checked by human editors.