QUICK REVIEW

[Paper Review] Randomization as Regularization: A Degrees of Freedom Explanation for Random Forest Success

Lucas Mentch, Siyu Zhou|arXiv (Cornell University)|Oct 31, 2019

Gaussian Processes and Bayesian Inference|Cited by 42

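One-Sentence Summary

The paper argues that the extra randomness in random forests acts as implicit regularization, reducing the degrees of freedom and improving performance in low signal-to-noise ratio (SNR) settings, and it demonstrates the effect through simulations and a linear-model analogue.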

ABSTRACT

Random forests remain among the most popular off-the-shelf supervised machine learning tools with a well-established track record of predictive accuracy in both regression and classification settings. Despite their empirical success as well as a bevy of recent work investigating their statistical properties, a full and satisfying explanation for their success has yet to be put forth. Here we aim to take a step forward in this direction by demonstrating that the additional randomness injected into individual trees serves as a form of implicit regularization, making random forests an ideal model in low signal-to-noise ratio (SNR) settings. Specifically, from a model-complexity perspective, we show that the mtry parameter in random forests serves much the same purpose as the shrinkage penalty in explicitly regularized regression procedures like lasso and ridge regression. To highlight this point, we design a randomized linear-model-based forward selection procedure intended as an analogue to tree-based random forests and demonstrate its surprisingly strong empirical performance. Numerous demonstrations on both real and synthetic data are provided.

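Motivation and Objectives

Explain why random forests perform so well, going beyond traditional accounts such as interpolation or variance reduction.
Quantify how the mtry parameter affects model complexity (degrees of freedom) in random forests.
Show that randomness yields larger gains in low-SNR settings.
Show that randomized forward selection in linear models has a regularization effect similar to the one seen in forests.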

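Proposed Method

Formulate random forests with their explicit sources of randomness: data resampling and feature subsampling (mtry).
Define the degrees of freedom of an estimator as df(f̂) = (1/σ^2) ∑_{i=1}^{n} Cov(ŷ_i, y_i), where σ^2 is the noise variance.
Estimate the degrees of freedom of forests via Monte Carlo trials across different values of maxnodes and mtry.
Compare random forests with bagging and with a randomized forward selection analogue in linear models.
Evaluate performance under varying SNR using synthetic data (linear and MARS-like) and experiments inspired by real data.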
Situate the results against prior work on interpolation and regularization.
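The Monte Carlo estimate of degrees of freedom described above can be sketched generically. The snippet below is a minimal illustration, not the authors' code: it redraws the responses many times on a fixed design, refits, and plugs the sample covariances into the definition of df. A k-nearest-neighbor smoother stands in for the forest as an assumption made purely for checkability, since a linear smoother that averages k responses (including the point itself) has df exactly n/k. The names `knn_smoother` and `estimate_df` are hypothetical.

```python
import random

def knn_smoother(x, y, k):
    """Predict each y_i as the mean of the k nearest training responses (self included)."""
    preds = []
    for xi in x:
        # Indices of the k training points closest to xi.
        nearest = sorted(range(len(x)), key=lambda j: abs(x[j] - xi))[:k]
        preds.append(sum(y[j] for j in nearest) / k)
    return preds

def estimate_df(fit_predict, x, f_true, sigma, n_reps=300, seed=0):
    """Monte Carlo estimate of df = (1/sigma^2) * sum_i Cov(yhat_i, y_i)."""
    rng = random.Random(seed)
    n = len(x)
    ys, yhats = [], []
    for _ in range(n_reps):
        # Fixed design x, fresh Gaussian noise on every replicate.
        y = [f_true(xi) + rng.gauss(0.0, sigma) for xi in x]
        ys.append(y)
        yhats.append(fit_predict(x, y))
    total_cov = 0.0
    for i in range(n):
        y_mean = sum(y[i] for y in ys) / n_reps
        h_mean = sum(h[i] for h in yhats) / n_reps
        # Sample covariance between the fitted value and the response at point i.
        total_cov += sum((y[i] - y_mean) * (h[i] - h_mean)
                         for y, h in zip(ys, yhats)) / (n_reps - 1)
    return total_cov / sigma ** 2

# Illustrative run: 20 grid points, a linear signal, unit noise.
x = [i / 19 for i in range(20)]
df_by_k = {k: estimate_df(lambda xs, ys: knn_smoother(xs, ys, k),
                          x, lambda t: 2.0 * t, sigma=1.0)
           for k in (1, 5, 20)}
for k, df in df_by_k.items():
    print(f"k={k:2d}  estimated df={df:.2f}  theoretical n/k={20 / k:.2f}")
```

To mirror the experiment in the review, `fit_predict` would be replaced by a function that fits a random forest at a given maxnodes and mtry and returns its in-sample predictions; in scikit-learn, `max_leaf_nodes` and `max_features` on `RandomForestRegressor` play roughly those roles.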

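Experimental Results

Research Questions

RQ1: How does the mtry parameter affect the degrees of freedom of a random forest?
RQ2: In which SNR ranges do random forests offer the largest predictive gains over non-randomized methods such as bagging?
RQ3: Does a randomized forward selection procedure in linear models show a regularization effect similar to that of random forests?
RQ4: In low-SNR settings, are the gains from randomness driven mainly by variance reduction, bias reduction, or a combination of both?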

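Key Findings

Increasing maxnodes raises the forest's degrees of freedom, which grow in a concave, increasing fashion.
At a fixed maxnodes, higher mtry values yield more degrees of freedom than lower ones.
Random forests show a more pronounced advantage over bagging in low-SNR settings, and the advantage diminishes at high SNR.
The optimal mtry is positively correlated with SNR, consistent with randomness acting as regularization.
The randomized forward selection analogue for linear models shows similar regularization benefits in noisy, low-dimensional settings.
Randomness acts as an implicit regularizer, analogous to the shrinkage penalty in explicit regularization methods.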
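The shrinkage effect of randomization in a linear model can be seen in a small sketch. The code below is an illustrative assumption about what a randomized forward selection procedure might look like, modeled on how random forests combine bagging with mtry: each bootstrap sample runs forward selection to a fixed depth, only a random subset of `mtry` not-yet-selected features is eligible at each step, and coefficients are averaged across bootstrap fits. It is not the authors' implementation; the function names, the no-intercept fit, and the tiny diagonal stabilizer are all choices made for this example.

```python
import random

def ols(X, y, cols):
    """Least-squares coefficients of y on columns `cols` (no intercept), via normal equations."""
    m = len(cols)
    # Augmented system [X'X | X'y]; a tiny diagonal term guards against singularity.
    A = [[sum(r[a] * r[b] for r in X) + (1e-8 if a == b else 0.0) for b in cols]
         + [sum(r[a] * t for r, t in zip(X, y))] for a in cols]
    for c in range(m):  # Gauss-Jordan elimination with partial pivoting
        p = max(range(c, m), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(m):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [u - f * v for u, v in zip(A[r], A[c])]
    return [A[i][m] / A[i][i] for i in range(m)]

def rss(X, y, cols):
    """Residual sum of squares of the least-squares fit on `cols`."""
    beta = ols(X, y, cols)
    return sum((t - sum(b * r[c] for b, c in zip(beta, cols))) ** 2
               for r, t in zip(X, y))

def randomized_forward_selection(X, y, depth, mtry, n_boot=50, seed=0):
    """Average coefficients over bootstrap fits of forward selection with random candidates."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    avg = [0.0] * p
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap resample
        Xb, yb = [X[i] for i in idx], [y[i] for i in idx]
        selected = []
        for _ in range(depth):
            remaining = [j for j in range(p) if j not in selected]
            # Only a random subset of size mtry is eligible at this step.
            candidates = rng.sample(remaining, min(mtry, len(remaining)))
            best = min(candidates, key=lambda j: rss(Xb, yb, selected + [j]))
            selected.append(best)
        for j, b in zip(selected, ols(Xb, yb, selected)):
            avg[j] += b / n_boot  # unselected features contribute zero
    return avg

# Synthetic data: two true signals (features 0 and 1) among six features.
rng = random.Random(1)
X = [[rng.gauss(0, 1) for _ in range(6)] for _ in range(60)]
y = [3.0 * r[0] - 2.0 * r[1] + rng.gauss(0, 0.1) for r in X]
bagged = randomized_forward_selection(X, y, depth=2, mtry=6)      # all features eligible
randomized = randomized_forward_selection(X, y, depth=2, mtry=1)  # one random candidate per step
print("mtry=6:", [round(b, 2) for b in bagged])
print("mtry=1:", [round(b, 2) for b in randomized])
```

With `mtry` equal to the number of features, the procedure reduces to bagged forward selection and the averaged coefficients stay near the true values. With a small `mtry`, a true signal is often ineligible and contributes zero in many bootstrap fits, so its averaged coefficient is pulled toward zero, much as a ridge penalty λ scales least-squares coefficients by 1/(1+λ) under an orthonormal design.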


This review was generated by AI and reviewed by human editors.