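[Paper Summary] A Debiased MDI Feature Importance Measure for Random Forests
This paper analyzes the finite-sample bias of MDI in Random Forests and proposes MDI-oob, an MDI measure computed on out-of-bag samples to reduce that bias. The goal is better feature selection, and the paper reports improved performance on simulated data and a genomic ChIP dataset.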
Tree ensembles such as Random Forests have achieved impressive empirical success across a wide variety of applications. To understand how these models make predictions, people routinely turn to feature importance measures calculated from tree ensembles. It has long been known that Mean Decrease Impurity (MDI), one of the most widely used measures of feature importance, incorrectly assigns high importance to noisy features, leading to systematic bias in feature selection. In this paper, we address the feature selection bias of MDI from both theoretical and methodological perspectives. Based on the original definition of MDI by Breiman et al. for a single tree, we derive a tight non-asymptotic bound on the expected bias of MDI importance of noisy features, showing that deep trees have higher (expected) feature selection bias than shallow ones. However, it is not clear how to reduce the bias of MDI using its existing analytical expression. We derive a new analytical expression for MDI, and based on this new expression, we are able to propose a debiased MDI feature importance measure using out-of-bag samples, called MDI-oob. For both the simulated data and a genomic ChIP dataset, MDI-oob achieves state-of-the-art performance in feature selection from Random Forests for both deep and shallow trees.
Research Motivation and Objectives
- Characterize non-asymptotic bias of MDI in finite-sample Random Forests.
- Derive a new analytical expression for MDI to enable bias reduction.
- Propose MDI-oob, an out-of-bag based MDI measure for debiased feature importance.
- Demonstrate performance of MDI-oob against other importance measures on simulated and genomic data.
Proposed Method
- Review of MDI definition for a single tree and ensemble (Breiman et al.).
- Derivation of a non-asymptotic upper bound on the expected bias of MDI for noisy features under mild assumptions.
- Introduction of a new analytical expression for MDI via a function f_{T,k}(X) linking MDI to sample covariance with y.
- Proposal of MDI-oob by computing MDI using out-of-bag samples and the new MDI expression.
- Theoretical discussion on how tree depth and minimum leaf size m_n influence the bias term G_0(T).
- Empirical evaluation on simulated data and a genomic ChIP dataset comparing MDI-oob to other feature-importance measures.
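The covariance form of MDI mentioned in the list above can be written out as a short derivation sketch. The notation is an assumption chosen for illustration: $\mathcal{D}^{(T)}$ is the in-bag sample of size $n$ used to grow tree $T$, $I(T)$ its inner nodes, $v(t)$ the feature split at node $t$, $R_t$ the region of node $t$, $N_n(t)$ and $\mu_n(t)$ the in-bag count and mean response in $R_t$, and $\Delta(t)$ the variance impurity decrease at $t$. It may differ in detail from the paper's own statement.

```latex
% Classical MDI of feature k in tree T (weighted impurity decrease):
\mathrm{MDI}(k, T) = \sum_{t \in I(T):\, v(t) = k} \frac{N_n(t)}{n}\, \Delta(t)

% Feature-specific function built from in-bag node means:
f_{T,k}(x) = \sum_{t \in I(T):\, v(t) = k}
  \Big[ \mu_n(t^{\mathrm{left}})\, \mathbf{1}(x \in R_{t^{\mathrm{left}}})
      + \mu_n(t^{\mathrm{right}})\, \mathbf{1}(x \in R_{t^{\mathrm{right}}})
      - \mu_n(t)\, \mathbf{1}(x \in R_t) \Big]

% Same quantity as a sample covariance over the in-bag data
% (f_{T,k} sums to zero over the in-bag sample):
\mathrm{MDI}(k, T) = \frac{1}{n} \sum_{i \in \mathcal{D}^{(T)}} f_{T,k}(x_i)\, y_i

% Out-of-bag version: keep f_{T,k}, swap the evaluation sample:
\mathrm{MDI\text{-}oob}(k, T) =
  \frac{1}{|\mathcal{D} \setminus \mathcal{D}^{(T)}|}
  \sum_{i \in \mathcal{D} \setminus \mathcal{D}^{(T)}} f_{T,k}(x_i)\, y_i
```

The third line follows because, for variance impurity, $N_n(t)\,\Delta(t) = N_n(t^{\mathrm{left}})\mu_n(t^{\mathrm{left}})^2 + N_n(t^{\mathrm{right}})\mu_n(t^{\mathrm{right}})^2 - N_n(t)\mu_n(t)^2$, and each squared-mean term equals the sum of $\mu_n(\cdot)\,y_i$ over the samples in that node.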
Experimental Results
Research Questions
- RQ1: How large is the finite-sample bias of MDI for noisy features in Random Forests with varying leaf sizes and depths?
- RQ2: Can a new analytical representation of MDI enable debiasing using out-of-bag samples?
- RQ3: Does MDI-oob improve feature selection performance compared to standard MDI and other measures in simulations and real genomic data?
- RQ4: How do tree depth and minimum leaf size affect bias and the effectiveness of debiasing?
- RQ5: How does MDI-oob compare with SHAP, MDA, cforest, and other feature-importance measures in terms of AUC-based noisy feature identification?
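Key Findings
- MDI feature importance for noisy features grows with deeper trees and smaller leaves, indicating finite-sample bias: the expected bias admits a tight bound proportional to d_n log(np)/m_n, where d_n is the tree depth, m_n the minimum leaf size, n the sample size, and p the number of features.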
- A new analytical expression shows MDI as a sample covariance between y and a feature-specific function f_{T,k}(X), enabling out-of-bag based evaluation.
- MDI-oob computes MDI using out-of-bag samples, reducing bias and achieving state-of-the-art feature selection performance in simulations and genomic data.
- MDI-oob often yields 5–10% higher AUC scores for feature selection compared to other measures in both deep and shallow trees.
- MDI-oob demonstrates strong performance on a simulated dataset with discrete features and on a genomic ChIP dataset, outperforming the importance measures available in several packages (party, ranger, scikit-learn).
- The work connects MDI-oob to honest estimation concepts and highlights potential extensions to correlated features and tightened theoretical bounds.
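The covariance identity and the out-of-bag evaluation described in the findings above can be checked numerically on the smallest possible case. The following is a minimal sketch, not the authors' implementation: it assumes a single-split tree (a stump) on one pure-noise feature with variance impurity, and every function name is hypothetical. It shows that the in-bag impurity decrease equals the mean of f(x_i) * y_i over the in-bag data, and that the same f can be evaluated on held-out points that stand in for out-of-bag samples.

```python
import random
from statistics import fmean


def variance(ys):
    """Variance impurity of a node: mean squared deviation from the node mean."""
    mu = fmean(ys)
    return fmean((y - mu) ** 2 for y in ys)


def impurity_decrease(xs, ys, thr):
    """Classical MDI of a root split at `thr` (the node weight N(t)/n is 1 at the root)."""
    left = [y for x, y in zip(xs, ys) if x <= thr]
    right = [y for x, y in zip(xs, ys) if x > thr]
    n = len(ys)
    return variance(ys) - (len(left) * variance(left) + len(right) * variance(right)) / n


def best_threshold(xs, ys, min_leaf=5):
    """Greedy split search: the threshold with the largest in-bag impurity decrease."""
    best_thr, best_gain = None, float("-inf")
    for thr in sorted(set(xs)):
        n_left = sum(x <= thr for x in xs)
        if n_left < min_leaf or len(xs) - n_left < min_leaf:
            continue  # respect the minimum leaf size
        gain = impurity_decrease(xs, ys, thr)
        if gain > best_gain:
            best_thr, best_gain = thr, gain
    return best_thr


def make_f(xs, ys, thr):
    """f_{T,k} for a stump: in-bag child mean minus in-bag parent mean."""
    mu = fmean(ys)
    mu_left = fmean(y for x, y in zip(xs, ys) if x <= thr)
    mu_right = fmean(y for x, y in zip(xs, ys) if x > thr)
    return lambda x: (mu_left if x <= thr else mu_right) - mu


def covariance_form(f, xs, ys):
    """Mean of f(x_i) * y_i over whichever sample is passed in."""
    return fmean(f(x) * y for x, y in zip(xs, ys))


# Assumed toy setup: the feature is pure noise, independent of the response.
rng = random.Random(0)
n = 200
x_in = [rng.random() for _ in range(n)]
y_in = [rng.gauss(0.0, 1.0) for _ in range(n)]
x_oob = [rng.random() for _ in range(n)]  # held-out points standing in for OOB samples
y_oob = [rng.gauss(0.0, 1.0) for _ in range(n)]

thr = best_threshold(x_in, y_in)
f = make_f(x_in, y_in, thr)
mdi_in_bag = impurity_decrease(x_in, y_in, thr)  # classical MDI, never negative
mdi_cov = covariance_form(f, x_in, y_in)         # same number, covariance form
mdi_oob = covariance_form(f, x_oob, y_oob)       # same f, held-out data

print(f"in-bag MDI (impurity decrease): {mdi_in_bag:.6f}")
print(f"in-bag MDI (covariance form):   {mdi_cov:.6f}")
print(f"held-out version:               {mdi_oob:.6f}")
```

Because the split is chosen to maximize the in-bag decrease, the in-bag value is positive even for a noise feature, which is the bias discussed above. The held-out value reuses the fitted f but not the data that selected the split, so it can be negative and is not inflated by the split search.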
This summary was generated by AI and reviewed by human editors.