QUICK REVIEW

[Paper Review] A Simple and Effective Model-Based Variable Importance Measure

Brandon Greenwell, Bradley C. Boehmke | arXiv (Cornell University) | May 12, 2018

Data Analysis with R | References: 16 | Cited by: 69

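ONE-SENTENCE SUMMARY

The paper proposes a standardized, model-based variable importance measure built on partial dependence plots (PDPs) that applies to a wide range of supervised learning algorithms, and demonstrates it on GBMs, neural networks, and AutoML ensembles. It also shows how PDPs can be used to assess interaction strength and compares the result with Friedman's H-statistic.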

ABSTRACT

In the era of "big data", it is becoming more of a challenge to not only build state-of-the-art predictive models, but also gain an understanding of what's really going on in the data. For example, it is often of interest to know which, if any, of the predictors in a fitted model are relatively influential on the predicted outcome. Some modern algorithms---like random forests and gradient boosted decision trees---have a natural way of quantifying the importance or relative influence of each feature. Other algorithms---like naive Bayes classifiers and support vector machines---are not capable of doing so and model-free approaches are generally used to measure each predictor's importance. In this paper, we propose a standardized, model-based approach to measuring predictor importance across the growing spectrum of supervised learning algorithms. Our proposed method is illustrated through both simulated and real data examples. The R code to reproduce all of the figures in this paper is available in the supplementary materials.

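MOTIVATION AND OBJECTIVES

Provide a standardized way to quantify predictor importance across many supervised learning algorithms.
Tie variable importance to the estimated relationship between each predictor and the outcome by way of partial dependence plots.
Make variable importance interpretable for ensembles and other complex models (e.g., stacking, AutoML).
Provide a PDP-based mechanism for assessing potential interaction effects between predictors.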

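PROPOSED METHOD

Compute the partial dependence function of each predictor from the fitted model.
Quantify variable importance as a measure of the flatness of the PDP: the sample standard deviation of the PDP values for continuous predictors, and the range divided by 4 for categorical predictors.
Apply Algorithm 1 to generate the PDP values over a grid of values for each predictor.
For linear models, show that under independence and uniformity assumptions the measure corresponds to the interpretation based on the standard t-statistic.
Extend the idea to interaction strength through the standard deviation of the joint PDP, and discuss how it compares with Friedman's H-statistic.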
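The steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation (the paper's own code is in R): the helper names `partial_dependence` and `pdp_importance`, the toy model `toy_predict`, and the three-row data set are all assumptions made for this example. Partial dependence is computed by brute force: each grid value is written into every row, and the model's predictions are averaged.

```python
import statistics


def partial_dependence(predict, rows, feature, grid):
    """Average prediction at each grid value of one feature.

    `predict` is any callable that maps a single row (a list) to a number,
    so the sketch does not depend on a particular modeling library.
    """
    pd_values = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = list(row)       # copy so the data is left untouched
            modified[feature] = value  # force the feature to the grid value
            total += predict(modified)
        pd_values.append(total / len(rows))
    return pd_values


def pdp_importance(pd_values, categorical=False):
    """Flatness of a PDP: sample standard deviation, or range / 4."""
    if categorical:
        return (max(pd_values) - min(pd_values)) / 4.0
    return statistics.stdev(pd_values)


# Hypothetical fitted model: depends on x0 and on the category in x2,
# and ignores x1 entirely.
def toy_predict(row):
    return 3.0 * row[0] + (2.0 if row[2] == "a" else 0.0)


rows = [[0.0, 5.0, "a"], [0.5, 1.0, "b"], [1.0, 3.0, "a"]]

imp_x0 = pdp_importance(partial_dependence(toy_predict, rows, 0, [0.0, 0.5, 1.0]))
imp_x1 = pdp_importance(partial_dependence(toy_predict, rows, 1, [1.0, 3.0, 5.0]))
imp_x2 = pdp_importance(
    partial_dependence(toy_predict, rows, 2, ["a", "b"]), categorical=True
)
```

Because the toy model ignores `x1`, its PDP is flat and its score is zero, while `x0` and `x2` receive positive scores. With a real fitted model, `predict` would wrap that model's prediction function for a single row.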

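EXPERIMENTAL RESULTS

Research Questions

RQ1: Can PDPs be used to define a single, model-agnostic variable importance score that is interpretable across different algorithms?
RQ2: Does the flatness (variability) of a PDP reliably indicate how much a predictor influences the predicted outcome?
RQ3: How can PDP-based importance be used to quantify interaction effects between predictors?
RQ4: How does the approach perform in practice on real data (e.g., Ames housing) and on AutoML and stacked ensemble models?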

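Key Findings

On real data, the PDP-based importance measure agrees with intuition about which predictors matter, and it matches or refines model-specific importance scores (e.g., those of GBMs).
In the Ames housing example, Overall_Qual, Neighborhood, and Gr_Liv_Area emerge as the leading predictors, with some reordering relative to traditional importance measures.
In the neural network example on Friedman's regression problem, the method correctly identifies the true predictors and outperforms the Garson and Olden algorithms at that task.
The method remains applicable to stacked ensembles and AutoML, which makes variable importance interpretable in complex pipelines.
The interaction diagnostic based on the standard deviation of the joint PDP identifies the true interaction (e.g., between x1 and x2), and in some cases does so better than Friedman's H-statistic.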
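The interaction diagnostic can be sketched the same way. The version below is one reading of the idea, not a confirmed transcription of the paper's procedure: it computes the joint partial dependence of two predictors on a grid, scores the flatness of one predictor at each fixed value of the other, takes the standard deviation of those conditional scores, and averages the two directions. The function names and the two toy models are assumptions made for this example.

```python
import statistics


def joint_partial_dependence(predict, rows, i, j, grid_i, grid_j):
    """Table of averaged predictions, indexed as table[a][b] for
    feature i fixed at grid_i[a] and feature j fixed at grid_j[b]."""
    table = []
    for value_i in grid_i:
        line = []
        for value_j in grid_j:
            total = 0.0
            for row in rows:
                modified = list(row)
                modified[i] = value_i
                modified[j] = value_j
                total += predict(modified)
            line.append(total / len(rows))
        table.append(line)
    return table


def interaction_strength(table):
    """Spread of the conditional importance scores, averaged over both directions."""
    # Importance of feature i at each fixed value of feature j (down the columns).
    imp_i_given_j = [statistics.stdev(column) for column in zip(*table)]
    # Importance of feature j at each fixed value of feature i (along the rows).
    imp_j_given_i = [statistics.stdev(line) for line in table]
    return (statistics.stdev(imp_i_given_j) + statistics.stdev(imp_j_given_i)) / 2.0


rows = [[0.2, 0.7], [0.9, 0.1]]
grid = [0.0, 0.5, 1.0]

# Hypothetical models: one purely additive, one with a product interaction.
additive = interaction_strength(
    joint_partial_dependence(lambda r: r[0] + r[1], rows, 0, 1, grid, grid)
)
product = interaction_strength(
    joint_partial_dependence(lambda r: r[0] * r[1], rows, 0, 1, grid, grid)
)
```

For the additive model, the effect of one predictor is the same at every value of the other, so the conditional scores do not vary and the statistic is zero. The product model yields a positive value because the slope in `x0` changes with `x1`.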

This review was generated by AI and reviewed by a human editor.