QUICK REVIEW

[Paper Review] Trees, forests, and impurity-based variable importance

Erwan Scornet|arXiv (Cornell University)|Jan 13, 2020

Neural Networks and Applications | References: 26 | Citations: 27

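One-line summary

This paper provides the first theoretical backing for Mean Decrease Impurity (MDI), a variable importance measure widely used with random forests. It proves that when the input variables are independent and there are no interactions, MDI estimates a variance decomposition of the response, which gives a rigorous basis for interpreting MDI in regression trees and random forests under these idealized conditions.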

ABSTRACT

Tree ensemble methods such as random forests [Breiman, 2001] are very popular to handle high-dimensional tabular data sets, notably because of their good predictive accuracy. However, when machine learning is used for decision-making problems, settling for the best predictive procedures may not be reasonable since enlightened decisions require an in-depth comprehension of the algorithm prediction process. Unfortunately, random forests are not intrinsically interpretable since their prediction results from averaging several hundreds of decision trees. A classic approach to gain knowledge on this so-called black-box algorithm is to compute variable importances, that are employed to assess the predictive impact of each input variable. Variable importances are then used to rank or select variables and thus play a great role in data analysis. Nevertheless, there is no justification to use random forest variable importances in such way: we do not even know what these quantities estimate. In this paper, we analyze one of the two well-known random forest variable importances, the Mean Decrease Impurity (MDI). We prove that if input variables are independent and in absence of interactions, MDI provides a variance decomposition of the output, where the contribution of each variable is clearly identified. We also study models exhibiting dependence between input variables or interaction, for which the variable importance is intrinsically ill-defined. Our analysis shows that there may exist some benefits to use a forest compared to a single tree.

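Motivation and objectives

Provide theoretical backing for Mean Decrease Impurity (MDI), a standard variable importance measure for random forests.
Clarify what MDI actually estimates in regression trees and random forests, and in particular how it should be interpreted under idealized conditions.
Investigate the limits of MDI when input variables are dependent or interact, since in those cases the notion of variable importance is itself intrinsically ill-defined.
Establish the conditions under which MDI can be regarded as a valid decomposition of the output variance.
Improve the interpretability of random forests by grounding one of their main interpretability tools in theory.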
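
To make "decomposition of the output variance" concrete, the following is a sketch in assumed notation (the symbols $g_j$, $\Delta(A)$, $\varepsilon$ and $\sigma^2$ are introduced here for illustration and may differ from the paper's). MDI sums, over the cells $A$ of a tree that are split on $X^{(j)}$, the variance reduction $\Delta(A)$ achieved by that split, weighted by the probability of the cell. For an additive model with independent inputs and independent noise of variance $\sigma^2$, the output variance separates into one term per variable, which is the quantity MDI is compared against:

$$
\mathrm{MDI}\big(X^{(j)}\big) = \sum_{A \,:\, A \text{ is split on } X^{(j)}} \mathbb{P}[X \in A]\,\Delta(A),
\qquad
Y = \sum_{j} g_j\big(X^{(j)}\big) + \varepsilon
\;\Longrightarrow\;
\mathrm{Var}[Y] = \sum_{j} \mathrm{Var}\big[g_j\big(X^{(j)}\big)\big] + \sigma^2 .
$$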

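Proposed method

Analyze the theoretical behavior of MDI in regression trees under controlled conditions, using a recursive partitioning framework.
Use theoretical tree structures that systematically assign splits to a specific variable (e.g., $X^{(1)}$ or $X^{(2)}$) in order to isolate each variable's contribution.
Apply variance decomposition techniques to show that MDI matches each variable's contribution to the total output variance.
Derive asymptotic expressions for MDI values by taking the limit as the number of tree levels $k \to \infty$.
Compare MDI values across different tree structures (e.g., all splits on $X^{(1)}$ vs. all splits on $X^{(2)}$) to show symmetry and consistency under i.i.d. inputs.
Use technical lemmas to prove that the total decrease in variance from splits along a given variable converges to a well-defined quantity related to the marginal variance of the response.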
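
The idea of a theoretical tree that splits only on one variable for $k$ levels can be illustrated with a small exact computation. This is a minimal sketch under assumptions that are not taken from the paper: $Y = X^{(1)} + X^{(2)}$, both inputs independent and uniform on $[0, 1]$, and every cell split at its midpoint along $X^{(1)}$. The function names are hypothetical. Under these assumptions the population MDI of $X^{(1)}$ after $k$ levels works out to $\frac{1}{12}\left(1 - \left(\frac{1}{4}\right)^k\right)$, which tends to $\mathrm{Var}[X^{(1)}] = \frac{1}{12}$.

```python
from fractions import Fraction


def uniform_variance(length):
    """Variance of a uniform distribution on an interval of the given length."""
    return Fraction(length) ** 2 / 12


def population_mdi_x1(k):
    """Population MDI of X1 for a tree that splits every cell at its
    midpoint along X1 for k levels (assumed setup: Y = X1 + X2 with
    X1, X2 independent and uniform on [0, 1])."""
    total = Fraction(0)
    for level in range(k):
        n_cells = 2 ** level
        side = Fraction(1, 2 ** level)  # cell length along X1
        prob = side                     # cell probability (X2 is never split)
        # Var(Y | cell) = Var(X1 | cell) + Var(X2); the Var(X2) term is the
        # same before and after the split, so it cancels in the reduction.
        reduction = uniform_variance(side) - uniform_variance(side / 2)
        total += n_cells * prob * reduction
    return total


if __name__ == "__main__":
    for k in (1, 2, 5, 10):
        print(k, population_mdi_x1(k), float(population_mdi_x1(k)))
```

The geometric factor $\left(\frac{1}{4}\right)^k$ has the same shape as the expression quoted under the key findings; the leading constant depends on the input distribution assumed, so the numbers here should not be read as the paper's.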

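Experimental results

Research questions

RQ1: What does Mean Decrease Impurity (MDI) actually estimate in random forests?
RQ2: Under what conditions is MDI a valid and interpretable measure of variable importance?
RQ3: How does MDI behave when input variables are dependent or the response function contains interactions?
RQ4: Can MDI be theoretically justified as a variance decomposition of the output in regression trees?
RQ5: What are the limitations of MDI when input variables are correlated or interactions are present?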

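Key findings

When the input variables are independent and the model has no interactions, MDI provides a valid variance decomposition of the output.
Under those same conditions, each variable's MDI value matches exactly that variable's contribution to the total output variance.
When input variables are dependent or interact, the notion of variable importance itself becomes intrinsically ill-defined, and MDI cannot be meaningfully interpreted as a marginal contribution.
For the symmetric model with response $Y = X^{(1)} + X^{(2)}$, the MDI values of $X^{(1)}$ and $X^{(2)}$ converge asymptotically, as the number of tree levels increases, to $\frac{1}{3} - \frac{1}{3}\left(\frac{1}{4}\right)^\beta$.
The theoretical analysis confirms that, under the assumed i.i.d. and independence conditions, MDI is not biased toward variables with many categories or toward variables with high-frequency categories; this does not carry over to correlated features.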
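
The first two findings can be checked empirically with a rough simulation. The sketch below assumes scikit-learn, whose `RandomForestRegressor.feature_importances_` attribute is a normalized impurity-based importance (it sums to 1). The data-generating model ($Y = 2X^{(1)} + X^{(2)}$ with a third, irrelevant input, all independent and uniform on $[0, 1]$), the sample size, and the function name are illustrative choices, not the paper's experimental setup.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def compare_mdi_to_variance_shares(n_samples=2000, n_trees=50, seed=0):
    """Fit a forest on an additive model with independent inputs and
    return (normalized MDI, theoretical variance shares)."""
    rng = np.random.default_rng(seed)
    # Three independent uniform inputs; the third does not affect y.
    X = rng.uniform(0.0, 1.0, size=(n_samples, 3))
    y = 2.0 * X[:, 0] + X[:, 1]  # additive, no interactions, no noise

    forest = RandomForestRegressor(n_estimators=n_trees, random_state=seed)
    forest.fit(X, y)
    mdi = forest.feature_importances_  # normalized impurity-based importance

    # Var(2U) = 4/12, Var(U) = 1/12, and 0 for the irrelevant input.
    variances = np.array([4.0, 1.0, 0.0]) / 12.0
    shares = variances / variances.sum()
    return mdi, shares


if __name__ == "__main__":
    mdi, shares = compare_mdi_to_variance_shares()
    print("normalized MDI:  ", np.round(mdi, 3))
    print("variance shares: ", np.round(shares, 3))
```

In this independent, additive setting the normalized MDI should land near the theoretical shares of 0.8, 0.2 and 0. Replacing the independent inputs with correlated ones is a simple way to see the third finding: the importances no longer have an unambiguous target to match.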


This review was generated by AI and checked by a human editor.