QUICK REVIEW

[Paper Review] Some asymptotic results of survival tree and forest models

Yifan Cui, Ruoqing Zhu | arXiv (Cornell University) | Jul 30, 2017

Statistical Methods and Inference | Cited by 3

One-line summary

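This paper proposes a bias-corrected splitting rule for survival trees and random survival forests that accounts for censoring when estimating the within-node failure distribution, with the aim of improving consistency and prediction accuracy. It proves consistency in finite-dimensional and high-dimensional settings and shows that the convergence rate of the bias-corrected model depends only on the failure-related variables. Simulations confirm a reduction in prediction error.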

ABSTRACT

Random survival forest and survival trees are popular models in statistics and machine learning. However, there is a lack of general understanding regarding consistency, splitting rules and influence of the censoring mechanism. In this paper, we investigate the statistical properties of existing methods from several interesting perspectives. First, we show that traditional splitting rules with censored outcomes rely on a biased estimation of the within-node failure distribution. To exactly quantify this bias, we develop a concentration bound of the within-node estimation based on non i.i.d. samples and apply it to the entire forest. Second, we analyze the entanglement between the failure and censoring distributions caused by univariate splits, and show that without correcting the bias at an internal node, survival tree and forest models can still enjoy consistency under suitable conditions. In particular, we demonstrate this property under two cases: a finite-dimensional case where the splitting variables and cutting points are chosen randomly, and a high-dimensional case where the covariates are weakly correlated. Our results can also degenerate into an independent covariate setting, which is commonly used in the random forest literature for high-dimensional sparse models. However, it may not be avoidable that the convergence rate depends on the total number of variables in the failure and censoring distributions. Third, we propose a new splitting rule that compares bias-corrected cumulative hazard functions at each internal node. We show that the rate of consistency of this new model depends only on the number of failure variables, which improves from non-bias-corrected versions. We perform simulation studies to confirm that this can substantially benefit the prediction error.

Motivation and objectives

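Close the gap in theoretical understanding of consistency, splitting rules, and the effect of censoring in survival tree and forest models.
Quantify the bias that arises in traditional splitting rules, which rely on censored, non-i.i.d. survival data.
Establish the conditions under which survival trees and forests remain consistent even when the bias in node-level estimation is left uncorrected.
Develop a new splitting rule that corrects the bias in cumulative hazard estimation at internal nodes.
Show that the convergence rate improves so that it depends only on the failure-related variables rather than on the full set of covariates.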

Proposed method

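Derive a concentration bound for the within-node estimate of the failure distribution, based on censored, non-i.i.d. samples.
Analyze how univariate splits in survival trees entangle the failure and censoring distributions.
Introduce a new splitting rule that compares bias-corrected cumulative hazard functions at each internal node.
Establish consistency theoretically in a finite-dimensional setting and in a high-dimensional setting with weakly correlated covariates.
Apply the concentration bound to the entire forest to quantify how estimation error propagates.
Run simulation studies comparing prediction error between models with and without bias correction.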
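
To make the idea of comparing cumulative hazard functions at an internal node concrete, the following is a minimal sketch of the standard, uncorrected ingredients only. It computes the Nelson-Aalen cumulative hazard estimate from the samples in a node and scores a candidate univariate split by the largest gap between the two child-node estimates. The function names `nelson_aalen` and `split_score`, the sup-distance score, and the choice of Nelson-Aalen are assumptions made for illustration; the paper's bias correction is not reproduced here.

```python
import numpy as np

def nelson_aalen(times, events, grid):
    """Nelson-Aalen cumulative hazard estimate evaluated at each point of `grid`.

    times  : observed times (minimum of failure and censoring time)
    events : 1 if the failure was observed, 0 if censored
    """
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=bool)
    grid = np.asarray(grid, dtype=float)
    event_times = np.unique(times[events])
    # Number still at risk just before each distinct failure time.
    at_risk = np.array([(times >= t).sum() for t in event_times], dtype=float)
    # Number of observed failures at each distinct failure time.
    deaths = np.array([(times[events] == t).sum() for t in event_times], dtype=float)
    cumhaz = np.concatenate([[0.0], np.cumsum(deaths / at_risk)])
    # Step function: sum the increments of all failure times <= grid point.
    idx = np.searchsorted(event_times, grid, side="right")
    return cumhaz[idx]

def split_score(times, events, x, cut, grid):
    """Hypothetical split score: sup-distance between child-node cumulative hazards."""
    times, events, x = np.asarray(times), np.asarray(events), np.asarray(x)
    left = x <= cut
    if left.all() or not left.any():
        return 0.0  # degenerate split: one child is empty
    h_left = nelson_aalen(times[left], events[left], grid)
    h_right = nelson_aalen(times[~left], events[~left], grid)
    return float(np.max(np.abs(h_left - h_right)))
```

A tree builder would evaluate `split_score` over candidate variables and cut points and keep the best one. Because the estimates inside each child are computed from censored observations whose failure and censoring times both depend on the covariates, this uncorrected score is the kind of quantity that the bias analysis in the paper concerns.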

Results

Research questions

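RQ1: How does censoring bias affect the consistency of survival tree and forest models?
RQ2: Can survival trees and forests remain consistent even when the splitting rule relies on a biased estimate of the within-node failure distribution?
RQ3: In a high-dimensional setting with weakly correlated covariates, what conditions guarantee consistency?
RQ4: How does the convergence rate of survival forest models depend on the number of failure variables and censoring variables?
RQ5: Can a bias-corrected splitting rule reduce prediction error and improve the convergence rate?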

Key findings

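Traditional splitting rules for survival trees rely on a within-node estimate of the failure distribution that is biased by censoring.
A concentration bound for non-i.i.d. samples is derived and applied to the entire forest to quantify estimation error.
Even without bias correction, survival trees and forests remain consistent under suitable conditions on how splitting variables and cutting points are chosen.
The proposed bias-corrected splitting rule achieves a rate of consistency that depends only on the number of failure-related variables, not on the total number of covariates.
Simulation studies confirm that the bias-corrected model substantially reduces prediction error compared with the non-bias-corrected models.
The theoretical results reduce to the independent-covariate setting commonly used in the random forest literature, which supports consistency for sparse high-dimensional models.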
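
The first finding can be illustrated with a toy calculation, offered as a sketch under simplifying assumptions rather than as the paper's own bound. Suppose a node pools two subgroups with proportions `p`, exponential failure rates `lam`, and exponential censoring rates `mu`, with failure and censoring independent within each subgroup. The pooled Nelson-Aalen estimator then targets a hazard weighted by who is still at risk (neither failed nor censored), which matches the true node-level hazard only when censoring does not differ across subgroups. The function name `node_cumulative_hazards` and all parameter values are hypothetical.

```python
import numpy as np

def node_cumulative_hazards(lam, mu, p, t_max=2.0, n=20001):
    """Return (true, pooled) cumulative hazards at t_max for a node mixing subgroups.

    lam : failure rates per subgroup (exponential, assumed)
    mu  : censoring rates per subgroup (exponential, assumed)
    p   : subgroup proportions inside the node
    """
    lam, mu, p = (np.asarray(a, dtype=float) for a in (lam, mu, p))
    s = np.linspace(0.0, t_max, n)[:, None]
    # True hazard of the node-level (mixture) failure distribution.
    surv = p * np.exp(-lam * s)
    true_haz = (lam * surv).sum(axis=1) / surv.sum(axis=1)
    # Large-sample hazard targeted by the pooled Nelson-Aalen estimator:
    # subgroups are weighted by the chance of still being at risk.
    at_risk = p * np.exp(-(lam + mu) * s)
    pooled_haz = (lam * at_risk).sum(axis=1) / at_risk.sum(axis=1)
    ds = t_max / (n - 1)

    def integrate(h):
        # Trapezoidal rule on the uniform grid.
        return float(np.sum(0.5 * (h[1:] + h[:-1])) * ds)

    return integrate(true_haz), integrate(pooled_haz)

# Censoring identical across subgroups: the two quantities agree.
same = node_cumulative_hazards(lam=[0.5, 2.0], mu=[1.0, 1.0], p=[0.5, 0.5])
# Censoring depends on the subgroup: the pooled target drifts from the truth.
diff = node_cumulative_hazards(lam=[0.5, 2.0], mu=[2.0, 0.1], p=[0.5, 0.5])
print(same, diff)
```

In the second call the low-risk subgroup is censored much faster, so the at-risk set becomes dominated by the high-risk subgroup and the pooled target overstates the node-level cumulative hazard. This is the kind of entanglement between failure and censoring distributions that a bias-corrected splitting rule is meant to remove.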

This review was generated by AI and checked by a human editor.