QUICK REVIEW

[Paper Review] Variable Selection Inference for Bayesian Additive Regression Trees

Justin Bleich, Adam Kapelner|arXiv (Cornell University)|Oct 18, 2013

Gene expression and cancer classification | Cited by 3

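One-Sentence Summary

This paper proposes a permutation-based inferential method for Bayesian Additive Regression Trees (BART) to improve variable selection in high-dimensional nonlinear settings, with a particular focus on discovering gene regulatory networks. The method also extends BART with an informed prior on variable importance, recovers high-signal predictors better than existing methods, and is implemented in the R package bartMachine.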

ABSTRACT

We consider the task of discovering gene regulatory networks, which are defined as sets of genes and the corresponding transcription factors which regulate their expression levels. This can be viewed as a variable selection problem, potentially with high dimensionality. Variable selection is especially challenging in high-dimensional settings, where it is difficult to detect subtle individual effects and interactions between predictors. Bayesian Additive Regression Trees [BART, Ann. Appl. Stat. 4 (2010) 266-298] provides a novel nonparametric alternative to parametric regression approaches, such as the lasso or stepwise regression, especially when the number of relevant predictors is sparse relative to the total number of available predictors and the fundamental relationships are nonlinear. We develop a principled permutation-based inferential approach for determining when the effect of a selected predictor is likely to be real. Going further, we adapt the BART procedure to incorporate informed prior information about variable importance. We present simulations demonstrating that our method compares favorably to existing parametric and nonparametric procedures in a variety of data settings. To demonstrate the potential of our approach in a biological context, we apply it to the task of inferring the gene regulatory network in yeast (Saccharomyces cerevisiae). We find that our BART-based procedure is best able to recover the subset of covariates with the largest signal compared to other variable selection methods. The methods developed in this work are readily available in the R package bartMachine.

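Motivation and Objectives

Address the variable selection challenge that arises because traditional parametric methods such as the lasso struggle to capture subtle effects and interactions in high-dimensional nonlinear data.
Build a principled inferential framework for deciding whether a predictor selected by BART has a real rather than a spurious effect.
Incorporate informed prior knowledge about variable importance into the BART framework to improve selection accuracy.
Evaluate how well the method recovers true signal in complex high-dimensional data, particularly in biological applications.
Demonstrate the method's practical value by reconstructing the gene regulatory network of yeast (Saccharomyces cerevisiae) from real biological data.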

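Proposed Method

The method uses permutation-based inference to assess the significance of individual predictor effects in BART, testing whether an observed effect could have arisen by chance.
It extends the BART model with an informed prior on variable importance, so that domain knowledge can guide the selection process.
It relies on Bayesian nonparametric regression, modeling complex nonlinear relationships through additive regression trees without assuming a specific parametric form.
Variable importance is estimated from variable inclusion proportions computed from the BART posterior, and its significance is assessed with permutation tests.
The method is implemented in the R package bartMachine, which makes it practical to apply to high-dimensional data sets.
Simulation experiments compare the method with parametric and nonparametric alternatives under a variety of data-generating mechanisms.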
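
The permutation idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the bartMachine implementation: the importance score here is a simple stand-in (absolute Pearson correlation) rather than BART variable inclusion proportions, the function names and the demo data are hypothetical, and the two cutoff rules shown (a per-variable cutoff and a single cutoff based on the maximum across variables) are one plausible way to turn a permutation null distribution into a selection rule.

```python
import random


def pearson(a, b):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / (var_a * var_b) ** 0.5


def importance_scores(X, y):
    """ASSUMPTION: stand-in importance (|correlation|), not BART inclusion proportions."""
    n_features = len(X[0])
    return [abs(pearson([row[j] for row in X], y)) for j in range(n_features)]


def quantile(values, q):
    """Empirical quantile taken as an order statistic of the sorted values."""
    ordered = sorted(values)
    index = min(len(ordered) - 1, int(q * len(ordered)))
    return ordered[index]


def permutation_select(X, y, n_permutations=200, alpha=0.05, seed=0):
    """Select features whose observed importance exceeds a permutation-null cutoff.

    Returns (local, global_max): indices passing a per-feature cutoff and
    indices passing a single cutoff built from the max over all features.
    """
    rng = random.Random(seed)
    observed = importance_scores(X, y)
    n_features = len(observed)

    # Null distribution: permuting y breaks any link between X and y.
    null_scores = []
    for _ in range(n_permutations):
        y_perm = y[:]
        rng.shuffle(y_perm)
        null_scores.append(importance_scores(X, y_perm))

    local_cutoffs = [
        quantile([scores[j] for scores in null_scores], 1 - alpha)
        for j in range(n_features)
    ]
    global_cutoff = quantile([max(scores) for scores in null_scores], 1 - alpha)

    local = [j for j in range(n_features) if observed[j] > local_cutoffs[j]]
    global_max = [j for j in range(n_features) if observed[j] > global_cutoff]
    return local, global_max


def make_demo_data(n_rows=200, n_features=10, seed=0):
    """Hypothetical data: only features 0 and 3 influence the response."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n_rows)]
    y = [3 * row[0] - 2 * row[3] + rng.gauss(0, 1) for row in X]
    return X, y


if __name__ == "__main__":
    X, y = make_demo_data(seed=1)
    local, global_max = permutation_select(X, y, n_permutations=100, seed=2)
    print("local cutoff selects:", local)
    print("global max cutoff selects:", global_max)
```

The cutoff based on the maximum is the stricter of the two, so anything it selects is also selected by the per-variable rule. In a real analysis the stand-in score would be replaced by importance values from a fitted BART model, refit on each permuted response.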

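Experimental Results

Research Questions

RQ1: In high-dimensional nonlinear regression settings, can the permutation-based inferential procedure reliably distinguish real from spurious predictor effects?
RQ2: How does incorporating informed prior knowledge about variable importance into BART affect variable selection accuracy?
RQ3: In which scenarios does the proposed BART-based method outperform traditional parametric and nonparametric variable selection techniques?
RQ4: In complex high-dimensional data with nonlinear interactions, to what extent can the method recover the true set of high-signal predictors?
RQ5: How effective is the method at reconstructing known biological regulatory networks, such as the one in Saccharomyces cerevisiae?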

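Key Findings

In simulations across a variety of data settings, the proposed method compares favorably with existing parametric and nonparametric variable selection procedures.
The BART-based procedure with informed priors was best able to recover the subset of covariates with the largest signal in high-dimensional data.
The permutation-based inferential framework identified real predictor effects and reduced the false positive rate in variable selection.
In the application to the yeast gene regulatory network, the method was better able to recover biologically meaningful transcription factor-gene interactions.
Incorporating prior knowledge into BART improved variable selection accuracy while preserving the model's flexibility.
The R package bartMachine gives researchers in genomics and high-dimensional statistics a practical, accessible implementation of the method.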
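
One way to picture the informed prior on variable importance mentioned in the findings above is as a reweighting of how often each predictor is proposed as a splitting variable when trees are grown. The Python sketch below assumes exactly that and nothing more: the function names and the example weights are hypothetical, and it does not reproduce how bartMachine implements the prior internally.

```python
import random
from collections import Counter


def split_variable_probabilities(prior_weights):
    """Normalize non-negative prior weights into split-proposal probabilities.

    ASSUMPTION: the prior acts only by reweighting which predictor is
    proposed for a split; equal weights recover a uniform proposal.
    """
    if any(w < 0 for w in prior_weights):
        raise ValueError("prior weights must be non-negative")
    total = sum(prior_weights)
    if total <= 0:
        raise ValueError("at least one prior weight must be positive")
    return [w / total for w in prior_weights]


def draw_split_variables(prior_weights, n_draws, seed=0):
    """Simulate split-variable proposals and return the observed frequencies."""
    rng = random.Random(seed)
    probs = split_variable_probabilities(prior_weights)
    draws = rng.choices(range(len(probs)), weights=probs, k=n_draws)
    counts = Counter(draws)
    return [counts[j] / n_draws for j in range(len(probs))]


if __name__ == "__main__":
    # Hypothetical weights: the third predictor is believed to matter most.
    weights = [1, 1, 8]
    print("proposal probabilities:", split_variable_probabilities(weights))
    print("simulated frequencies:", draw_split_variables(weights, 5000))
```

Predictors with larger prior weight are proposed more often, so domain knowledge nudges the search toward them while the data still determine which splits are kept.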


This review was generated by AI and reviewed by a human editor.