QUICK REVIEW

[Paper Review] Selective Sequential Model Selection

William Fithian, Jonathan Taylor|arXiv (Cornell University)|Dec 8, 2015

Machine Learning and Algorithms | References: 28 | Cited by: 40

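One-Sentence Summary

This paper proposes a framework for selective sequential model selection that constructs valid p-values at each step of an adaptive model path (such as forward stepwise regression or the lasso) while accounting for data-dependent model selection. The framework introduces the selective max-t test and the next-entry test, which yield independent, uniformly distributed p-values under the null and so support FDR-controlling sequential stopping rules with strong error-rate guarantees.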

ABSTRACT

Many model selection algorithms produce a path of fits specifying a sequence of increasingly complex models. Given such a sequence and the data used to produce them, we consider the problem of choosing the least complex model that is not falsified by the data. Extending the selected-model tests of Fithian et al. (2014), we construct p-values for each step in the path which account for the adaptive selection of the model path using the data. In the case of linear regression, we propose two specific tests, the max-t test for forward stepwise regression (generalizing a proposal of Buja and Brown (2014)), and the next-entry test for the lasso. These tests improve on the power of the saturated-model test of Tibshirani et al. (2014), sometimes dramatically. In addition, our framework extends beyond linear regression to a much more general class of parametric and nonparametric model selection problems. To select a model, we can feed our single-step p-values as inputs into sequential stopping rules such as those proposed by G'Sell et al. (2013) and Li and Barber (2015), achieving control of the familywise error rate or false discovery rate (FDR) as desired. The FDR-controlling rules require the null p-values to be independent of each other and of the non-null p-values, a condition not satisfied by the saturated-model p-values of Tibshirani et al. (2014). We derive intuitive and general sufficient conditions for independence, and show that our proposed constructions yield independent p-values.

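Motivation and Objectives

Address the challenge of valid statistical inference after adaptive model selection along a sequential path of increasingly complex models.
Develop p-values that account for the data-dependent selection of the model path, so that the type I error rate stays controlled despite selection bias.
Enable sequential stopping rules (such as ForwardStop and the Li-Barber rules) that require independent null p-values, a condition the earlier saturated-model p-values do not satisfy.
Extend the framework to general parametric and nonparametric settings, including changepoint detection.
Ensure that the null p-values fed to the stopping rules are independent of each other and of the non-null p-values, as FDR control in the sequential setting requires.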
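
The ForwardStop rule of G'Sell et al. (2013), named in the objectives above, can be sketched in a few lines. It stops at the largest step k for which the running average of -log(1 - p_i) over the first k p-values is at most the target level alpha. The function name `forward_stop` and the example p-values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def forward_stop(p_values, alpha=0.1):
    """Return the number of steps kept by ForwardStop (0 means keep none)."""
    p = np.asarray(p_values, dtype=float)
    # Clip so that a p-value of exactly 1 does not produce log(0).
    p = np.clip(p, 0.0, 1.0 - 1e-12)
    steps = np.arange(1, len(p) + 1)
    # Running average of -log(1 - p_i) over the first k steps.
    running_mean = np.cumsum(-np.log1p(-p)) / steps
    passing = np.nonzero(running_mean <= alpha)[0]
    return int(passing[-1] + 1) if passing.size else 0

# Hypothetical path: three small p-values followed by two large ones.
print(forward_stop([0.01, 0.02, 0.03, 0.8, 0.9], alpha=0.1))  # 3
```

The FDR guarantee of this rule relies on the independence condition stated above, which is why the way the input p-values are constructed matters as much as the rule itself.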

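Proposed Method

Builds selective p-values by conditional inference on the selection event, conditioning on the sufficient statistics and on the history of the model path.
For forward stepwise regression, proposes the selective max-t test, which computes p-values from the conditional distribution of the maximum t-statistic given the selected model path.
For the lasso, introduces the next-entry test, which assesses the significance of the next variable to enter the model under the conditional null distribution.
Guarantees independence of the null p-values by deriving conditions under which the selection events and test statistics are conditionally independent across steps.
Applies the framework to nonparametric changepoint detection by defining a greedy path algorithm that adds changepoints based on a two-sample test statistic, with p-values obtained by permutation sampling.
Computes exact p-values by resampling (permutation or MCMC) under the conditional null, which preserves uniformity and validity under model selection.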
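
As a rough illustration of the resampling idea, the sketch below computes a Monte Carlo max-t p-value for the first step of forward stepwise regression only. It rests on several simplifying assumptions that are mine rather than the paper's: no intercept, Gaussian errors with unknown variance, and the global null (no active variables). Under those assumptions, conditioning on the norm of y leaves its direction uniform on the sphere, so the null distribution can be simulated directly. Later steps would also need to condition on the path so far and on the sufficient statistics of the current model, which this sketch does not attempt. The names `max_t_stat` and `first_step_max_t_pvalue` are hypothetical.

```python
import numpy as np

def max_t_stat(X, Y):
    """Max absolute correlation between the columns of X and each row of Y.

    Without an intercept, each single-variable |t|-statistic is an increasing
    function of this absolute correlation, so maximizing one maximizes the other.
    """
    Xn = X / np.linalg.norm(X, axis=0)       # scale columns to unit norm
    Y = np.atleast_2d(Y)                     # shape (m, n)
    scores = np.abs(Y @ Xn)                  # shape (m, p)
    return scores.max(axis=1) / np.linalg.norm(Y, axis=1)

def first_step_max_t_pvalue(X, y, n_draws=2000, seed=0):
    """Monte Carlo p-value for the first selected variable under the global null."""
    rng = np.random.default_rng(seed)
    observed = max_t_stat(X, y)[0]
    # The statistic is scale-invariant, so standard normal draws stand in for
    # directions that are uniform on the sphere.
    Z = rng.standard_normal((n_draws, X.shape[0]))
    null_stats = max_t_stat(X, Z)
    # Add-one correction keeps the Monte Carlo p-value valid.
    return (1 + np.sum(null_stats >= observed)) / (n_draws + 1)

# Hypothetical data: one strong predictor among five.
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 5))
y = 5.0 * X[:, 0] + rng.standard_normal(50)
print(first_step_max_t_pvalue(X, y))
```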

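Experimental Results

Research Questions

RQ1: Can we construct valid p-values at each step of an adaptive model path that reflect the data-dependent selection of the model sequence?
RQ2: Do the proposed p-values satisfy the independence condition required by FDR-controlling sequential stopping rules such as ForwardStop and the Li-Barber rules?
RQ3: Can the framework extend beyond linear models to general parametric and nonparametric settings, such as changepoint detection?
RQ4: How does the power of the proposed tests (the max-t test and the next-entry test) compare with that of the saturated-model p-values of Tibshirani et al. (2014)?
RQ5: Under what sufficient conditions are the p-values derived from the selective inference framework independent under the null?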

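Key Findings

The selective max-t test and the next-entry test produce p-values that are uniform under the null and independent across steps, meeting the conditions required for FDR control.
The proposed p-values have markedly higher power than the saturated-model p-values of Tibshirani et al. (2014), with the largest gains in the early steps of the path.
On the diabetes data set, the max-t p-values lead to a selected model at step 8 (glu²), whereas the saturated-model p-values select only at step 9 (age²), indicating earlier detection of meaningful predictors.
By conditioning on the sufficient statistics and the selection history, the framework keeps the p-values valid even though the model path is chosen adaptively.
For nonparametric changepoint detection, the greedy path algorithm based on a two-sample test statistic yields valid p-values through permutation sampling, and the structure of the selection procedure guarantees their independence under the null.
The paper derives theoretical conditions under which the null p-values are independent, which extends the framework beyond linear models to a broad class of parametric and nonparametric problems.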
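
The permutation idea behind the changepoint finding can be sketched for the first changepoint only. Under the null of no changepoint the observations are exchangeable, so permuting them and recomputing the maximized two-sample statistic gives a reference distribution that already accounts for the search over split points. The scaled difference in means used here is an assumed stand-in for whatever two-sample statistic the paper uses, and the function names are hypothetical. Later steps would have to respect the changepoints already chosen, which this sketch does not handle.

```python
import numpy as np

def max_split_stat(x):
    """Largest scaled difference in means over all single split points."""
    n = len(x)
    csum = np.cumsum(x)
    k = np.arange(1, n)                       # left segment holds k points
    left_mean = csum[:-1] / k
    right_mean = (csum[-1] - csum[:-1]) / (n - k)
    # Scale so that splits near the ends are comparable to central splits.
    stat = np.abs(left_mean - right_mean) * np.sqrt(k * (n - k) / n)
    return float(stat.max())

def first_changepoint_pvalue(x, n_perm=999, seed=0):
    """Permutation p-value for the best first changepoint."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    observed = max_split_stat(x)
    perm_stats = np.array(
        [max_split_stat(rng.permutation(x)) for _ in range(n_perm)]
    )
    # Add-one correction keeps the permutation p-value valid.
    return (1 + np.sum(perm_stats >= observed)) / (n_perm + 1)

# Hypothetical series: the mean shifts from 0 to 3 at the midpoint.
rng = np.random.default_rng(3)
series = np.concatenate([rng.standard_normal(30), 3.0 + rng.standard_normal(30)])
print(first_changepoint_pvalue(series))
```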


This review was generated by AI and reviewed by human editors.