QUICK REVIEW

[Paper Digest] Predictive Inference Is Free with the Jackknife+-after-Bootstrap

Byol Kim, Xu Chen|arXiv (Cornell University)|Feb 20, 2020

Machine Learning and Data Classification | References: 37 | Cited by: 32

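One-Sentence Summary

Introduces the jackknife+-after-bootstrap (J+aB), an efficient wrapper for ensemble predictors that yields distribution-free predictive intervals with coverage of at least 1−2α, at a computational cost close to that of a single ensemble prediction.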

ABSTRACT

Ensemble learning is widely used in applications to make predictions in complex decision problems---for example, averaging models fitted to a sequence of samples bootstrapped from the available training data. While such methods offer more accurate, stable, and robust predictions and model estimates, much less is known about how to perform valid, assumption-lean inference on the output of these types of procedures. In this paper, we propose the jackknife+-after-bootstrap (J+aB), a procedure for constructing a predictive interval, which uses only the available bootstrapped samples and their corresponding fitted models, and is therefore "free" in terms of the cost of model fitting. The J+aB offers a predictive coverage guarantee that holds with no assumptions on the distribution of the data, the nature of the fitted model, or the way in which the ensemble of models are aggregated---at worst, the failure rate of the predictive interval is inflated by a factor of 2. Our numerical experiments verify the coverage and accuracy of the resulting predictive intervals on real data.

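Motivation and Goals

Motivate valid, assumption-lean predictive inference on the output of ensemble learning.
Wrap an ensemble method with a predictive interval that requires no distributional assumptions.
Stay computationally efficient by reusing each fitted base model on its out-of-bag observations.
Provide a finite-sample coverage guarantee and validate it empirically on real data.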

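Proposed Method

Proposes the jackknife+-after-bootstrap (J+aB) as a wrapper around any base regression algorithm and aggregation function.
Reuses out-of-bag models to obtain leave-one-out predictions without any additional calls to the base algorithm (Algorithm 2).
Computes the predictive interval from a pair of quantiles: at each x, $C_{\alpha,n,B}(x) = \left[\, q^{-}_{\alpha,n}\{\mu_{\varphi\setminus i}(x) - R_i\},\; q^{+}_{\alpha,n}\{\mu_{\varphi\setminus i}(x) + R_i\} \,\right]$, where $R_i = |Y_i - \mu_{\varphi\setminus i}(X_i)|$ and $\mu_{\varphi\setminus i}$ aggregates only the base models whose bootstrap sample excludes observation i.
Keeps the computational cost at O(B) calls to the base algorithm, matching the cost of a single ensemble prediction.
Provides a symmetry-corrected version in which the number of bootstrap samples B is drawn from a Binomial distribution, which restores exchangeability and yields the distribution-free guarantee.
Theoretical guarantee: under i.i.d. data and the symmetry assumption, $P\big(Y_{n+1} \in C^{\text{J+aB}}_{\alpha,n,B}(X_{n+1})\big) \ge 1 - 2\alpha$ (finite-sample, non-asymptotic).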
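
The steps listed above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the base algorithm (`fit_least_squares`), the mean aggregation, the function and parameter names, and the choice to raise an error when an observation has no out-of-bag model are all hypothetical. It uses a fixed B rather than the Binomial-randomized variant, and it takes q⁺ to be the ⌈(1−α)(n+1)⌉-th smallest value and q⁻ its mirror image, as is standard for the jackknife+.

```python
import numpy as np


def fit_least_squares(X, y):
    """Hypothetical base algorithm R: least squares with an intercept."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda X_new: np.column_stack([np.ones(len(X_new)), X_new]) @ coef


def q_plus(values, alpha):
    """ceil((1 - alpha)(n + 1))-th smallest value, or +inf if that rank exceeds n."""
    values = np.sort(np.asarray(values, dtype=float))
    n = len(values)
    k = int(np.ceil((1 - alpha) * (n + 1)))
    return np.inf if k > n else values[k - 1]


def q_minus(values, alpha):
    """floor(alpha(n + 1))-th smallest value, obtained by mirroring q_plus."""
    return -q_plus(-np.asarray(values, dtype=float), alpha)


def jackknife_plus_after_bootstrap(X, y, X_test, B=50, m=None, alpha=0.1,
                                   fit=fit_least_squares, aggregate=np.mean,
                                   seed=0):
    """Return (lower, upper) interval endpoints for each row of X_test."""
    rng = np.random.default_rng(seed)
    n, n_test = len(y), len(X_test)
    m = n if m is None else m

    # Step 1: fit B base models, one per bootstrap sample (B calls to R).
    in_bag = np.zeros((B, n), dtype=bool)
    preds_train = np.empty((B, n))
    preds_test = np.empty((B, n_test))
    for b in range(B):
        idx = rng.integers(0, n, size=m)   # sample m indices with replacement
        in_bag[b, idx] = True
        model = fit(X[idx], y[idx])
        preds_train[b] = model(X)          # B * n evaluations
        preds_test[b] = model(X_test)      # B * n_test evaluations

    # Step 2: for each i, aggregate only the models that never saw point i.
    lower_terms = np.empty((n, n_test))
    upper_terms = np.empty((n, n_test))
    for i in range(n):
        oob = ~in_bag[:, i]
        if not oob.any():
            # Assumption of this sketch: fail loudly instead of guessing.
            raise ValueError(f"observation {i} is in every bootstrap sample")
        mu_i_at_xi = aggregate(preds_train[oob, i], axis=0)
        mu_i_at_test = aggregate(preds_test[oob], axis=0)
        r_i = abs(y[i] - mu_i_at_xi)       # leave-one-out residual R_i
        lower_terms[i] = mu_i_at_test - r_i
        upper_terms[i] = mu_i_at_test + r_i

    # Step 3: take the two quantiles over i, separately for each test point.
    lower = np.array([q_minus(lower_terms[:, j], alpha) for j in range(n_test)])
    upper = np.array([q_plus(upper_terms[:, j], alpha) for j in range(n_test)])
    return lower, upper
```

Calling `jackknife_plus_after_bootstrap(X, y, X_test)` with `X` of shape `(n, d)` returns two arrays of length `len(X_test)`. Any aggregation that accepts an `axis` argument, such as `np.median`, can be passed as `aggregate`.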

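Experimental Results

Research Questions

RQ1: Can a wrapper around ensemble predictors produce valid, distribution-free predictive intervals without strong assumptions?
RQ2: What coverage guarantee does J+aB offer in finite samples, and how tight is it?
RQ3: Can the method deliver easy-to-use predictive intervals while keeping computational cost comparable to that of a single ensemble prediction?
RQ4: How does J+aB perform empirically on real datasets with different base learners?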

Key Findings

Method	Calls to R (base algorithm)	Evaluations	Calls to φ (aggregation)
Ensemble	B	B n_test	n_test
J+ with Ensemble	B n	Bn(1+n_test)	n(1+n_test)
J+aB	B	B(n+n_test)	n(1+n_test)
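
To make the table concrete, the formulas can be evaluated for a specific problem size. The helper below is only an illustrative sketch: the function name, the dictionary keys, and the example values B = 50, n = 1000, n_test = 100 are assumptions chosen for this example and do not come from the paper's experiments.

```python
def cost_counts(B, n, n_test):
    """Evaluate the three cost formulas from the table for each method."""
    return {
        "Ensemble": {
            "calls_to_R": B,
            "evaluations": B * n_test,
            "calls_to_phi": n_test,
        },
        "J+ with Ensemble": {
            "calls_to_R": B * n,
            "evaluations": B * n * (1 + n_test),
            "calls_to_phi": n * (1 + n_test),
        },
        "J+aB": {
            "calls_to_R": B,
            "evaluations": B * (n + n_test),
            "calls_to_phi": n * (1 + n_test),
        },
    }


# Hypothetical problem size, chosen only for illustration.
counts = cost_counts(B=50, n=1000, n_test=100)
for method, row in counts.items():
    print(method, row)
# Ensemble          -> 50 fits,     5,000 evaluations,     100 aggregations
# J+ with Ensemble  -> 50,000 fits, 5,050,000 evaluations, 101,000 aggregations
# J+aB              -> 50 fits,     55,000 evaluations,    101,000 aggregations
```

At this size, J+aB needs the same 50 model fits as the plain ensemble, whereas J+ with Ensemble needs 1000 times as many. The extra work in J+aB is confined to the cheaper evaluation and aggregation steps.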

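In experiments on three real datasets, J+aB intervals achieve coverage close to the nominal level 1−α.
The theoretical guarantee gives a worst-case, distribution-free lower bound of 1−2α on coverage; it is non-asymptotic and, under the stated assumptions, holds for any n and any data distribution.
The computational cost of J+aB is comparable to that of producing a single ensemble prediction, so it is nearly "free" in terms of additional model fitting.
Empirically, J+aB intervals are informative, and they are typically narrower for unstable base learners (such as random forests) than for stable ones (such as ridge regression).
J+aB is cheaper than the alternative, J+ with Ensemble, while providing comparable coverage and interval quality.
The method is compatible with a variety of conformity scores (residuals, quantile regression, weighted residuals) and with different aggregation schemes.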


This digest was generated by AI and reviewed by a human editor.