QUICK REVIEW

[Paper Review] Semi-supervised linear regression with missing covariates

Benedict M. Risebrow, Thomas B. Berrett|arXiv (Cornell University)|Feb 14, 2026

Statistical Methods and Bayesian Inference | Cited by 0

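One-sentence summary

The paper develops estimators for linear regression when the labelled data have missing covariates and additional unlabelled data may be available, gives both low-dimensional and high-dimensional results, and shows that the estimators attain the minimax lower bounds under structured and unstructured missingness.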

ABSTRACT

Missing values in datasets are common in applied statistics. For regression problems, theoretical work thus far has largely considered the issue of missing covariates as distinct from missing responses. However, in practice, many datasets have both forms of missingness. Motivated by this gap, we study linear regression with a labelled dataset containing missing covariates, potentially alongside an unlabelled dataset. We consider both structured (blockwise-missing) and unstructured missingness patterns, along with sparse and non-sparse regression parameters. For the non-sparse case, we provide an estimator based on imputing the missing data combined with a reweighting step. For the high-dimensional sparse case, we use a modified version of the Dantzig selector. We provide non-asymptotic upper bounds on the risk of both procedures. These are matched by several new minimax lower bounds, demonstrating the rate optimality of our estimators. Notably, even when the linear model is well-specified, our results characterise substantial differences in the minimax rates when unlabelled data is present relative to the fully supervised setting. Particular consequences of our sparse and non-sparse results include the first matching upper and lower bounds on the minimax rate for the supervised setting when either unstructured or structured missingness is present. Our theory is coupled with extensive simulations and a semi-synthetic application to the California housing dataset.

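Research motivation and objectives

Motivate regression with missing covariates in partially labelled data, and the potential benefit of unlabelled data.
Characterise the minimax rates in low-dimensional and high-dimensional settings under MCAR missingness.
Develop practical estimators that can exploit unlabelled data and the missing-covariate patterns.
Provide theoretical guarantees (upper and lower bounds) and insight into structured vs unstructured missingness patterns.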

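Proposed method

Define a convexified estimator that imputes missing covariates through a projection based on the estimated covariance and then performs weighted least squares (equation (4)).
Introduce missingness patterns described by O_k and M_k, together with weights D_k (oracle D_k^* and data-driven) that balance labelled and unlabelled information.
Develop OSS (ordinary semi-supervised) and supervised two-fold cross-fitting schemes to extend the low-dimensional results.
Provide non-asymptotic upper bounds on the risk and matching minimax lower bounds to establish rate optimality.
Handle both structured (blockwise) and unstructured missingness, with an explicit effective-sample-size interpretation (alpha_i).
Provide a covariance estimation step and a weight estimation procedure that is robust to misspecification and applicable in high-dimensional settings.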
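
The impute-then-reweight idea described above can be illustrated with a minimal Python sketch. It is not the paper's estimator (4): the function names are hypothetical, covariates are assumed to be mean-zero, the covariance is estimated from pairwise-available second moments of the pooled labelled and unlabelled covariates, and the per-row weights are supplied by the user rather than derived from the oracle or data-driven D_k.

```python
import numpy as np

def pairwise_second_moment(X):
    """Second-moment matrix from available pairs (NaN = missing, mean-zero assumed).
    The result need not be positive semi-definite; this is only a sketch."""
    obs = ~np.isnan(X)
    Z = np.where(obs, X, 0.0)
    counts = obs.T.astype(float) @ obs.astype(float)
    return (Z.T @ Z) / np.maximum(counts, 1.0)

def impute_by_projection(X, Sigma):
    """Replace missing entries x_M of each row by Sigma_MO @ inv(Sigma_OO) @ x_O."""
    X_imp = X.copy()
    for i, row in enumerate(X):
        miss = np.isnan(row)
        if not miss.any():
            continue
        obs = ~miss
        if not obs.any():
            X_imp[i, miss] = 0.0  # nothing observed: fall back to the (zero) mean
            continue
        S_oo = Sigma[np.ix_(obs, obs)]
        S_mo = Sigma[np.ix_(miss, obs)]
        X_imp[i, miss] = S_mo @ np.linalg.solve(S_oo, row[obs])
    return X_imp

def weighted_least_squares(X, y, w):
    """Solve (X^T W X) b = X^T W y with W = diag(w)."""
    Xw = X * w[:, None]
    return np.linalg.solve(X.T @ Xw, Xw.T @ y)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, n1, n2, N = 3, 100, 400, 2000  # illustrative sizes, not from the paper
    A = rng.normal(size=(d, d))
    Sigma_true = A @ A.T + np.eye(d)
    beta = np.array([1.0, -2.0, 0.5])

    X_lab = rng.multivariate_normal(np.zeros(d), Sigma_true, size=n1 + n2)
    y = X_lab @ beta + rng.normal(size=n1 + n2)
    X_lab[n1:, 2] = np.nan  # blockwise missingness: last covariate missing in block 2
    X_unlab = rng.multivariate_normal(np.zeros(d), Sigma_true, size=N)

    Sigma_hat = pairwise_second_moment(np.vstack([X_lab, X_unlab]))
    X_imp = impute_by_projection(X_lab, Sigma_hat)
    # Arbitrary illustrative weights: down-weight the imputed block.
    w = np.concatenate([np.ones(n1), np.full(n2, 0.5)])
    print(weighted_least_squares(X_imp, y, w))
```

Rows with imputed covariates carry extra noise from the unobserved part of the covariate, which is why they are down-weighted here; how to choose those weights well is what the paper's D_k address.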

Figure 3: CC refers to a complete case analysis of the 100 complete cases via least squares. SI refers to the estimator (4) with choices of weights $\hat{D}_{1}=\hat{D}_{2}=1$. ISS refers to our estimator (4) with oracle weights $\hat{D}_{1}=1,\hat{D}_{2}=\frac{\sigma^{2}}{\sigma^{2}+(\beta^{*

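Experimental results

Research questions

RQ1: How can unlabelled data be exploited when the labelled samples have missing covariates?
RQ2: How do missingness patterns (structured vs unstructured) affect the optimal estimation rate?
RQ3: Under MCAR missingness, is the method rate-optimal in both the low-dimensional and high-dimensional regimes?
RQ4: What are the minimax risks and rates for blockwise-missing data in the OSS and supervised settings?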

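Key findings

The risk of the estimator obtained from the proposed convex relaxation decomposes into an ISS term and a term that depends on covariance estimation.
In the low-dimensional OSS setting, the results give risk upper bounds and matching lower bounds for both blockwise-missing and unstructured patterns.
In the high-dimensional setting, the paper gives a lower bound that resolves a conjecture and extends the upper bounds to OSS, with rates that match up to constants.
In simple monotone patterns, unlabelled data can reduce the effective dimension; in unstructured patterns, it increases the effective sample size, changing its scaling from rho to rho^{1/2}.
Under the stated assumptions, the method is rate-optimal, with explicit bounds that separate the ISS contribution from the covariance estimation error.
The analysis includes simulations and a semi-synthetic application to the California housing dataset.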
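
For context on the high-dimensional sparse case, the abstract states that the paper uses a modified version of the Dantzig selector. The modification itself is not described in this review, so what follows is only the standard, fully observed Dantzig selector, with $X \in \mathbb{R}^{n \times d}$, $y \in \mathbb{R}^{n}$ and a tuning parameter $\lambda > 0$ (the $1/n$ normalisation is one common convention and is an assumption here):

$$
\hat{\beta} \in \operatorname*{arg\,min}_{b \in \mathbb{R}^{d}} \|b\|_{1}
\quad \text{subject to} \quad
\left\| \tfrac{1}{n} X^{\top}(y - Xb) \right\|_{\infty} \le \lambda .
$$

The constraint depends on the data only through $\tfrac{1}{n}X^{\top}X$ and $\tfrac{1}{n}X^{\top}y$, so one natural way to adapt it to missing covariates is to substitute estimates of these two quantities; whether the paper's modification takes this form is not stated in this review.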

Figure 5: We compute our estimator (4) with unlabelled sample size $N$ varying from $50$ to $5{,}000$. ISS is the ideal semi-supervised estimator (4). CC is the complete case estimator. Labelled sample sizes are $n_{1}=100$ and $n_{2}$ varying from $0$ to $100{,}000$. Error bars from 1,000 re


This review was generated by AI and reviewed by human editors.