QUICK REVIEW

[Paper Review] Semi-supervised linear regression with missing covariates

Benedict M. Risebrow, Thomas B. Berrett|arXiv (Cornell University)|Feb 14, 2026

Statistical Methods and Bayesian Inference|Citations: 0

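One-line summary

The paper develops estimators for linear regression when the labelled data have missing covariates and additional unlabelled data are available, and establishes low- and high-dimensional results together with minimax optimality under structured and unstructured missingness.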

ABSTRACT

Missing values in datasets are common in applied statistics. For regression problems, theoretical work thus far has largely considered the issue of missing covariates as distinct from missing responses. However, in practice, many datasets have both forms of missingness. Motivated by this gap, we study linear regression with a labelled dataset containing missing covariates, potentially alongside an unlabelled dataset. We consider both structured (blockwise-missing) and unstructured missingness patterns, along with sparse and non-sparse regression parameters. For the non-sparse case, we provide an estimator based on imputing the missing data combined with a reweighting step. For the high-dimensional sparse case, we use a modified version of the Dantzig selector. We provide non-asymptotic upper bounds on the risk of both procedures. These are matched by several new minimax lower bounds, demonstrating the rate optimality of our estimators. Notably, even when the linear model is well-specified, our results characterise substantial differences in the minimax rates when unlabelled data is present relative to the fully supervised setting. Particular consequences of our sparse and non-sparse results include the first matching upper and lower bounds on the minimax rate for the supervised setting when either unstructured or structured missingness is present. Our theory is coupled with extensive simulations and a semi-synthetic application to the California housing dataset.

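Motivation and objectives

Motivate regression with missing covariates on partly labelled data, and the potential benefit of unlabelled data.
Characterise the minimax rates in low- and high-dimensional settings under MCAR missingness.
Develop practical estimators that exploit unlabelled data and the missing-covariate patterns.
Provide theoretical guarantees (upper and lower bounds) and insight across structured and unstructured missingness patterns.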

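Proposed method

Define a convexified estimator that imputes missing covariates by a projection based on the estimated covariance and then performs weighted least squares (equation (4)).
Introduce the missingness patterns O_k and M_k, together with weights D_k (oracle D_k^* and a data-driven approximation) that balance labelled and unlabelled information.
Introduce OSS (ordinary semi-supervised) and supervised double cross-validation procedures to extend the low-dimensional results.
Provide non-asymptotic upper bounds on the risk and corresponding minimax lower bounds, establishing rate optimality.
Handle both structured (blockwise) and unstructured missingness, making the effective-sample-size interpretation (α_i) explicit.
Provide a covariance estimation procedure and a weight estimation procedure that is robust to misspecification and applicable in high-dimensional regimes.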
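The impute-then-reweight idea in the list above can be illustrated with a minimal sketch. It is not the paper's estimator (4). It assumes mean-zero covariates, two missingness patterns (one complete, one with the last two covariates missing), a covariance estimate taken from a fully observed unlabelled sample, and user-supplied scalar weights per pattern. The function names `impute_by_projection` and `weighted_ls`, the simulated data, and the choice of unit weights are all hypothetical.

```python
import numpy as np

def impute_by_projection(X_obs, obs_idx, mis_idx, Sigma_hat):
    """Fill missing columns with Sigma_MO Sigma_OO^{-1} x_O.

    Assumes mean-zero covariates; Sigma_hat is any covariance estimate.
    """
    S_oo = Sigma_hat[np.ix_(obs_idx, obs_idx)]
    S_om = Sigma_hat[np.ix_(obs_idx, mis_idx)]
    A = np.linalg.solve(S_oo, S_om)          # A = Sigma_OO^{-1} Sigma_OM
    X_full = np.empty((X_obs.shape[0], len(obs_idx) + len(mis_idx)))
    X_full[:, obs_idx] = X_obs
    X_full[:, mis_idx] = X_obs @ A           # row-wise linear projection
    return X_full

def weighted_ls(blocks, weights):
    """Solve sum_k D_k X_k^T X_k beta = sum_k D_k X_k^T y_k."""
    d = blocks[0][0].shape[1]
    G, b = np.zeros((d, d)), np.zeros(d)
    for (X, y), D in zip(blocks, weights):
        G += D * (X.T @ X)
        b += D * (X.T @ y)
    return np.linalg.solve(G, b)

# Hypothetical simulation: d = 4 equicorrelated Gaussian covariates.
rng = np.random.default_rng(0)
d = 4
Sigma = 0.5 * np.eye(d) + 0.5 * np.ones((d, d))
beta_true = np.array([1.0, -1.0, 0.5, 2.0])
L = np.linalg.cholesky(Sigma)

def draw(n):
    X = rng.standard_normal((n, d)) @ L.T
    return X, X @ beta_true + rng.standard_normal(n)

X1, y1 = draw(200)                  # pattern 1: all covariates observed
X2, y2 = draw(2000)                 # pattern 2: last two covariates missing
X_unlab, _ = draw(5000)             # unlabelled sample, responses discarded

obs_idx, mis_idx = [0, 1], [2, 3]
Sigma_hat = np.cov(X_unlab, rowvar=False)
X2_imp = impute_by_projection(X2[:, obs_idx], obs_idx, mis_idx, Sigma_hat)

# Unit weights here; the imputed block carries extra noise, so a weight
# below one on it would be the natural refinement.
beta_hat = weighted_ls([(X1, y1), (X2_imp, y2)], weights=[1.0, 1.0])
print(np.round(beta_hat, 2))
```

The imputed block alone is rank-deficient, because its filled-in columns are linear combinations of the observed ones; the normal equations are solvable only because the complete block is added in.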

Figure 3: CC refers to a complete case analysis of the 100 complete cases via least squares. SI refers to the estimator (4) with choices of weights $\hat{D}_{1}=\hat{D}_{2}=1$. ISS refers to our estimator (4) with oracle weights $\hat{D}_{1}=1,\hat{D}_{2}=\frac{\sigma^{2}}{\sigma^{2}+(\beta^{*

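Experimental results

Research questions

RQ1: How can unlabelled data be exploited when the labelled samples have missing values?
RQ2: How does the missingness pattern (structured vs unstructured) affect the optimal estimation rate?
RQ3: Are there rate-optimal methods in both the low- and high-dimensional regimes under MCAR missingness?
RQ4: What are the minimax risks and rates in the OSS and supervised settings with blockwise data?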

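Key findings

With the proposed convex relaxation, the risk separates into an ISS term and a term that depends on the covariance estimation error.
In the low-dimensional OSS setting, the upper bounds on the risk are matched by lower bounds for both blockwise and unstructured missingness patterns.
In the high-dimensional setting, the lower bounds resolve a conjecture, and the upper bounds are extended to the OSS setting, giving rates that match up to constants.
Unlabelled data can reduce the effective dimension under simple monotone patterns, and under unstructured patterns it increases the effective sample size from rho to rho^{1/2}.
The proposed methods are rate-optimal under the stated assumptions and come with explicit bounds that separate the ISS contribution from the covariance estimation error.
The analysis includes simulations and a semi-synthetic application to the California housing dataset.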
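As a point of reference for the simulations, the complete case (CC) baseline named in the figure captions can be sketched as follows. This is a minimal illustration under assumed settings, not the paper's experimental setup: entries are missing completely at random with a hypothetical per-entry observation probability `p_obs`, covariates are independent standard Gaussians, and `complete_case_ols` is an illustrative helper name.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, p_obs = 5000, 3, 0.7          # hypothetical sizes and observation rate
beta_true = np.array([1.0, -2.0, 0.5])

X = rng.standard_normal((n, d))
y = X @ beta_true + rng.standard_normal(n)

# MCAR, unstructured: each entry is observed independently of X and y.
mask = rng.random((n, d)) < p_obs   # True means observed
X_miss = np.where(mask, X, np.nan)

def complete_case_ols(X_miss, y):
    """Least squares using only rows with no missing covariates."""
    keep = ~np.isnan(X_miss).any(axis=1)
    coef, *_ = np.linalg.lstsq(X_miss[keep], y[keep], rcond=None)
    return coef, int(keep.sum())

beta_cc, n_cc = complete_case_ols(X_miss, y)
print(n_cc, np.round(beta_cc, 2))
```

With independent entrywise missingness, a row is complete with probability `p_obs ** d`, so the complete-case sample shrinks geometrically in the dimension. That loss is what imputation-based and semi-supervised estimators aim to recover.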

Figure 5: We compute our estimator (4) with unlabelled sample size $N$ varying from $50$ to $5{,}000$. ISS is the ideal semi-supervised estimator (4). CC is the complete case estimator. Labelled sample sizes are $n_{1}=100$ and $n_{2}$ varying from $0$ to $100{,}000$. Error bars from 1,000 re


This review was generated by AI and checked by a human editor.