QUICK REVIEW

[Paper Review] Unbiased estimates for linear regression via volume sampling

Michał Dereziński, Manfred K. Warmuth|arXiv (Cornell University)|May 19, 2017

Sparse and Compressive Sensing Techniques | References: 24 | Cited by: 18

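One-Sentence Summary

This paper proposes a volume sampling method for selecting columns in linear regression and proves that the pseudo-inverse of the sampled column subset is an unbiased estimator of the full pseudo-inverse. The key contribution is an exact closed-form expression for the expected loss of the resulting least-squares solution: under volume sampling of size $d$, the expected loss is exactly $(d+1)$ times the optimal loss, a factor that is optimal and significantly better than what independent and identically distributed (i.i.d.) sampling achieves.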

ABSTRACT

Given a full rank matrix $X$ with more columns than rows, consider the task of estimating the pseudo inverse $X^+$ based on the pseudo inverse of a sampled subset of columns (of size at least the number of rows). We show that this is possible if the subset of columns is chosen proportional to the squared volume spanned by the rows of the chosen submatrix (i.e., volume sampling). The resulting estimator is unbiased and surprisingly the covariance of the estimator also has a closed form: It equals a specific factor times $X^{+\top}X^+$. Pseudo inverse plays an important part in solving the linear least squares problem, where we try to predict a label for each column of $X$. We assume labels are expensive and we are only given the labels for the small subset of columns we sample from $X$. Using our methods we show that the weight vector of the solution for the sub problem is an unbiased estimator of the optimal solution for the whole problem based on all column labels. We believe that these new formulas establish a fundamental connection between linear least squares and volume sampling. We use our methods to obtain an algorithm for volume sampling that is faster than state-of-the-art and for obtaining bounds for the total loss of the estimated least-squares solution on all labeled columns.

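Research Motivation and Objectives

Develop a sampling method that yields an unbiased estimate of the pseudo-inverse $\mathbf{X}^+$ of a wide matrix $\mathbf{X}$ with more columns than rows.
Establish a fundamental connection between volume sampling and linear least-squares regression.
Derive exact expectation formulas for the mean and covariance of the estimator under volume sampling.
Improve sample efficiency in linear regression by minimizing the number of labeled columns required while keeping the loss bounded.
Design a volume sampling algorithm with better time complexity than the current state of the art.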

Proposed Method

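Sample a column subset $S$ of size $s \geq d$ from $\mathbf{X}$ with probability proportional to the squared volume $\det(\mathbf{X}_S \mathbf{X}_S^\top)$ spanned by the rows of the submatrix $\mathbf{X}_S$.
Use the pseudo-inverse of the submatrix $\mathbf{X}_S$ to compute the weight vector of the subproblem, $\mathbf{w}^{*}(S) = (\mathbf{X}_S)^{+\top} \mathbf{y}_S$.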
Use the Sherman-Morrison formula to efficiently maintain and update the inverse of the Gram matrix $\mathbf{X}_S \mathbf{X}_S^\top$ during iterative sampling.
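Design a reverse iterative volume sampling algorithm that starts from all columns and removes them one at a time, each with probability proportional to one minus its leverage score, $1 - \mathbf{x}_i^\top (\mathbf{X}_S \mathbf{X}_S^\top)^{-1} \mathbf{x}_i$.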
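Maintain the inverse Gram matrix $\mathbf{Z} = (\mathbf{X}_S \mathbf{X}_S^\top)^{-1}$ and update it efficiently via rank-one updates.
Use the expectation formula $\mathbb{E}[(\mathbf{X} \mathbf{I}_S)^+] = \mathbf{X}^+$ to prove that the weight vector estimator is unbiased.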
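The reverse iterative procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `reverse_iterative_volume_sample` and its signature are assumptions made for this example. It relies on the matrix determinant lemma, $\det(\mathbf{X}_{S \setminus i} \mathbf{X}_{S \setminus i}^\top) = \det(\mathbf{X}_S \mathbf{X}_S^\top)\,(1 - \mathbf{x}_i^\top \mathbf{Z} \mathbf{x}_i)$, which is why removing column $i$ with probability proportional to $1 - \mathbf{x}_i^\top \mathbf{Z} \mathbf{x}_i$ matches the volume-based weighting. For simplicity the sketch recomputes all leverage scores at every step, so it does not reach the $O((n-s+d)nd)$ running time reported in the paper.

```python
import numpy as np

def reverse_iterative_volume_sample(X, s, rng=None):
    """Hypothetical helper: pick s column indices of the d x n matrix X.

    Starts from all n columns and removes one column per step with
    probability proportional to 1 - x_i^T Z x_i, where
    Z = (X_S X_S^T)^{-1} is maintained by Sherman-Morrison updates.
    Assumes X has full row rank.
    """
    rng = np.random.default_rng(rng)
    d, n = X.shape
    if not d <= s <= n:
        raise ValueError("need d <= s <= n")
    S = list(range(n))
    Z = np.linalg.inv(X @ X.T)
    while len(S) > s:
        XS = X[:, S]
        # Leverage scores h_i = x_i^T Z x_i of the columns still in S.
        h = np.einsum("ij,ij->j", XS, Z @ XS)
        # Removal weights; clip guards against tiny negative round-off.
        p = np.clip(1.0 - h, 0.0, None)
        p /= p.sum()
        k = rng.choice(len(S), p=p)
        v = Z @ XS[:, k]
        # Sherman-Morrison downdate:
        # (A - x x^T)^{-1} = Z + (Z x)(Z x)^T / (1 - x^T Z x)
        Z = Z + np.outer(v, v) / (1.0 - h[k])
        S.pop(k)
    return sorted(S), Z

# Usage on a small random instance (sizes chosen arbitrarily).
demo_rng = np.random.default_rng(0)
X_demo = demo_rng.standard_normal((3, 8))
S_demo, Z_demo = reverse_iterative_volume_sample(X_demo, 4, rng=1)
print(S_demo)
```

The returned `Z` should agree with a direct inversion of $\mathbf{X}_S \mathbf{X}_S^\top$ for the final subset, which is a convenient sanity check on the rank-one updates.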

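Results

Research Questions

RQ1: Can volume sampling produce an unbiased estimate of the pseudo-inverse $\mathbf{X}^+$ when $s \geq d$ columns are selected?
RQ2: How does the expected loss of the least-squares solution computed from a volume-sampled subset compare with that of the full solution?
RQ3: Can volume sampling achieve a multiplicative loss bound at sample size $s = d$, and is that bound optimal?
RQ4: How can volume sampling be computed efficiently, and is its time complexity better than that of existing methods?
RQ5: Can repeated sampling improve the loss bound from $d+1$ to $1+\epsilon$?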

Key Findings

For any sample size $s \geq d$, the estimator $\mathbf{w}^{*}(S)$ obtained by volume sampling is an unbiased estimate of the optimal weight vector $\mathbf{w}^*$, i.e., $\mathbb{E}[\mathbf{w}^{*}(S)] = \mathbf{w}^*$.
When $s = d$, the expected loss of the sampled solution satisfies $\mathbb{E}[L(\mathbf{w}^{*}(S))] = (d+1)L(\mathbf{w}^*)$, and this factor is optimal.
The second moment of the estimator, $\mathbb{E}[(\mathbf{X} \mathbf{I}_S)^{+\top} (\mathbf{X} \mathbf{I}_S)^{+}]$, has a closed form equal to $\frac{n-d+1}{s-d+1} \mathbf{X}^{+\top} \mathbf{X}^+$.
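The proposed reverse iterative volume sampling algorithm runs in $O((n-s+d)nd)$ time, a factor of $n^2$ faster than the previous state of the art.
When $s > d$, repeatedly sampling subsets of size $d$ reduces the loss factor from $d+1$ to $1+\epsilon$ with high probability.
Volume sampling outperforms i.i.d. sampling methods such as leverage score sampling, which need $\Omega(d \log d)$ samples to achieve a constant loss factor.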
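The three expectation formulas above can be checked numerically on a tiny instance by enumerating every subset and weighting it by its volume-sampling probability. The sketch below is only an illustration under stated assumptions: the helper names `volume_sampling_distribution` and `padded_pinv` are made up for this example, the data are random Gaussian (so the columns are in general position), and the loss is taken to be $L(\mathbf{w}) = \lVert \mathbf{X}^\top \mathbf{w} - \mathbf{y} \rVert^2$ with $\mathbf{w}^* = \mathbf{X}^{+\top}\mathbf{y}$. Brute-force enumeration is exponential in $n$ and is meant for verification only.

```python
import itertools
import numpy as np

def volume_sampling_distribution(X, s):
    """Hypothetical helper: all size-s column subsets of X and their
    probabilities, proportional to det(X_S X_S^T)."""
    d, n = X.shape
    subsets = list(itertools.combinations(range(n), s))
    weights = np.array(
        [np.linalg.det(X[:, list(S)] @ X[:, list(S)].T) for S in subsets]
    )
    return subsets, weights / weights.sum()

def padded_pinv(X, S):
    """Hypothetical helper: pseudo-inverse of X I_S, an n x d matrix
    whose rows outside S are zero."""
    d, n = X.shape
    out = np.zeros((n, d))
    out[list(S), :] = np.linalg.pinv(X[:, list(S)])
    return out

rng = np.random.default_rng(0)
d, n, s = 2, 6, 3                      # arbitrary small sizes
X = rng.standard_normal((d, n))
y = rng.standard_normal(n)

X_pinv = np.linalg.pinv(X)             # n x d
w_opt = X_pinv.T @ y                   # optimal weights on all labels

def loss(w):
    # Total squared loss over all n labeled columns.
    return np.sum((X.T @ w - y) ** 2)

# Mean and second moment of the estimator for subsets of size s.
subsets, probs = volume_sampling_distribution(X, s)
mean_pinv = sum(p * padded_pinv(X, S) for S, p in zip(subsets, probs))
second_moment = sum(
    p * padded_pinv(X, S).T @ padded_pinv(X, S)
    for S, p in zip(subsets, probs)
)
unbiased_ok = np.allclose(mean_pinv, X_pinv)
covariance_ok = np.allclose(
    second_moment, (n - d + 1) / (s - d + 1) * X_pinv.T @ X_pinv
)

# Expected loss for subsets of size exactly d.
subsets_d, probs_d = volume_sampling_distribution(X, d)
expected_loss = sum(
    p * loss(np.linalg.pinv(X[:, list(S)]).T @ y[list(S)])
    for S, p in zip(subsets_d, probs_d)
)
loss_ok = np.isclose(expected_loss, (d + 1) * loss(w_opt))

print(unbiased_ok, covariance_ok, loss_ok)
```

Because $\mathbf{w}^{*}(S) = (\mathbf{X}\mathbf{I}_S)^{+\top}\mathbf{y}$, the first check also implies $\mathbb{E}[\mathbf{w}^{*}(S)] = \mathbf{w}^*$ for the same instance.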

This review was generated by AI and reviewed by human editors.