QUICK REVIEW

[Paper Review] Leveraged volume sampling for linear regression

Michał Dereziński, Manfred K. Warmuth|arXiv (Cornell University)|Feb 19, 2018

Markov Chains and Monte Carlo Methods|Cited by 44

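One-sentence summary

The paper identifies limitations of standard volume sampling for linear regression and introduces leveraged volume sampling, a rescaled variant with an efficient rejection sampling algorithm that yields an unbiased estimator and achieves a 1+ε loss bound with k = O(d log d + d/ε) sampled responses.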

ABSTRACT

Suppose an $n \times d$ design matrix in a linear regression problem is given, but the response for each point is hidden unless explicitly requested. The goal is to sample only a small number $k \ll n$ of the responses, and then produce a weight vector whose sum of squares loss over all points is at most $1+ε$ times the minimum. When $k$ is very small (e.g., $k=d$), jointly sampling diverse subsets of points is crucial. One such method called volume sampling has a unique and desirable property that the weight vector it produces is an unbiased estimate of the optimum. It is therefore natural to ask if this method offers the optimal unbiased estimate in terms of the number of responses $k$ needed to achieve a $1+ε$ loss approximation. Surprisingly we show that volume sampling can have poor behavior when we require a very accurate approximation -- indeed worse than some i.i.d. sampling techniques whose estimates are biased, such as leverage score sampling. We then develop a new rescaled variant of volume sampling that produces an unbiased estimate which avoids this bad behavior and has at least as good a tail bound as leverage score sampling: sample size $k=O(d\log d + d/ε)$ suffices to guarantee total loss at most $1+ε$ times the minimum with high probability. Thus, we improve on the best previously known sample size for an unbiased estimator, $k=O(d^2/ε)$. Our rescaling procedure leads to a new efficient algorithm for volume sampling which is based on a determinantal rejection sampling technique with potentially broader applications to determinantal point processes. Other contributions include introducing the combinatorics needed for rescaled volume sampling and developing tail bounds for sums of dependent random matrices which arise in the process.
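
The goal stated in the abstract can be written compactly. The symbols below ($X$, $y$, $w^*$, $L$, $\hat{w}_S$, $\delta$) are notation assumed for this note and do not appear in the abstract itself:

$$
L(w) = \|Xw - y\|^2, \qquad w^* = \arg\min_{w} L(w),
$$

$$
\mathbb{E}\big[\hat{w}_S\big] = w^*, \qquad \Pr\big[\, L(\hat{w}_S) \le (1+\epsilon)\, L(w^*) \,\big] \ge 1 - \delta ,
$$

where $\hat{w}_S$ is the weight vector computed from the $k$ sampled responses only. The first condition is unbiasedness and the second is the $1+\epsilon$ loss guarantee holding with high probability.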

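Motivation and objectives

Motivate subsampling the responses in linear regression when obtaining them is expensive.
Analyze the performance of standard volume sampling and identify its limitations at small sample sizes.
Develop a rescaled volume sampling method that preserves unbiasedness and improves the tail bounds.
Provide an efficient algorithm (determinantal rejection sampling) that implements leveraged volume sampling.
Establish theoretical bounds showing that the sample complexity for unbiased estimators is near-optimal.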

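Proposed method

Introduce q-rescaled volume sampling and prove that it is unbiased for any rescaling q.
Prove a new extension of the Cauchy-Binet formula, used to compute the normalization of rescaled volume sampling.
Develop determinantal rejection sampling with a leverage-score-based q to produce samples efficiently.
Show that using leverage scores yields an unbiased estimator with favorable matrix tail bounds.
Derive the sample complexity k = O(d log d + d/ε) that achieves the 1+ε loss bound with high probability.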
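
The first two points can be checked numerically on a tiny instance. The sketch below is illustrative and is not the paper's code. It assumes the q-rescaled volume sampling distribution over sequences π in [n]^k is proportional to det(X^T Q_π X) · Π q_{π_i}, where Q_π puts weight 1/q_i on each sampled row, and that the normalization equals k(k-1)···(k-d+1) · det(X^T X). All function names are hypothetical, and brute-force enumeration is only feasible for very small n and k.

```python
import itertools
import math
import numpy as np

def leverage_distribution(X):
    """q_i = l_i / d, where l_i = x_i^T (X^T X)^{-1} x_i is the i-th leverage score."""
    Q, _ = np.linalg.qr(X)  # thin QR: squared row norms of Q are the leverage scores
    return np.sum(Q**2, axis=1) / X.shape[1]

def rescaled_volume_weights(X, q, k):
    """Unnormalized weight det(X^T Q_pi X) * prod_i q_{pi_i} for every pi in [n]^k."""
    n = X.shape[0]
    seqs = [list(pi) for pi in itertools.product(range(n), repeat=k)]
    weights = np.array([
        np.linalg.det((X[pi].T / q[pi]) @ X[pi]) * np.prod(q[pi]) for pi in seqs
    ])
    return seqs, weights

def rescaled_estimator(X, y, q, pi):
    """Least squares on the sampled rows, each rescaled by 1/sqrt(q_i)."""
    scale = 1.0 / np.sqrt(q[pi])
    return np.linalg.lstsq(X[pi] * scale[:, None], y[pi] * scale, rcond=None)[0]

def check(X, y, q, k):
    """Return (sum of weights, claimed normalizer, mean estimator, full solution)."""
    d = X.shape[1]
    seqs, weights = rescaled_volume_weights(X, q, k)
    falling = math.factorial(k) // math.factorial(k - d)  # k(k-1)...(k-d+1)
    normalizer = falling * np.linalg.det(X.T @ X)
    probs = weights / weights.sum()
    mean_w = sum(p * rescaled_estimator(X, y, q, pi) for p, pi in zip(probs, seqs))
    w_star = np.linalg.lstsq(X, y, rcond=None)[0]
    return weights.sum(), normalizer, mean_w, w_star

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 2))   # n = 5 points, d = 2 features
y = rng.standard_normal(5)
total, normalizer, mean_w, w_star = check(X, y, leverage_distribution(X), k=3)
print(total, normalizer)          # the two normalization values
print(mean_w, w_star)             # expected estimator vs. full least-squares solution
```

Passing a different distribution for `q`, such as the uniform one, exercises the "for any rescaling q" part of the claim on the same data.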

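Experimental results

Research questions

RQ1: On worst-case data, does standard volume sampling provide a 1+ε loss guarantee at small sample sizes?
RQ2: Can volume sampling be modified to perform better at small k while remaining unbiased?
RQ3: Which rescaling strategy improves the tail bounds while keeping the subsampled linear regression estimator unbiased?
RQ4: Can an efficient algorithm be designed to implement the new rescaled volume sampling?
RQ5: What sample complexity is needed to achieve a 1+ε approximation with high probability?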

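Key findings

Standard volume sampling can perform poorly at small k: on certain constructions it fails to reach a 1+ε loss when a very accurate approximation is required.
Rescaled volume sampling yields an unbiased estimator of the least squares solution for any rescaling q.
Choosing q proportional to the leverage scores (leveraged volume sampling) keeps the estimator unbiased and makes an efficient rejection sampling algorithm possible.
Leveraged volume sampling achieves a multiplicative tail bound with sample size k = O(d log d + d/ε).
The proposed determinantal rejection sampling algorithm runs, with high probability, in roughly O((d^2 + k)d^2 log(1/δ)) time, where δ is the allowed failure probability, and it relies on leverage-score-based rescaling for efficiency.
The method improves the best known sample size for an unbiased estimator from k = O(d^2/ε) to k = O(d log d + d/ε).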
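
A minimal sketch of how determinantal rejection sampling with leverage-score rescaling might look is given below. It is a reading of the description above, not the authors' implementation. The oversampling size s = max(k, 4d^2), the acceptance probability det((1/s) X^T Q_π X) / det(X^T X), and the down-sampling step (standard volume sampling by removing rows one at a time) are all assumptions of this sketch, and every function name is hypothetical. It requires k ≥ d and a full-rank X.

```python
import numpy as np

def leverage_scores(X):
    """l_i = x_i^T (X^T X)^{-1} x_i via thin QR; the scores sum to d."""
    Q, _ = np.linalg.qr(X)
    return np.sum(Q**2, axis=1)

def downsample_by_volume(Z, k, rng):
    """Reduce the rows of Z to k by repeatedly dropping row i with
    probability proportional to 1 - z_i^T (Z_S^T Z_S)^{-1} z_i."""
    S = list(range(Z.shape[0]))
    while len(S) > k:
        ZS = Z[S]
        G_inv = np.linalg.inv(ZS.T @ ZS)
        lev = np.einsum("ij,jk,ik->i", ZS, G_inv, ZS)
        p = np.clip(1.0 - lev, 0.0, None)   # guard against tiny negative values
        p /= p.sum()
        S.pop(int(rng.choice(len(S), p=p)))
    return S

def leveraged_volume_sample(X, k, rng):
    """Return k row indices of X and the sampling distribution q."""
    n, d = X.shape
    q = leverage_scores(X)
    q = q / q.sum()                          # q_i = l_i / d
    s = max(k, 4 * d * d)                    # assumed oversampling size
    _, logdet_full = np.linalg.slogdet(X.T @ X)
    while True:
        pi = rng.choice(n, size=s, p=q)      # i.i.d. leverage score sampling
        Z = X[pi] / np.sqrt(q[pi])[:, None]  # rows rescaled by 1/sqrt(q_i)
        sign, logdet = np.linalg.slogdet(Z.T @ Z / s)
        accept = np.exp(logdet - logdet_full) if sign > 0 else 0.0
        if rng.random() < min(accept, 1.0):  # determinantal acceptance test
            break
    S = downsample_by_volume(Z, k, rng)
    return pi[S], q

def rescaled_least_squares(X, y, idx, q):
    """Weighted least squares using only the responses y[idx]."""
    scale = 1.0 / np.sqrt(q[idx])
    return np.linalg.lstsq(X[idx] * scale[:, None], y[idx] * scale, rcond=None)[0]

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(200)
idx, q = leveraged_volume_sample(X, k=12, rng=rng)
w_hat = rescaled_least_squares(X, y, idx, q)  # touches only 12 of the 200 responses
```

With leverage-score rescaling, each rescaled term has the same trace after whitening by X^T X, so the determinant ratio cannot exceed 1 and is a valid acceptance probability; the `min(accept, 1.0)` only absorbs floating-point error.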


This review was generated by AI and checked by human editors.