QUICK REVIEW

[Paper Review] Outlier Robust Multivariate Polynomial Regression

Vipul Arora, Arnab Bhattacharyya|arXiv (Cornell University)|Jan 1, 2024

Advanced Statistical Methods and Models | Cited by 1

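One-Sentence Summary

This paper presents a robust multivariate polynomial regression algorithm that tolerates an outlier fraction ρ < 1/2 of arbitrary values and achieves ℓ∞-error O(σ) using O_n(d^n log d) samples under the Chebyshev distribution, or O_n(d^{2n} log d) samples under the uniform distribution, where the hidden constant depends on n. The method uses a structured polynomial basis and a node-based partition of the domain for outlier-robust fitting, and an information-theoretic lower bound shows that the sample complexity is optimal in terms of d^n.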

ABSTRACT

We study the problem of robust multivariate polynomial regression: let $p\colon\mathbb{R}^n \to \mathbb{R}$ be an unknown $n$-variate polynomial of degree at most $d$ in each variable. We are given as input a set of random samples $(\mathbf{x}_i,y_i) \in [-1,1]^n \times \mathbb{R}$ that are noisy versions of $(\mathbf{x}_i,p(\mathbf{x}_i))$. More precisely, each $\mathbf{x}_i$ is sampled independently from some distribution $\chi$ on $[-1,1]^n$, and for each $i$ independently, $y_i$ is arbitrary (i.e., an outlier) with probability at most $\rho < 1/2$, and otherwise satisfies $|y_i-p(\mathbf{x}_i)|\leq\sigma$. The goal is to output a polynomial $\hat{p}$, of degree at most $d$ in each variable, within an $\ell_\infty$-distance of at most $O(\sigma)$ from $p$. Kane, Karmalkar, and Price [FOCS'17] solved this problem for $n=1$. We generalize their results to the $n$-variate setting, showing an algorithm that achieves a sample complexity of $O_n(d^n\log d)$, where the hidden constant depends on $n$, if $\chi$ is the $n$-dimensional Chebyshev distribution. The sample complexity is $O_n(d^{2n}\log d)$, if the samples are drawn from the uniform distribution instead. The approximation error is guaranteed to be at most $O(\sigma)$, and the run-time depends on $\log(1/\sigma)$. In the setting where each $\mathbf{x}_i$ and $y_i$ are known up to $N$ bits of precision, the run-time's dependence on $N$ is linear. We also show that our sample complexities are optimal in terms of $d^n$. Furthermore, we show that it is possible to have the run-time be independent of $1/\sigma$, at the cost of a higher sample complexity.
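The sampling model described in the abstract can be simulated for small cases. The sketch below is restricted to n = 2 and is only an illustration: the function names `sample_chebyshev` and `make_samples`, the `outlier_scale` parameter, the uniform inlier noise, and the Gaussian outlier values are all assumptions made here, since the model itself allows any inlier perturbation of size at most σ and any outlier values at all.

```python
import numpy as np
from numpy.polynomial import chebyshev as C


def sample_chebyshev(num, n, rng):
    """Draw `num` points from the n-dimensional Chebyshev distribution.

    Each coordinate is cos(theta) with theta uniform on [0, pi], which gives
    the density 1 / (pi * sqrt(1 - x^2)) on [-1, 1].
    """
    theta = rng.uniform(0.0, np.pi, size=(num, n))
    return np.cos(theta)


def make_samples(coef, num, rho, sigma, rng, outlier_scale=100.0):
    """Simulate (x_i, y_i) for n = 2 under the abstract's noise model.

    coef[i, j] is the coefficient of T_i(x1) * T_j(x2), so p has degree at
    most coef.shape[0] - 1 in x1 and coef.shape[1] - 1 in x2.
    """
    x = sample_chebyshev(num, 2, rng)
    clean = C.chebval2d(x[:, 0], x[:, 1], coef)
    # Inliers: any perturbation of size at most sigma is allowed; uniform
    # noise is an assumption made for this sketch.
    y = clean + rng.uniform(-sigma, sigma, size=num)
    # Outliers: each sample is corrupted independently with probability rho.
    # The model allows arbitrary values; a wide Gaussian is an assumption.
    is_outlier = rng.random(num) < rho
    y[is_outlier] = rng.normal(0.0, outlier_scale, size=int(is_outlier.sum()))
    return x, y, is_outlier
```

Calling `make_samples(coef, 2000, 0.4, 0.1, np.random.default_rng(0))` with a 4 x 4 coefficient array returns points in [-1, 1]^2, their labels, and a mask marking which labels were replaced by outliers.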

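Motivation and Objectives

Design an efficient multivariate polynomial regression algorithm that stays accurate even when a fraction ρ < 1/2 of the training samples are adversarial outliers.
Minimize the sample complexity of learning an n-variate polynomial (degree at most d in each variable) from samples corrupted by noise and outliers.
For any outlier fraction ρ < 1/2, achieve a constant approximation factor: an ℓ∞-error bounded by O(σ), independent of the outlier fraction.
Prove a tight sample complexity lower bound showing that any algorithm succeeding with constant probability needs Ω((cd)^n log d) samples.
Extend the earlier univariate robust regression result (Kane, Karmalkar, and Price, FOCS'17) to the multivariate setting, with sample complexity that is optimal in terms of d^n and an efficient run-time.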

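Proposed Method

Use a structured polynomial basis derived from Chebyshev-type polynomials to build local approximating functions with controlled ℓ∞-norm behavior.
Place a grid of m^n nodes on the domain [−1,1]^n, where m = ⌊d^{α/2}⌋, and define a local polynomial p_{b_j} at each node b_j.
Apply Lemma 7.6 to keep the magnitude of each local polynomial p_{b_j}(x) within O(1/d) times the distance to the nearest node, which ensures that it is effectively locally supported.
Define a global function f_S(x) = ∑_{j∈S} p_{b_j}(x) for a subset S of nodes, and control the global error using the triangle inequality and the proximity of the nodes.
Fit robustly to outliers by exploiting the fact that outliers are unlikely to concentrate in small isolated regions, and obtain the error bound through probabilistic concentration.
Prove the sample complexity lower bound with a statistical indistinguishability argument between two candidate polynomials f_S and f_{S′} that differ only in the polynomial at a single node.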
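The intuition that outliers rarely dominate a small region can be illustrated with a much simpler procedure than the one in the paper. The sketch below is a stand-in, not the authors' algorithm: for n = 2 it splits the square into cells of equal Chebyshev mass, takes the median label in each cell (which stays close to the true value as long as fewer than half of the labels in the cell are outliers), and then fits a tensor-product Chebyshev polynomial to the cell medians by ordinary least squares. The name `cell_median_fit` and the cells-per-axis parameter `m` are hypothetical, and `m` here is unrelated to the m = ⌊d^{α/2}⌋ used in the summary.

```python
import numpy as np
from numpy.polynomial import chebyshev as C


def cell_median_fit(x, y, d, m):
    """Illustrative robust fit for n = 2: cell medians plus least squares.

    x: array of shape (num, 2) with entries in [-1, 1]; y: array of shape (num,).
    Returns a (d + 1, d + 1) array of tensor-product Chebyshev coefficients.
    """
    # Equal-width cells in theta = arccos(x) carry equal Chebyshev mass.
    theta = np.arccos(np.clip(x, -1.0, 1.0))
    idx = np.minimum((theta / np.pi * m).astype(int), m - 1)
    flat = idx[:, 0] * m + idx[:, 1]

    centers, medians = [], []
    for cell in np.unique(flat):  # only cells that contain samples
        i, j = divmod(int(cell), m)
        # Representative point: cosine of the cell's angular midpoint.
        centers.append([np.cos((i + 0.5) * np.pi / m),
                        np.cos((j + 0.5) * np.pi / m)])
        # The median ignores outliers as long as they are a minority here.
        medians.append(np.median(y[flat == cell]))
    centers = np.array(centers)
    medians = np.array(medians)

    # Least-squares fit of the tensor Chebyshev basis to the cell medians.
    V = C.chebvander2d(centers[:, 0], centers[:, 1], [d, d])
    coef, *_ = np.linalg.lstsq(V, medians, rcond=None)
    return coef.reshape(d + 1, d + 1)
```

The fit needs at least (d + 1)^2 non-empty cells to be determined, so `m` should be chosen above d + 1 while still leaving enough samples per cell for the median to be reliable.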

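Results

Research Questions

RQ1: Can robust regression for multivariate polynomials of individual degree d be achieved with a sample complexity that is subexponential in n?
RQ2: What is the optimal sample complexity of robust regression in the multivariate setting under the Chebyshev and uniform distributions?
RQ3: Can the algorithm achieve an ℓ∞-error of O(σ) while remaining robust to adversarial outliers at a rate ρ < 1/2?
RQ4: Are the proposed sample complexities tight, or can they be improved asymptotically?
RQ5: Can the run-time be made independent of 1/σ without increasing the sample complexity?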

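Key Findings

Under the n-dimensional Chebyshev distribution, the algorithm achieves ℓ∞-error O(σ) for multivariate polynomial regression with sample complexity O_n(d^n log d).
Under the uniform distribution, the sample complexity rises to O_n(d^{2n} log d), which remains optimal up to logarithmic factors.
The run-time is linear in the bit precision N of the input data and depends on log(1/σ); a run-time independent of 1/σ is possible at the cost of a higher sample complexity.
The sample complexity is shown to be optimal up to logarithmic factors: any algorithm needs at least (cd)^n log d samples to succeed with constant probability, where c = c(C, ρ) > 0.
The lower bound shows that no algorithm can succeed with probability above 2/3 using fewer than (cd)^n log d samples, which establishes tightness of the sample complexity in terms of d^n.
The method shows that, under the given sampling model, the ℓ∞-error stays bounded by O(σ) even when the outlier values are chosen adversarially after all samples have been seen.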
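To get a feel for how the two upper bounds scale, the short sketch below evaluates d^n log d and d^{2n} log d directly. The helper name `sample_bound` is hypothetical, and the hidden constant (which depends on n) is set to 1, so the outputs indicate growth rates only and are not actual sample counts.

```python
import math


def sample_bound(d, n, distribution="chebyshev"):
    """Return d^n * log(d) (Chebyshev) or d^(2n) * log(d) (uniform).

    The hidden n-dependent constant is set to 1, which is an assumption.
    """
    if distribution == "chebyshev":
        exponent = n
    elif distribution == "uniform":
        exponent = 2 * n
    else:
        raise ValueError("distribution must be 'chebyshev' or 'uniform'")
    return d ** exponent * math.log(d)


# The ratio between the two bounds is exactly d^n.
for d, n in [(10, 2), (10, 3), (20, 3)]:
    cheb = sample_bound(d, n, "chebyshev")
    unif = sample_bound(d, n, "uniform")
    print(f"d={d}, n={n}: chebyshev ~ {cheb:.3g}, uniform ~ {unif:.3g}")
```

Because the uniform bound exceeds the Chebyshev bound by a factor of d^n, the gap widens quickly as either the degree or the number of variables grows.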

This review was generated by AI and checked by a human editor.