QUICK REVIEW

[Paper Review] Generalization error of random features and kernel methods: hypercontractivity and kernel matrix concentration

Mei Song, Theodor Misiakiewicz|arXiv (Cornell University)|Jan 26, 2021

Stochastic Gradient Optimization Techniques | References: 39 | Cited by: 18

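One-sentence summary

Under spectral conditions on the kernel and a hypercontractivity condition on its eigenfunctions, the paper gives a sharp characterization of the generalization error of random features ridge regression and kernel ridge regression. Random features ridge regression matches kernel ridge regression once the number of features satisfies $N \geq n^{1+\delta}$ for some $\delta > 0$; when $N \leq n^{1-\delta}$, the test error is dominated by the approximation error, and the gap between the two methods is characterized precisely.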

ABSTRACT

Consider the classical supervised learning problem: we are given data $(y_i,{\boldsymbol x}_i)$, $i\le n$, with $y_i$ a response and ${\boldsymbol x}_i\in {\mathcal X}$ a covariates vector, and try to learn a model $f:{\mathcal X}\to{\mathbb R}$ to predict future responses. Random features methods map the covariates vector ${\boldsymbol x}_i$ to a point ${\boldsymbol \phi}({\boldsymbol x}_i)$ in a higher dimensional space ${\mathbb R}^N$, via a random featurization map ${\boldsymbol \phi}$. We study the use of random features methods in conjunction with ridge regression in the feature space ${\mathbb R}^N$. This can be viewed as a finite-dimensional approximation of kernel ridge regression (KRR), or as a stylized model for neural networks in the so-called lazy training regime. We define a class of problems satisfying certain spectral conditions on the underlying kernels, and a hypercontractivity assumption on the associated eigenfunctions. These conditions are verified by classical high-dimensional examples. Under these conditions, we prove a sharp characterization of the error of random features ridge regression. In particular, we address two fundamental questions: (1) What is the generalization error of KRR? (2) How big $N$ should be for the random features approximation to achieve the same error as KRR? In this setting, we prove that KRR is well approximated by a projection onto the top $\ell$ eigenfunctions of the kernel, where $\ell$ depends on the sample size $n$. We show that the test error of random features ridge regression is dominated by its approximation error and is larger than the error of KRR as long as $N\le n^{1-\delta}$ for some $\delta>0$. We characterize this gap. For $N\ge n^{1+\delta}$, random features achieve the same error as the corresponding KRR, and further increasing $N$ does not lead to a significant change in test error.

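Motivation and goals

Understand the generalization error of random features ridge regression (RFRR) in high-dimensional settings.
Determine the minimal number of features $N$ required for RFRR to match the performance of kernel ridge regression (KRR).
Establish conditions under which RFRR approximates KRR with controlled error.
Characterize the trade-off between approximation error and estimation error in RFRR.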
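
For reference, the two estimators being compared can be written in a generic form as follows. This is only a sketch: the symbols $\hat{\boldsymbol a}$, $\lambda$, ${\mathcal H}$, $f_\star$ and $R$, as well as the normalization of the ridge penalty, are assumptions introduced here for illustration, and the paper's exact scaling may differ.

```latex
\begin{align*}
\hat{\boldsymbol a}(\lambda)
  &= \arg\min_{\boldsymbol a \in \mathbb{R}^N}
     \sum_{i \le n} \big( y_i - \langle \boldsymbol a, \boldsymbol\phi(\boldsymbol x_i) \rangle \big)^2
     + \lambda \|\boldsymbol a\|_2^2,
  & \hat f_{\mathrm{RF}}(\boldsymbol x)
  &= \langle \hat{\boldsymbol a}(\lambda), \boldsymbol\phi(\boldsymbol x) \rangle, \\
\hat f_{\mathrm{KRR}}
  &= \arg\min_{f \in \mathcal H}
     \sum_{i \le n} \big( y_i - f(\boldsymbol x_i) \big)^2
     + \lambda \|f\|_{\mathcal H}^2,
  & R(\hat f)
  &= \mathbb{E}_{\boldsymbol x}\big[ \big( f_\star(\boldsymbol x) - \hat f(\boldsymbol x) \big)^2 \big].
\end{align*}
```

Here ${\mathcal H}$ stands for the reproducing kernel Hilbert space of the kernel induced by the random features, and $f_\star$ for the regression function; the goals above concern how $R(\hat f_{\mathrm{RF}})$ compares with $R(\hat f_{\mathrm{KRR}})$ as $N$ and $n$ vary.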

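Proposed approach

The authors define a class of kernels that satisfy spectral conditions together with a hypercontractivity assumption on the eigenfunctions; these conditions can be verified in classical high-dimensional models.
RFRR is treated as a finite-dimensional approximation of KRR, with a focus on the role of the top $\ell$ eigenfunctions of the kernel operator.
The analysis relies on concentration of measure for the random feature matrix and on bounds for the empirical kernel matrix obtained from its spectral decomposition.
Key technical tools include hypercontractive inequalities on the sphere and the hypercube, which control the higher-order moments of polynomial eigenfunctions.
The generalization error is decomposed into an approximation part and an estimation part, and tight bounds are derived by combining matrix concentration with eigenvalue analysis.
The theoretical results are verified on two canonical examples in which hypercontractivity holds: the binary hypercube and the unit sphere.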
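
The view of RFRR as a finite-dimensional approximation of KRR can be illustrated with a small numerical sketch. The example below is not the paper's setting: it assumes a Gaussian kernel approximated by random Fourier features, Gaussian covariates, and a toy target, and all names (`gaussian_kernel`, `krr_predict`, `rfrr_predict`, `gap_to_krr`) and parameter values are hypothetical choices made for illustration. It only shows, qualitatively, that RFRR predictions move closer to KRR predictions as $N$ grows past $n$.

```python
import numpy as np


def gaussian_kernel(A, B, bandwidth):
    """Gaussian kernel matrix K[i, j] = exp(-||A_i - B_j||^2 / (2 * bandwidth^2))."""
    sq_dists = (A**2).sum(axis=1)[:, None] + (B**2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))


def krr_predict(X, y, X_test, lam, bandwidth):
    """Kernel ridge regression: solve (K + lam * I) alpha = y, then predict."""
    K = gaussian_kernel(X, X, bandwidth)
    alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)
    return gaussian_kernel(X_test, X, bandwidth) @ alpha


def rfrr_predict(X, y, X_test, lam, bandwidth, N, rng):
    """Ridge regression on N random Fourier features of the Gaussian kernel."""
    d = X.shape[1]
    W = rng.normal(scale=1.0 / bandwidth, size=(d, N))  # random frequencies
    b = rng.uniform(0.0, 2.0 * np.pi, size=N)           # random phases

    def features(Z):
        # E[features(x) . features(x')] equals the Gaussian kernel value
        return np.sqrt(2.0 / N) * np.cos(Z @ W + b)

    Phi = features(X)
    a = np.linalg.solve(Phi.T @ Phi + lam * np.eye(N), Phi.T @ y)
    return features(X_test) @ a


def gap_to_krr(N, n=200, d=5, lam=0.1, seed=0):
    """Mean squared difference between RFRR and KRR predictions on test points."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    X_test = rng.normal(size=(500, d))
    y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)  # toy target (assumption)
    bandwidth = np.sqrt(d)
    f_krr = krr_predict(X, y, X_test, lam, bandwidth)
    f_rf = rfrr_predict(X, y, X_test, lam, bandwidth, N, rng)
    return float(np.mean((f_rf - f_krr) ** 2))


if __name__ == "__main__":
    # Feature counts well below, equal to, and well above n = 200
    for N in (20, 200, 2000):
        print(N, gap_to_krr(N))
```

Using the same ridge parameter for both estimators is a deliberate simplification: the RFRR prediction equals a KRR prediction with the empirical feature kernel in place of the exact one, so the remaining gap reflects only how well the random features approximate the kernel.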

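Results

Research questions

RQ1: Under the spectral and hypercontractivity assumptions on the kernel, what is the generalization error of kernel ridge regression?
RQ2: How large must the number of features $N$ be for random features ridge regression to achieve the same generalization error as kernel ridge regression?
RQ3: Is the dominant source of error in random features ridge regression the approximation error or the estimation error?
RQ4: How does the approximation error of random features depend on the sample size $n$ and the number of features $N$?
RQ5: Can kernel matrix concentration and hypercontractivity be used to obtain tight bounds on the generalization error of RFRR?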

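Main findings

When $N \leq n^{1-\delta}$ for some $\delta > 0$, the generalization error of RFRR is dominated by its approximation error and is larger than the error of KRR.
When $N \geq n^{1+\delta}$, RFRR attains the same test error as the corresponding KRR, and increasing $N$ further does not change the test error significantly.
KRR is well approximated by a projection onto the top $\ell$ eigenfunctions of the kernel operator, where $\ell$ depends on the sample size $n$.
The gap between the generalization errors of RFRR and KRR is characterized quantitatively in the regime $N \leq n^{1-\delta}$.
Hypercontractivity of the underlying measure (Gaussian, uniform on the sphere, or uniform on the hypercube) ensures that polynomial eigenfunctions of degree $k$ satisfy $\|f\|_{L^q}^2 \leq (q-1)^k \|f\|_{L^2}^2$ for $q \geq 2$, which gives control of their higher moments.
The empirical kernel matrix concentrates around its expectation, and its leading eigenvectors align with the true kernel eigenfunctions with high probability, which makes the approximation stable.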
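
The hypercontractive moment bound can be checked exactly on a small hypercube. The sketch below enumerates all points of $\{-1,+1\}^d$ under the uniform measure, builds a random homogeneous polynomial of a given degree, and compares $\|f\|_{L^q}^2 / \|f\|_{L^2}^2$ with $(q-1)^{\text{degree}}$. The function names and the choices $d = 8$, degree 2 and $q \in \{4, 6\}$ are hypothetical and made only for illustration.

```python
import itertools

import numpy as np


def hypercube_points(d):
    """All 2^d points of the binary hypercube {-1, +1}^d, one per row."""
    return np.array(list(itertools.product([-1.0, 1.0], repeat=d)))


def random_degree_polynomial(X, degree, rng):
    """Values of a random homogeneous polynomial: sum over |S| = degree of c_S * prod_{i in S} x_i."""
    d = X.shape[1]
    values = np.zeros(len(X))
    for S in itertools.combinations(range(d), degree):
        values += rng.normal() * np.prod(X[:, list(S)], axis=1)
    return values


def lq_norm(values, q):
    """L^q norm under the uniform measure on the enumerated points."""
    return np.mean(np.abs(values) ** q) ** (1.0 / q)


def hypercontractivity_ratio(values, q):
    """Ratio ||f||_{L^q}^2 / ||f||_{L^2}^2, to be compared with (q - 1)^degree."""
    return (lq_norm(values, q) / lq_norm(values, 2)) ** 2


if __name__ == "__main__":
    degree = 2
    X = hypercube_points(8)
    f = random_degree_polynomial(X, degree, np.random.default_rng(0))
    for q in (4, 6):
        print(q, hypercontractivity_ratio(f, q), (q - 1) ** degree)
```

Because the norms are computed by full enumeration rather than sampling, the comparison involves no Monte Carlo error; the cost is that the approach only scales to small $d$.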

This review was generated by AI and reviewed by a human editor.