QUICK REVIEW

[Paper Review] More Data Can Hurt for Linear Regression: Sample-wise Double Descent

Preetum Nakkiran|arXiv (Cornell University)|Dec 16, 2019

Random Matrices and Applications | References: 12 | Cited by: 42

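ONE-SENTENCE SUMMARY

The paper analyzes overparameterized linear regression with isotropic Gaussian covariates and shows that test risk can be non-monotonic in the number of samples n, peaking near n = d, because of an unconventional bias-variance tradeoff: bias falls as samples are added while variance rises.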

ABSTRACT

In this expository note we describe a surprising phenomenon in overparameterized linear regression, where the dimension exceeds the number of samples: there is a regime where the test risk of the estimator found by gradient descent increases with additional samples. In other words, more data actually hurts the estimator. This behavior is implicit in a recent line of theoretical works analyzing "double-descent" phenomenon in linear models. In this note, we isolate and understand this behavior in an extremely simple setting: linear regression with isotropic Gaussian covariates. In particular, this occurs due to an unconventional type of bias-variance tradeoff in the overparameterized regime: the bias decreases with more samples, but variance increases.

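MOTIVATION AND OBJECTIVES

Build an understanding of non-monotonic test risk in overparameterized linear models.
Isolate the range of sample sizes in which adding data hurts performance.
Provide intuition and approximate bias-variance expressions that explain the phenomenon.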

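PROPOSED METHOD

Study the minimum-norm (ridgeless) least-squares estimator, which is the solution that gradient descent on the least-squares objective converges to when initialized at zero.
Decompose the expected excess risk into bias and variance components and derive approximate expressions B_n and V_n.
Analyze the conditioning of the data matrix X and its effect on the trace term Tr((XX^T)^{-1}).
Use isotropic Gaussian covariates x ~ N(0, I_d) with y = ⟨x, β⟩ + η, where η is noise of variance σ^2 and ||β||_2 ≤ 1.
State a closed-form approximation for the regime n ≤ d, and point to references for results in the underparameterized regime n > d.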
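The estimator described above can be sketched numerically. The snippet below is a minimal illustration, not the paper's code: it draws data from the stated model, computes the minimum-norm solution with the pseudoinverse, and checks that plain gradient descent started at zero lands on the same vector. The function names, the step size 1/L, the iteration count, and the values of n, d, and sigma are all assumptions chosen for illustration.

```python
import numpy as np

def min_norm_estimator(X, y):
    """Minimum-norm least-squares solution: beta_hat = pinv(X) @ y."""
    return np.linalg.pinv(X) @ y

def gradient_descent(X, y, steps=5000):
    """Gradient descent on 0.5 * ||X b - y||^2, initialized at zero."""
    # Step size 1/L, where L is the largest eigenvalue of X^T X (an assumed choice).
    lr = 1.0 / np.linalg.norm(X, 2) ** 2
    b = np.zeros(X.shape[1])
    for _ in range(steps):
        b -= lr * X.T @ (X @ b - y)
    return b

rng = np.random.default_rng(0)
n, d, sigma = 20, 50, 0.1            # illustrative overparameterized setting (n < d)
beta = rng.standard_normal(d)
beta /= np.linalg.norm(beta)         # normalize so that ||beta||_2 = 1
X = rng.standard_normal((n, d))      # rows are i.i.d. N(0, I_d) covariates
y = X @ beta + sigma * rng.standard_normal(n)

beta_pinv = min_norm_estimator(X, y)
beta_gd = gradient_descent(X, y)

print("max |pinv - GD| :", np.max(np.abs(beta_pinv - beta_gd)))
print("training residual:", np.linalg.norm(X @ beta_pinv - y))
```

Starting at zero matters here: every gradient step lies in the row space of X, so the iterates never pick up a component outside it, which is why the limit is the minimum-norm interpolant rather than some other interpolating solution.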

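EXPERIMENTAL RESULTS

Research Questions

RQ1: For a fixed dimension d, how does the test risk of the minimum-norm interpolating estimator change with the number of samples n?
RQ2: In the overparameterized regime (n ≤ d), what are the respective contributions of bias and variance to the excess risk?
RQ3: Why does the data matrix X become ill-conditioned in the critical regime n ≈ d, and how does that drive the variance up?
RQ4: How does adding one sample affect the trace term Tr((XX^T)^{-1}) and the overall risk?
RQ5: Do the theoretical approximations match empirical observations at finite d (for example, d = 1000)?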
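RQ3 and RQ4 can be probed numerically before turning to the findings. The sketch below is an illustration under assumed values (d = 200 and four hand-picked sample sizes), not an experiment from the paper. It uses a nested design, where each smaller data matrix consists of the first n rows of one fixed draw, so that growing n corresponds to literally adding samples.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 200                                   # illustrative dimension (assumption)
X_full = rng.standard_normal((190, d))    # nested design: the first n rows form X

conds, traces = {}, {}
for n in (50, 100, 150, 190):
    X = X_full[:n]
    gram = X @ X.T                        # n x n Gram matrix, invertible a.s. for n <= d
    conds[n] = np.linalg.cond(X)          # ratio of largest to smallest singular value
    traces[n] = np.trace(np.linalg.inv(gram))
    gamma = n / d
    print(f"n={n:3d}  cond(X)={conds[n]:7.2f}  "
          f"Tr((XX^T)^-1)={traces[n]:7.3f}  gamma/(1-gamma)={gamma / (1 - gamma):7.3f}")
```

With nested rows, each smaller Gram matrix is a principal submatrix of the next one, so by a Schur-complement argument the trace of the inverse can only grow as rows are added. The printed column gamma/(1-gamma) is the large-d limit quoted in this summary; at this small d the match should be expected to loosen as n approaches d.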

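Key Findings

Test risk is non-monotonic in n: it first decreases, then rises to a peak at n = d, and decreases again as n grows beyond d.
In the overparameterized regime, the bias B_n decreases with n, while the variance V_n increases and dominates near the critical point.
For γ = n/d < 1, the approximate expected excess risk is E[R̄(β̂)] ≈ (1 − γ)||β||^2 + σ^2 γ/(1 − γ).
The peak in risk is tied to the poor conditioning of X when n ≈ d, which inflates the norm of the noise term X^†η.
The trace term in the variance satisfies Tr((XX^T)^{-1}) → γ/(1 − γ) as d grows with n = γd, which explains the blow-up in variance as γ approaches 1.
Exact finite-sample expressions for the bias and variance are given for the regime n ≤ d.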
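The approximate risk formula above can be compared against a small Monte Carlo simulation. This is a sketch under assumed settings (d = 100, σ = 0.5, 200 trials, and helper names invented here), not a reproduction of the paper's d = 1000 experiment. It uses the fact that, for isotropic covariates, the excess risk of an estimator equals ||β̂ − β||^2.

```python
import numpy as np

def mean_excess_risk(n, d, sigma, trials, rng):
    """Monte Carlo estimate of E||beta_hat - beta||^2 for the min-norm estimator."""
    beta = np.zeros(d)
    beta[0] = 1.0   # any unit vector works: the Gaussian design is rotation invariant
    risks = []
    for _ in range(trials):
        X = rng.standard_normal((n, d))
        y = X @ beta + sigma * rng.standard_normal(n)
        beta_hat = np.linalg.pinv(X) @ y
        risks.append(np.sum((beta_hat - beta) ** 2))
    return float(np.mean(risks))

def approx_excess_risk(n, d, sigma, beta_norm_sq=1.0):
    """Approximation (1 - gamma) ||beta||^2 + sigma^2 gamma / (1 - gamma), gamma = n/d."""
    gamma = n / d
    return (1 - gamma) * beta_norm_sq + sigma**2 * gamma / (1 - gamma)

rng = np.random.default_rng(0)
d, sigma = 100, 0.5                     # illustrative values (assumptions)
sample_sizes = (20, 50, 80, 90)
empirical = {n: mean_excess_risk(n, d, sigma, 200, rng) for n in sample_sizes}
predicted = {n: approx_excess_risk(n, d, sigma) for n in sample_sizes}

for n in sample_sizes:
    print(f"n={n:3d}  empirical={empirical[n]:.3f}  approx={predicted[n]:.3f}")
```

In this setup the simulated risk should fall between n = 20 and n = 50 and then climb toward n = 90, which is the "more data hurts" regime. The approximation is asymptotic in d, so some gap at small d and at n close to d is to be expected.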


This review was generated by AI and reviewed by a human editor.