QUICK REVIEW

[Paper Review] The Theory Behind Overfitting, Cross Validation, Regularization, Bagging, and Boosting: Tutorial

Benyamin Ghojogh, Mark Crowley | arXiv (Cornell University) | May 28, 2019

Domain Adaptation and Few-Shot Learning | Cited by 149

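One-Sentence Summary

This tutorial explains the theory behind overfitting, cross validation, regularization, bagging, and boosting, using SURE and bias-variance analysis and covering both regression and classification.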

ABSTRACT

In this tutorial paper, we first define mean squared error, variance, covariance, and bias of both random variables and classification/predictor models. Then, we formulate the true and generalization errors of the model for both training and validation/test instances where we make use of the Stein's Unbiased Risk Estimator (SURE). We define overfitting, underfitting, and generalization using the obtained true and generalization errors. We introduce cross validation and two well-known examples which are $K$-fold and leave-one-out cross validations. We briefly introduce generalized cross validation and then move on to regularization where we use the SURE again. We work on both $\ell_2$ and $\ell_1$ norm regularizations. Then, we show that bootstrap aggregating (bagging) reduces the variance of estimation. Boosting, specifically AdaBoost, is introduced and it is explained as both an additive model and a maximum margin model, i.e., Support Vector Machine (SVM). The upper bound on the generalization error of boosting is also provided to show why boosting prevents from overfitting. As examples of regularization, the theory of ridge and lasso regressions, weight decay, noise injection to input/weights, and early stopping are explained. Random forest, dropout, histogram of oriented gradients, and single shot multi-box detector are explained as examples of bagging in machine learning and computer vision. Finally, boosting tree and SVM models are mentioned as examples of boosting.

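Research Motivation and Objectives

Define the mean squared error, variance, covariance, and bias of random variables and of models.
Use Stein's Unbiased Risk Estimator (SURE) to distinguish the true error from the generalization error.
Introduce cross validation (K-fold and leave-one-out) and generalized cross validation, then discuss regularization.
Explain regularization (ridge and lasso regression) and its effect on model complexity.
Describe bagging and boosting, including AdaBoost, and connect boosting to SVM and the maximum-margin idea.
Provide examples of regularization and ensemble methods in machine learning and computer vision.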
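
The first objective, relating mean squared error, variance, and bias, rests on a standard decomposition. The derivation below is a sketch in generic notation that is assumed here and not taken from the paper: $\hat{\theta}$ is an estimator of a fixed, non-random quantity $\theta$, and the expectation is over the randomness of $\hat{\theta}$.

```latex
\begin{align*}
\mathrm{MSE}(\hat{\theta})
  &= \mathbb{E}\big[(\hat{\theta} - \theta)^2\big] \\
  &= \mathbb{E}\big[\big((\hat{\theta} - \mathbb{E}[\hat{\theta}])
       + (\mathbb{E}[\hat{\theta}] - \theta)\big)^2\big] \\
  &= \mathbb{E}\big[(\hat{\theta} - \mathbb{E}[\hat{\theta}])^2\big]
     + (\mathbb{E}[\hat{\theta}] - \theta)^2
     + 2\,(\mathbb{E}[\hat{\theta}] - \theta)\,
       \underbrace{\mathbb{E}\big[\hat{\theta} - \mathbb{E}[\hat{\theta}]\big]}_{=\,0} \\
  &= \mathrm{Var}(\hat{\theta}) + \mathrm{Bias}(\hat{\theta})^2 ,
  \qquad \mathrm{Bias}(\hat{\theta}) := \mathbb{E}[\hat{\theta}] - \theta .
\end{align*}
```

The cross term vanishes because $\mathbb{E}[\hat{\theta}] - \theta$ is a constant and the centered estimator has zero mean.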

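Proposed Method

Use SURE to formulate the true and generalization errors for training and validation/test data.
Derive the relationships among bias, variance, and mean squared error for estimators, including ensemble models.
Give the procedures for K-fold and leave-one-out cross validation (LOOCV), including the definition of the training/test split and a warning against cheating with the test data.
Introduce generalized cross validation and regularization through SURE, for both ℓ2 and ℓ1 norms.
Explain the variance reduction achieved by bagging, and connect boosting to additive models and SVM concepts.
Discuss practical regularization techniques (ridge regression, lasso, weight decay, early stopping, noise injection) and ensemble methods (random forest, dropout, histogram of oriented gradients, single shot multi-box detector).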
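
To make the K-fold procedure and the ℓ2 regularization item concrete, here is a minimal sketch that uses K-fold cross validation to choose a ridge penalty. It is an illustration under assumptions, not the paper's code: the helper names `ridge_fit` and `kfold_mse`, the synthetic data, and the candidate penalties are all hypothetical. Each model is fit on the training folds only, so the held-out fold never influences training.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution: solve (X^T X + lam * I) w = X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def kfold_mse(X, y, lam, k=5, seed=0):
    """Average validation MSE of ridge regression over k folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))          # shuffle once, then split
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        w = ridge_fit(X[train], y[train], lam)   # training folds only
        errors.append(np.mean((X[val] @ w - y[val]) ** 2))
    return float(np.mean(errors))

# Hypothetical synthetic data: sparse linear signal plus Gaussian noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(60, 20))
w_true = np.zeros(20)
w_true[:3] = [2.0, -1.0, 0.5]
y = X @ w_true + rng.normal(scale=1.0, size=60)

lambdas = [0.0, 0.1, 1.0, 10.0, 100.0]     # assumed candidate penalties
cv_scores = {lam: kfold_mse(X, y, lam) for lam in lambdas}
best_lam = min(cv_scores, key=cv_scores.get)
print(cv_scores, best_lam)
```

Setting `k=len(y)` turns the same function into leave-one-out cross validation, since every fold then holds exactly one instance.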

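Experimental Results

Research Questions

RQ1: What are the true and generalization errors for training and test data, and how can SURE be used to estimate them without bias?
RQ2: How are bias, variance, and mean squared error related in regression and classification settings, including for ensemble methods?
RQ3: How do cross validation strategies (K-fold, LOOCV) help prevent overfitting and select model complexity?
RQ4: What roles do regularization, bagging, and boosting play in controlling model complexity and generalization?
RQ5: What practical explanations and bounds show why boosting can prevent overfitting?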
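
For RQ1, the textbook statement of Stein's result shows how a training error can be corrected into an unbiased estimate of the true error. The notation below is assumed for illustration and may differ from the paper's: observations are $y_i = f_i + \varepsilon_i$ with independent $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$, and $\hat{y} = g(y)$ is a weakly differentiable function of the training labels.

```latex
\begin{equation*}
\mathbb{E}\,\|\hat{y} - f\|_2^2
  \;=\; \mathbb{E}\,\|\hat{y} - y\|_2^2
  \;-\; n\sigma^2
  \;+\; 2\sigma^2 \sum_{i=1}^{n}
        \mathbb{E}\!\left[\frac{\partial \hat{y}_i}{\partial y_i}\right]
\end{equation*}
```

The left side is the true error on the training instances and the first term on the right is the empirical training error. The last term grows when each prediction $\hat{y}_i$ closely tracks its own label $y_i$, so it acts as a complexity measure: a model that fits the training labels too closely has a small training error but a large correction.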

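Key Findings

Definitions of, and relationships among, the mean squared error, variance, and bias are given for both random variables and models.
SURE provides a framework that connects the training error to the true error, which makes it possible to analyze overfitting and the effect of regularization.
K-fold and leave-one-out cross validation are formalized, with guidance on data splitting and on potential cheating scenarios.
Bagging is shown to reduce estimator variance, with examples including random forest, dropout, and, from computer vision, histogram of oriented gradients and the single shot multi-box detector.
Boosting is discussed as both an additive model and a maximum-margin (SVM-like) method, and an upper bound on its generalization error is given to show why it resists overfitting.
Regularization (ridge regression, lasso, weight decay, noise injection, early stopping) is examined within the same bias-variance framework.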
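
The bagging finding can be checked empirically with a small simulation. The sketch below is an illustration under assumptions and not the paper's derivation: it uses 1-nearest-neighbour regression as a deliberately high-variance base model, and the data-generating function, noise level, sample sizes, and helper names are all hypothetical. It compares the variance of a single model's prediction at one query point with the variance of the bagged prediction, across many independently drawn training sets.

```python
import numpy as np

def one_nn_predict(x_train, y_train, x0):
    """1-nearest-neighbour regression: label of the closest training point."""
    return y_train[np.argmin(np.abs(x_train - x0))]

def bagged_predict(x_train, y_train, x0, n_models, rng):
    """Average 1-NN predictions over bootstrap resamples of the training set."""
    n = len(x_train)
    preds = []
    for _ in range(n_models):
        boot = rng.integers(0, n, size=n)   # n indices drawn with replacement
        preds.append(one_nn_predict(x_train[boot], y_train[boot], x0))
    return np.mean(preds)

rng = np.random.default_rng(0)
x0 = 0.5                                    # fixed query point
single, bagged = [], []
for _ in range(300):                        # 300 independent training sets
    x = rng.uniform(0.0, 1.0, size=50)
    y = np.sin(2 * np.pi * x) + rng.normal(scale=0.5, size=50)
    single.append(one_nn_predict(x, y, x0))
    bagged.append(bagged_predict(x, y, x0, n_models=25, rng=rng))

var_single = np.var(single)
var_bagged = np.var(bagged)
print(f"single-model variance: {var_single:.3f}")
print(f"bagged-model variance: {var_bagged:.3f}")
```

The base model matters here: averaging helps most with unstable estimators such as 1-NN or deep trees, whereas a predictor that is already linear in the labels would show little change under bagging.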


This review was generated by AI and reviewed by human editors.