QUICK REVIEW

[Paper Review] Unknown Examples & Machine Learning Model Generalization

Yeounoh Chung, Peter J. Haas | arXiv (Cornell University) | Aug 24, 2018

Machine Learning and Data Classification | References: 34 | Cited by: 24

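ONE-SENTENCE SUMMARY

This paper proposes a method that improves the generalization of machine learning models by estimating and synthesizing the "unknown unknowns", i.e., the examples missing from the training data because of covariate shift or sampling bias. The method applies species estimation and data-driven feature modeling to training data drawn from multiple sources, and it improves model robustness and reduces generalization error without using test data during training.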

ABSTRACT

Over the past decades, researchers and ML practitioners have come up with better and better ways to build, understand and improve the quality of ML models, but mostly under the key assumption that the training data is distributed identically to the testing data. In many real-world applications, however, some potential training examples are unknown to the modeler, due to sample selection bias or, more generally, covariate shift, i.e., a distribution shift between the training and deployment stage. The resulting discrepancy between training and testing distributions leads to poor generalization performance of the ML model and hence biased predictions. We provide novel algorithms that estimate the number and properties of these unknown training examples---unknown unknowns. This information can then be used to correct the training set, prior to seeing any test data. The key idea is to combine species-estimation techniques with data-driven methods for estimating the feature values for the unknown unknowns. Experiments on a variety of ML models and datasets indicate that taking the unknown examples into account can yield a more robust ML model that generalizes better.

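RESEARCH MOTIVATION AND GOALS

Address the poor generalization caused by a mismatch between the training and test distributions, i.e., the effects of covariate shift and sampling bias.
Detect and model the training examples that are systematically missing because of biased data collection (the unknown unknowns).
Develop a method that improves model robustness and generalization without access to test data during training.
Provide a practical, data-driven way to correct the training distribution by exploiting overlapping data sources.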

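PROPOSED METHOD

Use species-estimation techniques to estimate the number of rare or missing data types (species) in the training data.
Apply data-driven methods to infer plausible feature values for the unknown unknowns from patterns in the observed data.
Generate realistic training examples for the missing data types using kernel density estimation (KDE) and a SMOTE-based approach.
Correct the training set before model training by adding these synthesized unknown examples.
Assume that the conditional class distribution p(y|x) is the same for the training and test data.
Require neither unlabeled test data nor knowledge of the true test distribution during training.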
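
The steps listed above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a bias-corrected Chao1 estimator as a simple stand-in for the species-estimation step (the paper's exact estimator may differ) and a product-Gaussian KDE with Scott's rule bandwidths for the synthesis step. The function names `chao1_unseen` and `sample_gaussian_kde`, the `sightings` list, and the feature values are all hypothetical. The sketch also samples from the density of the observed rows only; how the paper decides where the missing examples lie in feature space is not reproduced here.

```python
import random
import statistics
from collections import Counter


def chao1_unseen(observations):
    """Bias-corrected Chao1 estimate of the number of distinct items never observed."""
    times_seen = Counter(observations)            # item id -> number of sightings
    freq_of_freq = Counter(times_seen.values())   # k -> items seen exactly k times
    f1 = freq_of_freq.get(1, 0)                   # singletons
    f2 = freq_of_freq.get(2, 0)                   # doubletons
    return f1 * (f1 - 1) / (2 * (f2 + 1))


def sample_gaussian_kde(rows, n_samples, rng):
    """Draw rows from a product-Gaussian KDE fit to the observed rows."""
    n, d = len(rows), len(rows[0])
    # Scott's rule per dimension: h_j = sigma_j * n^(-1 / (d + 4)).
    factor = n ** (-1.0 / (d + 4))
    bandwidths = [statistics.stdev(col) * factor for col in zip(*rows)]
    samples = []
    for _ in range(n_samples):
        center = rng.choice(rows)                 # pick one kernel uniformly
        samples.append([rng.gauss(c, h) for c, h in zip(center, bandwidths)])
    return samples


# Hypothetical sightings: each entry is an item id reported by some data source.
sightings = ["a", "a", "a", "b", "b", "c", "d", "e", "f"]
n_unseen = round(chao1_unseen(sightings))

# Hypothetical (height_cm, weight_kg) features of the observed items.
observed = [
    [198.0, 95.0], [201.0, 100.0], [206.0, 109.0],
    [193.0, 88.0], [211.0, 115.0], [203.0, 104.0],
]
synthetic = sample_gaussian_kde(observed, n_unseen, random.Random(0))

# Corrected training set: observed rows plus synthesized unknown examples.
corrected_training_rows = observed + synthetic
```

Items seen only once or twice across overlapping sources drive the estimate: many singletons relative to doubletons suggest that a large share of the population has not been sampled yet.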

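EXPERIMENTAL RESULTS

Research Questions

RQ1 How can a machine learning model be made more robust to covariate shift when test data is unavailable during training?
RQ2 Which techniques can effectively estimate the feature values of missing training examples (unknown unknowns) without access to test data?
RQ3 Does generating synthetic data for the unknown unknowns improve the model's generalization performance?
RQ4 How do the different synthetic data generation methods (KDE and SMOTE) compare when handling covariate shift?
RQ5 Under what conditions does learning the unknown unknowns significantly improve model performance?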

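Key Findings

On the NBA player height-weight regression task, SynUnk (KDE) achieved the lowest generalization error (Ge) of all methods compared.
On the MovieLens dataset, the proposed method improved generalization, with a lower generalization error than the baseline methods.
The synthesized unknown examples did not degrade performance and in many cases improved it, even though no test data was used during training.
The method is robust to conservative estimates: when the unknown unknowns are not highly concentrated, the performance loss is minimal.
The results show that even a well-trained model can fail under covariate shift, which underscores the need to detect unknown unknowns proactively.
The method mitigates the bias caused by systematic under-representation of certain data types, without requiring knowledge of the target distribution.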
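
The finding that a well-trained model can still fail under covariate shift can be seen in a toy setting. The sketch below is an illustration only and does not use the paper's data or models. It assumes a hypothetical noise-free target y = x^2, a training sample restricted to x in [0, 1], and a deployment range of x in [0, 2]. The labeling rule is the same everywhere, so p(y|x) is unchanged and only the distribution of x shifts.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b in closed form."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x


def mse(xs, ys, a, b):
    """Mean squared error of the line a * x + b on the given points."""
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def target(x):
    # The same labeling rule applies everywhere, so p(y|x) does not change.
    return x * x


train_x = [i / 100 for i in range(101)]   # biased sample: only x in [0, 1]
test_x = [i / 100 for i in range(201)]    # deployment range: x in [0, 2]
train_y = [target(x) for x in train_x]
test_y = [target(x) for x in test_x]

a, b = fit_line(train_x, train_y)
train_error = mse(train_x, train_y, a, b)
test_error = mse(test_x, test_y, a, b)
print(f"train MSE = {train_error:.4f}, test MSE = {test_error:.4f}")
```

The line fits the observed range closely, yet its error grows on the region that the training sample never covered. This is the kind of gap that adding synthesized unknown examples is meant to narrow.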

This review was generated by AI and reviewed by a human editor.