QUICK REVIEW

[Paper Review] Generalization in Adaptive Data Analysis and Holdout Reuse

Cynthia Dwork, Vitaly Feldman|arXiv (Cornell University)|Jun 8, 2015

Privacy-Preserving Technologies in Data | References: 35 | Cited by: 101

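One-Sentence Summary

This paper presents Thresholdout, a practical algorithm for safely reusing a holdout set in adaptive data analysis. It draws on differential privacy and description-length arguments to prevent overfitting to the holdout set, and it gives provable generalization guarantees even when hypotheses are chosen adaptively. In a synthetic experiment it keeps performance estimates accurate where standard holdout reuse overfits.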

ABSTRACT

Overfitting is the bane of data analysts, even when data are plentiful. Formal approaches to understanding this problem focus on statistical inference and generalization of individual analysis procedures. Yet the practice of data analysis is an inherently interactive and adaptive process: new analyses and hypotheses are proposed after seeing the results of previous ones, parameters are tuned on the basis of obtained results, and datasets are shared and reused. An investigation of this gap has recently been initiated by the authors in (Dwork et al., 2014), where we focused on the problem of estimating expectations of adaptively chosen functions. In this paper, we give a simple and practical method for reusing a holdout (or testing) set to validate the accuracy of hypotheses produced by a learning algorithm operating on a training set. Reusing a holdout set adaptively multiple times can easily lead to overfitting to the holdout set itself. We give an algorithm that enables the validation of a large number of adaptively chosen hypotheses, while provably avoiding overfitting. We illustrate the advantages of our algorithm over the standard use of the holdout set via a simple synthetic experiment. We also formalize and address the general problem of data reuse in adaptive data analysis. We show how the differential-privacy based approach given in (Dwork et al., 2014) is applicable much more broadly to adaptive data analysis. We then show that a simple approach based on description length can also be used to give guarantees of statistical validity in adaptive settings. Finally, we demonstrate that these incomparable approaches can be unified via the notion of approximate max-information that we introduce.

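Motivation and Objectives

Address the overfitting that arises in adaptive data analysis when a holdout set is reused across many data-dependent queries.
Develop a practical method that lets researchers validate hypotheses against a single holdout set without compromising statistical validity.
Unify two distinct theoretical approaches, differential privacy and description length, in a common framework for generalization guarantees.
Formalize and address the broader problem of data reuse in adaptive data analysis, so that final outputs generalize to the underlying data distribution.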

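Proposed Method

Introduces the Thresholdout algorithm, which uses differential-privacy techniques (noise added to thresholds and answers) to estimate a hypothesis's accuracy on the holdout set while keeping each answer insensitive to the individual holdout examples, even under adaptive queries.
Uses a threshold test that compares a hypothesis's empirical accuracy on the training set with its accuracy on the holdout set: when the gap is below a preset, noised threshold, the training-set estimate is returned; otherwise a noised holdout estimate is returned.
Introduces approximate max-information as a unifying measure for analyzing and composing algorithms that come with different kinds of generalization guarantees.
Combines differential-privacy and description-length bounds, which give complementary but incomparable generalization guarantees in adaptive settings.
Uses the holdout set to validate model performance while ensuring that the validation process does not itself become a source of overfitting through adaptive reuse.
Follows a two-step validation flow: first, check whether the model's performance on the holdout set agrees with its performance on the training set; second, answer with the training-set estimate when they agree, and with a noised holdout estimate only when they do not.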
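
The threshold test described above can be sketched in a few lines of Python. This is a simplified illustration rather than the authors' reference implementation: the class name `Thresholdout`, the `query` method, the default values of `threshold`, `sigma`, and `budget`, and the choice to pass per-example query values as arrays are all assumptions made for this sketch. The Laplace noise scales (2σ for the threshold, 4σ for the comparison, σ for the answer) follow the commonly cited description of the algorithm.

```python
import numpy as np


class Thresholdout:
    """Simplified sketch of a Thresholdout-style holdout guard (illustrative only)."""

    def __init__(self, threshold=0.04, sigma=0.01, budget=100, seed=0):
        # threshold, sigma and budget defaults are assumed values for illustration.
        self.threshold = threshold
        self.sigma = sigma
        self.budget = budget  # number of times the holdout may be "consulted"
        self.rng = np.random.default_rng(seed)
        self.noisy_threshold = threshold + self.rng.laplace(scale=2 * sigma)

    def query(self, train_values, holdout_values):
        """Answer one query, given the per-example values of the query
        function on the training set and on the holdout set."""
        if self.budget < 1:
            return None  # budget exhausted: refuse to answer

        train_mean = float(np.mean(train_values))
        holdout_mean = float(np.mean(holdout_values))
        eta = self.rng.laplace(scale=4 * self.sigma)

        if abs(holdout_mean - train_mean) > self.noisy_threshold + eta:
            # The two sets disagree: spend budget, refresh the noisy
            # threshold, and release a noised holdout estimate.
            self.budget -= 1
            self.noisy_threshold = self.threshold + self.rng.laplace(
                scale=2 * self.sigma
            )
            return holdout_mean + self.rng.laplace(scale=self.sigma)

        # The two sets agree: the training estimate reveals nothing new
        # about the holdout set, so return it unchanged.
        return train_mean
```

In this sketch an analyst would call `query` once per candidate hypothesis, passing for example the 0/1 correctness indicators of a classifier on each set. The holdout set is only "spent" when the training estimate turns out to be misleading.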

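Experimental Results

Research Questions

RQ1: Can a single holdout set be safely reused many times in adaptive data analysis without causing overfitting?
RQ2: How can generalization guarantees be maintained when hypotheses are chosen adaptively based on earlier results?
RQ3: How are differential privacy and description length related as tools for ensuring statistical validity in adaptive analysis?
RQ4: Can different generalization techniques, such as differential privacy and description length, be composed while preserving their respective guarantees?
RQ5: In the setting of adaptive holdout reuse, is it possible to obtain generalization guarantees stronger than those provided by differential privacy?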
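
RQ3 and RQ4 are approached in the paper through approximate max-information, which the abstract names as the notion unifying the two techniques. For reference, the definition can be written as below. The notation is a reconstruction for this review: X × Y denotes a pair drawn independently from the marginal distributions of X and Y, S is the dataset, and A(S) is the output of the analysis.

```latex
% beta-approximate max-information between jointly distributed X and Y
I_\infty^{\beta}(X;Y) \;=\; \log_2
  \sup_{\mathcal{O}\,:\,\Pr[(X,Y)\in\mathcal{O}] > \beta}
  \frac{\Pr[(X,Y)\in\mathcal{O}] - \beta}{\Pr[X \times Y \in \mathcal{O}]}

% Consequence: if I_\infty^{\beta}(S; A(S)) \le k, then for every event \mathcal{O}
\Pr[(S, A(S)) \in \mathcal{O}] \;\le\; 2^{k}\,\Pr[S \times A(S) \in \mathcal{O}] + \beta
```

Read informally, a bound of k bits means that any "bad" event, such as the output overfitting the dataset, is at most 2^k times more likely than it would be if the output had been computed from an independent dataset, up to an additive β.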

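Key Findings

In the synthetic experiment, Thresholdout prevented overfitting to the holdout set and kept its estimates of classifier performance accurate even after many adaptive queries.
In the setting with no correlation between the variables and the labels, standard holdout reuse overfit markedly, whereas Thresholdout gave stable and accurate estimates of generalization error.
When some variables were correlated with the labels, Thresholdout still recovered the true signal without overfitting, which suggests the protection does not come at the cost of missing real patterns.
The accuracy reported by Thresholdout was very close to the true accuracy measured on a fresh, independent test set, indicating that the method did not overfit the holdout data.
The method lets analysts make further data-dependent decisions based on the validation estimates without compromising statistical validity.
The theoretical analysis shows that differential privacy and description length give incomparable but complementary generalization guarantees, and that the notion of approximate max-information makes it possible to combine them.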
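
The overfitting that the findings attribute to standard holdout reuse can be reproduced with a small simulation. The sketch below is loosely modeled on the no-correlation setting and is not the authors' experimental code: the function name `naive_holdout_reuse`, the sizes `n`, `d`, `k`, and the exact selection rule are assumptions chosen for illustration. Labels are independent of every feature, so any classifier has a true accuracy of about 50%.

```python
import numpy as np


def naive_holdout_reuse(n=2000, d=1000, k=100, seed=0):
    """Select features using both training and holdout data, then report
    accuracy on train, holdout, and a fresh sample (illustrative sketch)."""
    rng = np.random.default_rng(seed)

    def sample():
        # Features and labels are independent: there is no real signal.
        X = rng.standard_normal((n, d))
        y = rng.choice([-1.0, 1.0], size=n)
        return X, y

    (Xt, yt), (Xh, yh), (Xf, yf) = sample(), sample(), sample()

    corr_train = Xt.T @ yt / n
    corr_hold = Xh.T @ yh / n

    # Adaptive step: keep only features whose correlation sign agrees on
    # the training and holdout sets. This leaks holdout information.
    agree = np.sign(corr_train) == np.sign(corr_hold)
    score = np.where(agree, np.abs(corr_train), 0.0)
    top = np.argsort(score)[-k:]
    weights = np.sign(corr_train[top])

    def accuracy(X, y):
        # Simple sign-vote classifier over the selected features.
        return float(np.mean(np.sign(X[:, top] @ weights) == y))

    return accuracy(Xt, yt), accuracy(Xh, yh), accuracy(Xf, yf)


if __name__ == "__main__":
    train_acc, holdout_acc, fresh_acc = naive_holdout_reuse()
    print(f"train={train_acc:.3f} holdout={holdout_acc:.3f} fresh={fresh_acc:.3f}")
```

Because the holdout set took part in feature selection, its accuracy estimate is biased upward even though the fresh-sample accuracy stays near chance. Routing the holdout comparison through a noised threshold test instead of reading holdout statistics directly is the kind of change the paper's method makes to remove this bias.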

This review was generated by AI and reviewed by a human editor.