QUICK REVIEW

[Paper Review] Data thinning for convolution-closed distributions

Anna Neufeld, Ameer Dharamshi|arXiv (Cornell University)|Jan 18, 2023

Machine Learning and Algorithms | Cited by 14

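One-sentence summary

This paper proposes data thinning, a way to decompose a single observation from a convolution-closed distribution into two or more independent parts that sum to the original observation and each follow the same distribution as the original, up to a known scaling of a parameter. This enables train/test validation without traditional sample splitting.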

ABSTRACT

We propose data thinning, an approach for splitting an observation into two or more independent parts that sum to the original observation, and that follow the same distribution as the original observation, up to a (known) scaling of a parameter. This very general proposal is applicable to any convolution-closed distribution, a class that includes the Gaussian, Poisson, negative binomial, gamma, and binomial distributions, among others. Data thinning has a number of applications to model selection, evaluation, and inference. For instance, cross-validation via data thinning provides an attractive alternative to the usual approach of cross-validation via sample splitting, especially in settings in which the latter is not applicable. In simulations and in an application to single-cell RNA-sequencing data, we show that data thinning can be used to validate the results of unsupervised learning approaches, such as k-means clustering and principal components analysis, for which traditional sample splitting is unattractive or unavailable.

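Motivation and objectives

Establish the need for validation tools in settings where sample splitting is not feasible.
Define data thinning as a principled procedure for decomposing a single observation into independent parts that follow the same distribution as the original, up to a known scaling of a parameter.
Extend thinning to multiple folds and to a broad class of convolution-closed distributions.
Compare data thinning with sample splitting and data fission, and demonstrate applications to clustering, low-rank approximation, and single-cell RNA-sequencing analysis.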

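Proposed method

Define convolution-closed distributions and the linear expectation property.
Present Algorithm 1, which thins a single observation X from F_lambda into X^(1) and X^(2), where X^(1) ~ F_{epsilon lambda} and X^(2) ~ F_{(1-epsilon) lambda} are independent and sum to X.
Prove in Theorem 1 that thinning preserves the distributional form and yields independent parts; under the linear expectation property, E[X^(1)] = epsilon E[X] and E[X^(2)] = (1-epsilon) E[X].
Extend thinning to M folds via Algorithm 2 and Theorem 2, giving X^(m) ~ F_{epsilon_m lambda}, with the folds mutually independent and summing to X.
Discuss practical considerations, including unknown nuisance parameters, and summarize thinning details for common distributions in Tables 2 and 3.
Compare thinning with sample splitting and data fission, and outline the settings in which thinning has an advantage.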
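
The two-fold thinning step described above can be sketched for two familiar cases. This is a minimal illustration, not the paper's code: it assumes the standard recipes in which a Poisson count is thinned by a binomial draw, and a Gaussian observation with known variance is thinned by a conditional normal draw. The function names `thin_poisson` and `thin_gaussian`, the seed, and the parameter values are hypothetical choices made for this example.

```python
import numpy as np


def thin_poisson(x, eps, rng):
    """Thin Poisson counts: draw X1 | X ~ Binomial(X, eps), set X2 = X - X1.

    If X ~ Poisson(lam), then X1 ~ Poisson(eps * lam) and
    X2 ~ Poisson((1 - eps) * lam), and the two parts are independent.
    """
    x = np.asarray(x)
    x1 = rng.binomial(x, eps)
    return x1, x - x1


def thin_gaussian(x, eps, sigma2, rng):
    """Thin Gaussian data with known variance sigma2.

    Draw X1 | X ~ N(eps * X, eps * (1 - eps) * sigma2), set X2 = X - X1.
    If X ~ N(mu, sigma2), then X1 ~ N(eps * mu, eps * sigma2) and
    X2 ~ N((1 - eps) * mu, (1 - eps) * sigma2), independent of each other.
    """
    x = np.asarray(x, dtype=float)
    x1 = rng.normal(eps * x, np.sqrt(eps * (1 - eps) * sigma2))
    return x1, x - x1


rng = np.random.default_rng(0)  # illustrative seed
eps = 0.3                       # illustrative thinning fraction

# Poisson example: the parts sum to x and are (empirically) uncorrelated.
x = rng.poisson(7.0, size=100_000)
x1, x2 = thin_poisson(x, eps, rng)
print("Poisson means:", x1.mean(), x2.mean())
print("Poisson corr :", np.corrcoef(x1, x2)[0, 1])

# Gaussian example with mean 7 and variance 5.
g = rng.normal(7.0, np.sqrt(5.0), size=100_000)
g1, g2 = thin_gaussian(g, eps, 5.0, rng)
print("Gaussian variances:", g1.var(), g2.var())
print("Gaussian corr     :", np.corrcoef(g1, g2)[0, 1])
```

In both cases one part can play the role of a training set and the other a test set, even though only a single observation per unit is available.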

Figure 1: Left: We generate 100,000 realizations of $X\sim\mathrm{N}(7,5)$ . For 50 values of $\tilde{\sigma}^{2}$ , we thin $X$ into $X^{(1)}$ and $X^{(2)}$ using $\tilde{\sigma}^{2}$ instead of $\sigma^{2}=5$ . Center: We generate 100,000 realizations of $X\sim\mathrm{NB}(7,0.7)$ . For 50 values o
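
As a hedged aside on the Gaussian panel: suppose the Gaussian thinning recipe draws $X^{(1)}$ given $X$ from a normal distribution centered at $\epsilon X$, with the assumed variance $\tilde{\sigma}^{2}$ plugged in where $\sigma^{2}$ belongs, and sets $X^{(2)} = X - X^{(1)}$. Under that assumption (the symbols $\epsilon$ and $Z$ are introduced here for illustration), a short calculation shows how a wrong variance breaks independence:

```latex
\begin{align*}
X^{(1)} &= \epsilon X + Z, \qquad Z \sim \mathrm{N}\bigl(0,\ \epsilon(1-\epsilon)\tilde{\sigma}^{2}\bigr) \text{ independent of } X,\\
X^{(2)} &= X - X^{(1)} = (1-\epsilon)X - Z,\\
\operatorname{Cov}\bigl(X^{(1)}, X^{(2)}\bigr)
  &= \epsilon(1-\epsilon)\operatorname{Var}(X) - \operatorname{Var}(Z)
   = \epsilon(1-\epsilon)\bigl(\sigma^{2} - \tilde{\sigma}^{2}\bigr).
\end{align*}
```

The covariance is zero exactly when $\tilde{\sigma}^{2} = \sigma^{2}$; because the two parts are jointly Gaussian, zero covariance is equivalent to independence. Underestimating the variance leaves the parts positively correlated, and overestimating it leaves them negatively correlated.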

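Experimental results

Research questions

RQ1: Can a single observation from a convolution-closed distribution be decomposed into independent parts that reproduce the original distribution up to a scaling of a parameter?
RQ2: How can thinning be extended from two parts to multiple folds while preserving independence and the marginal distributions?
RQ3: In which settings does data thinning offer a practical alternative to sample splitting for model validation and inference?
RQ4: How do unknown nuisance parameters affect thinning, and how robust is thinning to this kind of misspecification?
RQ5: How well does thinning perform when validating clustering, low-rank matrix approximation, and single-cell RNA-sequencing analyses?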

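Key findings

Data thinning produces two independent components X^(1) and X^(2) such that X = X^(1) + X^(2), with X^(1) ~ F_{epsilon lambda} and X^(2) ~ F_{(1-epsilon) lambda}.
When the base distribution satisfies the linear expectation property, E[X^(1)] = epsilon E[X] and E[X^(2)] = (1-epsilon) E[X].
Multi-fold thinning generalizes to any M, giving X^(m) ~ F_{epsilon_m lambda}, with the folds mutually independent and summing to X.
The framework applies to a class of convolution-closed distributions broader than the Gaussian and Poisson, including the gamma, negative binomial, binomial, and multinomial families.
Thinning supports cross-validation-style evaluation without traditional sample splitting, with examples given in simulations and in validation on single-cell RNA-sequencing data.
The paper analyzes the effect of misspecified nuisance parameters on thinning and offers guidance on choosing epsilon, the parameter that allocates information across folds.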
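
The multi-fold result can be illustrated for Poisson counts. The sketch below is an assumption-laden example rather than the paper's Algorithm 2: it splits each count into M folds by repeated binomial draws, which is equivalent to a single multinomial draw with probabilities epsilon_1, ..., epsilon_M. The function name `multifold_thin_poisson`, the seed, and the fold weights are hypothetical.

```python
import numpy as np


def multifold_thin_poisson(x, eps, rng):
    """Split Poisson counts x into len(eps) folds that sum to x.

    Fold m is distributed as Poisson(eps[m] * lam) when x ~ Poisson(lam),
    and the folds are mutually independent. Implemented as sequential
    binomial draws, equivalent to one multinomial draw per count.
    Returns an array of shape (M, *x.shape).
    """
    eps = np.asarray(eps, dtype=float)
    if np.any(eps <= 0) or not np.isclose(eps.sum(), 1.0):
        raise ValueError("eps must be positive and sum to 1")

    remaining = np.asarray(x).copy()  # counts not yet assigned to a fold
    mass_left = 1.0                   # total eps not yet assigned
    folds = []
    for e in eps[:-1]:
        # Each remaining unit goes to this fold with probability e / mass_left.
        part = rng.binomial(remaining, min(1.0, e / mass_left))
        folds.append(part)
        remaining = remaining - part
        mass_left -= e
    folds.append(remaining)           # last fold takes whatever is left
    return np.stack(folds)


rng = np.random.default_rng(1)        # illustrative seed
weights = [0.2, 0.3, 0.5]             # illustrative fold weights
x = rng.poisson(10.0, size=50_000)
folds = multifold_thin_poisson(x, weights, rng)

print("fold means:", folds.mean(axis=1))
print("fold correlations:\n", np.corrcoef(folds))
```

For a cross-validation-style evaluation, one natural use is to fit a model on x minus fold m and evaluate it on fold m, cycling through the folds; equal weights 1/M give each fold the same share of information.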

Figure 2: Comparison of data thinning and sample splitting, using the detection and power metrics defined in Section 4.3 . The top row shows the results of the large $n$ setting where the observations are independent and identically distributed (iid), and thus data thinning and sample splitting achi


This review was generated by AI and reviewed by human editors.