QUICK REVIEW

[Paper Review] Estimating the unseen from multiple populations

Chiesa, Alessandro, Gur, Tom|arXiv (Cornell University)|Jul 12, 2017

Genomic variations and chromosomal abnormalities | References: 9 | Cited by: 5

One-Sentence Summary

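This paper proposes a framework for estimating unseen elements across multiple populations, generalizing the Good-Toulmin estimator to the multi-population setting; it introduces a weighted linear estimator whose accuracy is independent of the number of populations, together with a histogram-based optimization method that supports accurate extrapolation and budget-aware cohort design, improving discovery efficiency in genomics and other domains.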

ABSTRACT

Distribution testing is an area of property testing that studies algorithms that receive few samples from a probability distribution D and decide whether D has a certain property or is far (in total variation distance) from all distributions with that property. Most natural properties of distributions, however, require a large number of samples to test, which motivates the question of whether there are natural settings wherein fewer samples suffice. We initiate a study of proofs of proximity for properties of distributions. In their basic form, these proof systems consist of a tester that not only has sample access to a distribution but also explicit access to a proof string that depends on the distribution. We refer to these as NP distribution testers, or MA distribution testers if the tester is a probabilistic algorithm. We also study the more general notion of IP distribution testers, in which the tester interacts with an all-powerful untrusted prover. We investigate the power and limitations of proofs of proximity for distributions and chart a landscape that, surprisingly, is significantly different from that of proofs of proximity for functions. Our main results include showing that MA distribution testers can be quadratically stronger than standard distribution testers, but no stronger than that; in contrast, IP distribution testers can be exponentially stronger than standard distribution testers, but when restricted to public coins they can be at best quadratically stronger.

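Motivation and Objectives

Fill the gap in unseen-element estimation when the data come from multiple independent populations, each with its own distribution.
Develop a method for estimating the expected number of new elements that appear across all populations after additional individuals are sampled.
Allocate a sampling budget optimally across the populations so as to maximize the discovery of new elements.
Provide a general framework for estimating the joint frequency distribution across populations, in support of a range of statistical predictions.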

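Proposed Method

Propose a weighted linear estimator $\hat{U}^W$ for the expected total number of new elements that appear after multi-population extrapolation.
Prove that the accuracy of this estimator is independent of the number of populations $m$ and that it achieves the optimal super-linear extrapolation rate.
Introduce histogram-based estimation via constrained optimization (using $\hat{H}_{\text{count}}$ and $\hat{H}_{\text{ll}}$) to recover the joint frequency distribution across populations.
Formulate histogram estimation as a convex optimization problem that enforces consistency with the observed frequency counts while minimizing deviation from a uniform prior.
Use the estimated histogram to predict unseen statistics, such as the number of new elements that appear at least twice or at most three times.
Apply the histogram estimator to optimize sampling allocation under a fixed budget so as to maximize the expected number of newly discovered elements.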
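
The "observed frequency counts" that the histogram estimators are constrained to match can be pictured as a joint fingerprint: for every vector of per-population counts, the number of distinct elements that show exactly that vector. The sketch below computes this empirical quantity only. It is an illustration under that reading of the summary, the name `joint_fingerprint` is hypothetical, and it is not the convex program behind $\hat{H}_{\text{count}}$ or $\hat{H}_{\text{ll}}$.

```python
from collections import Counter


def joint_fingerprint(populations):
    """Empirical joint fingerprint of several samples.

    `populations` is a list of sample lists, one per population
    (hypothetical input format chosen for this illustration).
    Returns a Counter mapping each count vector (c_1, ..., c_m) to the
    number of distinct elements observed c_j times in population j.
    """
    # Per-population occurrence counts for each element.
    per_pop = [Counter(sample) for sample in populations]
    # Every element observed in at least one population.
    seen = set().union(*per_pop)
    # A Counter returns 0 for elements missing from a population.
    return Counter(tuple(c[x] for c in per_pop) for x in seen)


if __name__ == "__main__":
    pops = [["a", "a", "b"], ["a", "c", "c"]]
    for vector, n_elements in sorted(joint_fingerprint(pops).items()):
        print(vector, n_elements)
```

In this toy input, element `a` has count vector `(2, 1)`, `b` has `(1, 0)`, and `c` has `(0, 2)`, so each vector is held by exactly one element. Elements with the all-zero vector are the unseen ones, which is why they never appear in the empirical fingerprint and have to be estimated.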

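Experimental Results

Research Questions

RQ1: How can the Good-Toulmin estimator be generalized to a multi-population setting with different distributions in order to estimate unseen elements?
RQ2: What is the theoretical accuracy of such an estimator, and does it depend on the number of populations?
RQ3: Can the full joint frequency distribution across populations be estimated to support richer statistical predictions?
RQ4: Under a fixed sampling budget, how should sampling effort be allocated across populations to maximize the discovery of new elements?
RQ5: In high-extrapolation regimes, how much do histogram-based estimators improve on linear estimators?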
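
As background for RQ1, the classical single-population Good-Toulmin estimator can be sketched as follows. Given n samples with fingerprint $\Phi_i$ (the number of distinct elements seen exactly $i$ times), it estimates the number of new distinct elements in a further $t \cdot n$ samples as $-\sum_{i \ge 1} (-t)^i \Phi_i$. This is the textbook estimator, not the weighted estimator $\hat{U}^W$ described in this review; the function names are illustrative, and the sketch is restricted to $0 \le t \le 1$, where the alternating series is well behaved.

```python
from collections import Counter


def fingerprint(samples):
    """Map i -> number of distinct elements seen exactly i times."""
    occurrences = Counter(samples)
    return Counter(occurrences.values())


def good_toulmin(samples, t):
    """Classical Good-Toulmin estimate for one population.

    Estimates how many new distinct elements would appear in
    t * len(samples) additional draws, for 0 <= t <= 1.
    """
    if not 0 <= t <= 1:
        raise ValueError("this sketch only covers 0 <= t <= 1")
    phi = fingerprint(samples)
    # U_hat = -sum_{i >= 1} (-t)^i * Phi_i
    return -sum((-t) ** i * phi_i for i, phi_i in phi.items())


if __name__ == "__main__":
    samples = ["a", "a", "b", "c", "c", "c", "d"]
    print(good_toulmin(samples, 1))
    print(good_toulmin(samples, 0.5))
```

For the toy sample the fingerprint is $\Phi_1 = 2$, $\Phi_2 = 1$, $\Phi_3 = 1$, so doubling the sample size ($t = 1$) gives $2 - 1 + 1 = 2$ expected new elements. For $t > 1$ the terms $(-t)^i$ grow geometrically, which is the usual reason smoothed or weighted variants are used at larger extrapolation factors.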

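Key Findings

The proposed weighted linear estimator $\hat{U}^W$ achieves accuracy that is independent of the number of populations $m$ and is optimal in the worst case.
For extrapolation factors up to 10, the weighted linear estimator achieves a mean squared error of 0.08–0.09 under uniform, Dirichlet, and geometric distributions.
In low-sample regimes, the histogram estimators $\hat{H}_{\text{count}}$ and $\hat{H}_{\text{ll}}$ substantially outperform both the empirical histogram and the linear estimator.
On synthetic data, $\hat{H}_{\text{count}}$ and $\hat{H}_{\text{ll}}$ predict unseen elements with near-perfect accuracy under both balanced and skewed sampling distributions.
On real human genomic data, budget allocation guided by $\hat{H}_{\text{count}}$ increases the number of newly discovered variants by 10% relative to uniform or biased allocation.
The histogram-based method accurately predicts the number of new variants that appear at least twice in new samples, showing that it is useful beyond simple counts of unseen elements.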

This review was generated by AI and checked by human editors.