QUICK REVIEW

[Paper Review] Adaptivity and Computation-Statistics Tradeoffs for Kernel and Distance based High Dimensional Two Sample Testing

Aaditya Ramdas, Sashank J. Reddi|arXiv (Cornell University)|Aug 4, 2015
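Statistical Methods and Inference|References: 31|Cited by: 23
One-Sentence Summary

This paper establishes a theoretical connection between kernel-based (gMMD) and distance-based (eED) two-sample tests in the high-dimensional setting, showing that they asymptotically attain equal and optimal power under mean difference alternatives (MDA) while remaining consistent against general difference alternatives (GDA). It also identifies a smooth computation-statistics tradeoff, in which more computation yields higher power, and shows that the power of gMMD does not depend on the kernel bandwidth as long as the bandwidth is larger than the median-heuristic choice.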

ABSTRACT

Nonparametric two sample testing is a decision theoretic problem that involves identifying differences between two random variables without making parametric assumptions about their underlying distributions. We refer to the most common settings as mean difference alternatives (MDA), for testing differences only in first moments, and general difference alternatives (GDA), which is about testing for any difference in distributions. A large number of test statistics have been proposed for both these settings. This paper connects three classes of statistics - high dimensional variants of Hotelling's t-test, statistics based on Reproducing Kernel Hilbert Spaces, and energy statistics based on pairwise distances. We ask the question: how much statistical power do popular kernel and distance based tests for GDA have when the unknown distributions differ in their means, compared to specialized tests for MDA? We formally characterize the power of popular tests for GDA like the Maximum Mean Discrepancy with the Gaussian kernel (gMMD) and bandwidth-dependent variants of the Energy Distance with the Euclidean norm (eED) in the high-dimensional MDA regime. Some practically important properties include (a) eED and gMMD have asymptotically equal power; furthermore they enjoy a free lunch because, while they are additionally consistent for GDA, they also have the same power as specialized high-dimensional t-test variants for MDA. All these tests are asymptotically optimal (including matching constants) under MDA for spherical covariances, according to simple lower bounds, (b) The power of gMMD is independent of the kernel bandwidth, as long as it is larger than the choice made by the median heuristic, (c) There is a clear and smooth computation-statistics tradeoff for linear-time, subquadratic-time and quadratic-time versions of these tests, with more computation resulting in higher power.
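
For reference, the two testing problems and the standard population quantities behind gMMD and eED can be written as below. The symbols (P, Q, X, X', Y, Y', γ) are notation chosen here for illustration, and the kernel normalization shown is one common convention that may differ from the one used in the paper.

```latex
\begin{align*}
\text{MDA:}\quad & H_0:\ \mu_P = \mu_Q \quad\text{vs.}\quad H_1:\ \mu_P \neq \mu_Q,\\
\text{GDA:}\quad & H_0:\ P = Q \quad\text{vs.}\quad H_1:\ P \neq Q,\\[4pt]
\mathrm{MMD}^2_\gamma(P,Q) &= \mathbb{E}\,k_\gamma(X,X') + \mathbb{E}\,k_\gamma(Y,Y') - 2\,\mathbb{E}\,k_\gamma(X,Y),
\qquad k_\gamma(x,y) = \exp\!\left(-\frac{\|x-y\|_2^2}{2\gamma^2}\right),\\
\mathrm{ED}(P,Q) &= 2\,\mathbb{E}\|X-Y\|_2 - \mathbb{E}\|X-X'\|_2 - \mathbb{E}\|Y-Y'\|_2,
\end{align*}
```

Here X, X' are independent draws from P and Y, Y' are independent draws from Q. Both quantities are zero when P = Q, which is why tests built on them target GDA rather than MDA alone.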

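Motivation and Objectives

  • Understand how much power general-purpose kernel tests (gMMD) and distance-based tests (eED) have, relative to specialized high-dimensional t-tests, when the distributions differ only in their means (MDA).
  • Characterize the asymptotic behavior of gMMD and eED under high-dimensional MDA, in particular their power, variance, and dependence on the bandwidth.
  • Establish a smooth computation-statistics tradeoff across the linear-time, subquadratic-time, and quadratic-time variants of these tests.
  • Provide theoretical justification for the median heuristic used to choose the gMMD bandwidth.
  • Prove that, under spherical covariance, gMMD and eED are asymptotically optimal for MDA, with limiting power that matches the lower bound, constants included.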

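Proposed Methods

  • Use U-statistic theory and Hermite polynomial expansions to derive the asymptotic distributions of the test statistics under the null and alternative hypotheses.
  • Apply Taylor expansions to the Gaussian kernel and to the modified Euclidean distance in order to approximate higher-order moments under MDA.
  • Characterize the mean and variance of the test statistics using trace asymptotics and moment bounds for quadratic forms in Gaussian vectors.
  • Show that, when the signal-to-noise ratio is high, the variance of gMMD and eED is O(1/n) under the alternative, versus O(1/n²) under the null, consistent with the behavior of degenerate U-statistics in high dimensions.
  • Compare the asymptotic power of gMMD and eED with that of high-dimensional t-tests under MDA, showing that the limiting power agrees exactly, constants included.
  • Prove that the power of gMMD is insensitive to the kernel bandwidth as long as the bandwidth exceeds the value chosen by the median heuristic.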
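
To make the objects in this list concrete, here is a minimal NumPy sketch of the quadratic-time statistics being analyzed: an unbiased U-statistic estimate of the Gaussian-kernel MMD², an analogous estimate of the Euclidean energy distance, and a median-heuristic bandwidth. The function names, the kernel scaling `exp(-d² / (2·bandwidth²))`, and the pooled-sample median are assumptions made for illustration; the paper's exact normalizations and its bandwidth-dependent eED variants may differ.

```python
import numpy as np

def pairwise_sq_dists(a, b):
    """Squared Euclidean distances between the rows of a and the rows of b."""
    sq = (a ** 2).sum(1)[:, None] + (b ** 2).sum(1)[None, :] - 2.0 * a @ b.T
    return np.maximum(sq, 0.0)  # clip tiny negatives caused by round-off

def median_heuristic(x, y):
    """Median pairwise distance over the pooled sample (illustrative convention)."""
    z = np.vstack([x, y])
    d2 = pairwise_sq_dists(z, z)
    upper = np.triu_indices_from(d2, k=1)  # distinct pairs only
    return float(np.sqrt(np.median(d2[upper])))

def _offdiag_mean(m):
    """Mean of the off-diagonal entries of a square matrix."""
    n = m.shape[0]
    return (m.sum() - np.trace(m)) / (n * (n - 1))

def gmmd2_u(x, y, bandwidth):
    """Unbiased quadratic-time estimate of MMD^2 with a Gaussian kernel."""
    def gram(a, b):
        return np.exp(-pairwise_sq_dists(a, b) / (2.0 * bandwidth ** 2))
    return float(_offdiag_mean(gram(x, x)) + _offdiag_mean(gram(y, y))
                 - 2.0 * gram(x, y).mean())

def energy_distance_u(x, y):
    """Unbiased quadratic-time estimate of the Euclidean energy distance."""
    def dist(a, b):
        return np.sqrt(pairwise_sq_dists(a, b))
    return float(2.0 * dist(x, y).mean()
                 - _offdiag_mean(dist(x, x)) - _offdiag_mean(dist(y, y)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 200, 50
    x = rng.standard_normal((n, d))
    y = rng.standard_normal((n, d)) + 0.5  # mean shift only, i.e. an MDA
    bw = median_heuristic(x, y)
    print("bandwidth:", bw)
    print("gMMD^2 estimate:", gmmd2_u(x, y, bw))
    print("eED estimate:", energy_distance_u(x, y))
```

Both estimators cost O(n²d) time. In practice a rejection threshold would come from a permutation test or an asymptotic null approximation, which this sketch leaves out.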

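Results

Research Questions

  • RQ1: Under mean difference alternatives (MDA), how does the power of gMMD and eED, which are designed for general difference alternatives (GDA), compare with that of specialized high-dimensional t-tests?
  • RQ2: Is there a theoretical justification for using the median heuristic to choose the gMMD bandwidth?
  • RQ3: What is the relationship between computational cost and statistical power in kernel-based and distance-based two-sample tests?
  • RQ4: Are gMMD and eED asymptotically optimal for MDA under spherical covariance?
  • RQ5: How do the variance and the asymptotic distribution of gMMD and eED behave under high-dimensional MDA?

Key Findings

  • gMMD and eED have asymptotically equal power under MDA, and under the same conditions their power matches that of specialized high-dimensional t-tests.
  • The power of gMMD is independent of the kernel bandwidth as long as the bandwidth exceeds the value chosen by the median heuristic.
  • Under spherical covariance, gMMD and eED are asymptotically optimal for MDA: their limiting power matches the lower bound, constants included.
  • There is a smooth computation-statistics tradeoff: moving from linear-time through subquadratic-time to quadratic-time versions of the tests yields progressively higher power.
  • The variance of gMMD and eED is O(1/n) under the alternative and O(1/n²) under the null, consistent with the behavior of degenerate U-statistics in high dimensions.
  • The theoretical analysis confirms that gMMD and eED are adaptive to MDA: they remain consistent against GDA while attaining optimal power under MDA, with no reparameterization required.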
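
The computation-statistics tradeoff can be illustrated with a block-averaged MMD² estimate: split the samples into blocks of size B, compute a U-statistic within each block, and average. Block size 2 gives a linear-time statistic, block size n recovers the quadratic-time one, and intermediate sizes cost roughly O(nB). This is a sketch of one common way to interpolate between the two extremes and is not necessarily the paper's exact construction; the function names, the fixed bandwidth `sqrt(2d)`, and the simulation settings are assumptions chosen for illustration.

```python
import numpy as np

def gaussian_gram(a, b, bandwidth):
    """Gaussian kernel matrix between the rows of a and the rows of b."""
    sq = (a ** 2).sum(1)[:, None] + (b ** 2).sum(1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth ** 2))

def mmd2_u(x, y, bandwidth):
    """U-statistic estimate of MMD^2 for equal sample sizes n >= 2."""
    n = x.shape[0]
    kxy = gaussian_gram(x, y, bandwidth)
    # h[i, j] = k(x_i, x_j) + k(y_i, y_j) - k(x_i, y_j) - k(x_j, y_i)
    h = gaussian_gram(x, x, bandwidth) + gaussian_gram(y, y, bandwidth) - kxy - kxy.T
    np.fill_diagonal(h, 0.0)  # a U-statistic sums over i != j only
    return float(h.sum() / (n * (n - 1)))

def block_mmd2(x, y, bandwidth, block_size):
    """Average of within-block U-statistics; leftover samples are dropped."""
    n = (x.shape[0] // block_size) * block_size
    vals = [mmd2_u(x[s:s + block_size], y[s:s + block_size], bandwidth)
            for s in range(0, n, block_size)]
    return float(np.mean(vals))

def null_std(block_size, n=256, d=20, reps=30, seed=0):
    """Monte Carlo spread of the block statistic when P = Q = N(0, I)."""
    rng = np.random.default_rng(seed)
    bandwidth = np.sqrt(2.0 * d)  # hypothetical fixed bandwidth near the typical distance
    stats = []
    for _ in range(reps):
        x = rng.standard_normal((n, d))
        y = rng.standard_normal((n, d))
        stats.append(block_mmd2(x, y, bandwidth, block_size))
    return float(np.std(stats))

if __name__ == "__main__":
    for b in (2, 16, 256):
        print(f"block size {b:>3}: null std = {null_std(b):.5f}")
```

A smaller spread under the null means a smaller shift is needed before the statistic stands out from noise, which is the sense in which more computation buys more power in this sketch.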


This review was generated by AI and reviewed by human editors.