QUICK REVIEW

[Paper Review] Selective inference for k-means clustering

Yiqun T. Chen, Daniela Witten|PubMed|Mar 29, 2022

Single-cell and spatial transcriptomics | References: 30 | Cited by: 23

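One-Sentence Summary

This paper develops a finite-sample selective inference p-value for testing the difference in means between two clusters identified by k-means, guaranteeing selective Type I error control without data splitting.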

ABSTRACT

We consider the problem of testing for a difference in means between clusters of observations identified via k-means clustering. In this setting, classical hypothesis tests lead to an inflated Type I error rate. In recent work, Gao et al. (2022) considered a related problem in the context of hierarchical clustering. Unfortunately, their solution is highly-tailored to the context of hierarchical clustering, and thus cannot be applied in the setting of k-means clustering. In this paper, we propose a p-value that conditions on all of the intermediate clustering assignments in the k-means algorithm. We show that the p-value controls the selective Type I error for a test of the difference in means between a pair of clusters obtained using k-means clustering in finite samples, and can be efficiently computed. We apply our proposal on hand-written digits data and on single-cell RNA-sequencing data.

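Motivation and Objectives

Motivation: test for a difference in means between clusters that were themselves defined by data-driven clustering.
Address the inflated Type I error rate that arises when testing hypotheses derived from a clustering.
Develop a finite-sample selective inference framework for k-means clustering.
Provide an exact p-value computation that conditions on the clustering results.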
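
To make the Type I error inflation concrete, here is a minimal simulation sketch. It is not the paper's experiment: the 1-D data, the sample size, the known σ = 1, the extreme-point initialization, and all function names are assumptions chosen for illustration. Every dataset is drawn from a single Gaussian, so there are no true clusters, yet a naive z-test applied to the two k-means clusters should reject far more often than the nominal 5%.

```python
import math
import random


def two_means_1d(xs, max_iter=100):
    """Lloyd's algorithm for k = 2 on 1-D data; returns a 0/1 label per point."""
    c0, c1 = min(xs), max(xs)  # extreme-point init keeps both clusters non-empty
    labels = None
    for _ in range(max_iter):
        new_labels = [0 if abs(x - c0) <= abs(x - c1) else 1 for x in xs]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        g0 = [x for x, lab in zip(xs, labels) if lab == 0]
        g1 = [x for x, lab in zip(xs, labels) if lab == 1]
        c0, c1 = sum(g0) / len(g0), sum(g1) / len(g1)
    return labels


def naive_p_value(xs, labels, sigma=1.0):
    """Two-sided z-test for equal cluster means, ignoring how labels were chosen."""
    g0 = [x for x, lab in zip(xs, labels) if lab == 0]
    g1 = [x for x, lab in zip(xs, labels) if lab == 1]
    diff = sum(g0) / len(g0) - sum(g1) / len(g1)
    se = sigma * math.sqrt(1 / len(g0) + 1 / len(g1))
    return math.erfc(abs(diff) / se / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))


def naive_rejection_rate(n=30, reps=500, alpha=0.05, seed=0):
    """Fraction of null datasets on which the naive test rejects at level alpha."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]  # global null: one Gaussian
        labels = two_means_1d(xs)
        if naive_p_value(xs, labels) < alpha:
            rejections += 1
    return rejections / reps


rate = naive_rejection_rate()
print(f"naive rejection rate under the null: {rate:.2f} (nominal level 0.05)")
```

The test statistic is computed from the same data that chose the clusters, and k-means places the clusters so that their means are far apart. That reuse is what a selective p-value corrects for.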

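Proposed Method

Formalize the null hypothesis as H0: μ^T ν = 0, which states that there is no difference in means between two clusters estimated by k-means.
Develop a selective p-value, p_selective, that conditions on the entire sequence of cluster assignments produced by the k-means algorithm.
Show that p_selective equals the survival function of a scaled χ_q random variable truncated to a set S_T.
Extend the method to non-spherical covariance, via whitening or a known Σ, yielding the adjusted p-value p_{Σ,selective}.
Discuss handling an unknown variance parameter σ with a consistent estimator, along with the corresponding adjusted p-value.
Implement the method in the R package KmeansInference and provide reproducible code.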
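
As a reading aid, the objects named in the list above can be written out as follows. The summary itself only gives H0: μ^T ν = 0 and says that p_selective is the survival function of a scaled χ_q variable truncated to S_T. The explicit contrast vector ν, the scale σ‖ν‖₂, and the symbols x (the n × q data matrix), Ĉ₁ and Ĉ₂ (the two estimated clusters) are assumptions based on the usual setup for this kind of test, so treat this as a sketch and not as the paper's exact statement.

```latex
% Assumed contrast vector: difference of the two cluster averages
\nu_i = \frac{\mathbf{1}\{i \in \hat{C}_1\}}{|\hat{C}_1|}
      - \frac{\mathbf{1}\{i \in \hat{C}_2\}}{|\hat{C}_2|},
\qquad i = 1, \dots, n

% Null hypothesis: the two cluster means coincide
H_0 : \mu^{\top} \nu = 0

% Selective p-value: survival function of a scaled chi variable truncated to S_T
p_{\text{selective}}
  = \Pr\!\left( \phi \ge \lVert x^{\top} \nu \rVert_2 \,\middle|\, \phi \in S_T \right),
\qquad \phi \sim \sigma \lVert \nu \rVert_2 \cdot \chi_q
```

Here S_T stands for the set of values of φ for which k-means, rerun on correspondingly perturbed data, reproduces the same assignments at every iteration up to the final step T.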

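Experimental Results

Research Questions

RQ1: Can a finite-sample p-value based on selective inference be constructed for testing the difference in means between clusters obtained via k-means?
RQ2: Under H0, does conditioning on the entire k-means clustering path control the selective Type I error?
RQ3: How can the selective p-value be computed efficiently, and does it extend to non-spherical covariance structures and unknown variance?
RQ4: Is the method practical on real datasets (e.g., handwritten digits, single-cell RNA-seq) for valid inference after clustering?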

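Main Findings

A naive test that ignores the clustering step leads to an inflated Type I error.
The proposed p_selective controls the selective Type I error at level α.
The p-value can be computed as the survival function of a truncated, scaled χ_q random variable, which requires characterizing the set S_T.
The extension to non-spherical covariance is achieved via whitening or a known Σ, yielding the adjusted p-value p_{Σ,selective}.
An unknown σ can be handled with a consistent estimator, which yields asymptotic selective Type I error control.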
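
The truncated survival function mentioned in the findings can be sketched in a few lines once S_T is known. The code below is an illustration under stated assumptions, not the KmeansInference implementation: it takes S_T as a given list of disjoint intervals (characterizing that set for k-means is the paper's contribution and is not reproduced here), and the function names, the `scale` argument (standing in for σ‖ν‖₂), and the series-based χ CDF are all hypothetical choices.

```python
import math


def reg_lower_gamma(a, x):
    """Regularized lower incomplete gamma P(a, x) via its power series.

    Adequate for moderate x only; the partial sums overflow for very large x.
    """
    if x <= 0.0:
        return 0.0
    if math.isinf(x):
        return 1.0
    term = 1.0 / a
    total = term
    n = 1
    while term > total * 1e-16:
        term *= x / (a + n)
        total += term
        n += 1
    return min(1.0, total * math.exp(-x + a * math.log(x) - math.lgamma(a)))


def chi_cdf(t, q):
    """CDF of a chi random variable with q degrees of freedom, for t >= 0."""
    return reg_lower_gamma(q / 2.0, t * t / 2.0)


def truncated_chi_survival(stat, q, scale, intervals):
    """P(phi >= stat | phi in S) for phi ~ scale * chi_q.

    `intervals` is a list of disjoint (lo, hi) pairs whose union is S
    (an assumed input standing in for S_T); hi may be math.inf.
    """
    num = 0.0
    den = 0.0
    for lo, hi in intervals:
        upper = chi_cdf(hi / scale, q)
        den += upper - chi_cdf(lo / scale, q)
        if hi > stat:
            # only the part of the interval at or above the observed statistic
            num += upper - chi_cdf(max(lo, stat) / scale, q)
    return num / den


# Hypothetical inputs: observed statistic 1.0, q = 2 features, unit scale
p_untruncated = truncated_chi_survival(1.0, 2, 1.0, [(0.0, math.inf)])
p_truncated = truncated_chi_survival(1.0, 2, 1.0, [(0.5, 2.0)])
print(p_untruncated, p_truncated)
```

With S equal to the whole half-line the function reduces to the ordinary χ_q survival function, so the difference between the two printed values shows the effect of conditioning on the selection event. Subtracting CDF values loses precision far in the tails, so a production implementation would need more careful numerics.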

This review was generated by AI and reviewed by a human editor.