QUICK REVIEW

[Paper Review] Statistical power for cluster analysis

E. S. Dalmaijer, C. L. Nord | arXiv (Cornell University) | Mar 1, 2020
Statistical Methods and Bayesian Inference | Cited by 26
One-sentence summary

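This paper proposes a simulation-based framework for estimating the statistical power of cluster analysis and assesses how subgroup size, separation (effect size), and covariance structure affect the power of common algorithms; it finds that N=20 to 30 per subgroup gives sufficient power when the effect size is large (Δ=4) or when many small effects accumulate across features, and that fuzzy clustering or finite mixture models outperform k-means on overlapping multivariate normal distributions.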
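To make "many small effects accumulate across features" concrete, here is a minimal sketch. It assumes that Δ is the Euclidean distance between two subgroup centroids and that features are uncorrelated with unit variance; the summary does not spell out the definition, so both are assumptions. The helper name `centroid_separation` is hypothetical.

```python
import numpy as np


def centroid_separation(per_feature_deltas):
    """Euclidean distance between two centroids.

    Assumes each entry is a mean difference on one feature, expressed in
    standard-deviation units, and that features are uncorrelated.
    """
    return float(np.linalg.norm(np.asarray(per_feature_deltas, dtype=float)))


# One feature carrying the whole difference.
one_large = centroid_separation([4.0])

# Sixteen features, each differing by only one standard deviation.
many_small = centroid_separation([1.0] * 16)

print(f"one large effect:   {one_large:.2f}")
print(f"many small effects: {many_small:.2f}")
```

Under these assumptions both configurations give the same centroid distance, because p features that each differ by δ yield a separation of δ·√p.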

ABSTRACT

Cluster algorithms are increasingly popular in biomedical research due to their compelling ability to identify discrete subgroups in data, and their increasing accessibility in mainstream software. While guidelines exist for algorithm selection and outcome evaluation, there are no firmly established ways of computing a priori statistical power for cluster analysis. Here, we estimated power and accuracy for common analysis pipelines through simulation. We varied subgroup size, number, separation (effect size), and covariance structure. We then subjected generated datasets to dimensionality reduction (none, multidimensional scaling, or UMAP) and cluster algorithms (k-means, agglomerative hierarchical clustering with Ward or average linkage and Euclidean or cosine distance, HDBSCAN). Finally, we compared the statistical power of discrete (k-means), "fuzzy" (c-means), and finite mixture modelling approaches (which include latent profile and latent class analysis). We found that outcomes were driven by large effect sizes or the accumulation of many smaller effects across features, and were unaffected by differences in covariance structure. Sufficient statistical power was achieved with relatively small samples (N=20 per subgroup), provided cluster separation is large (Δ=4). Fuzzy clustering provided a more parsimonious and powerful alternative for identifying separable multivariate normal distributions, particularly those with slightly lower centroid separation (Δ=3). Overall, we recommend that researchers 1) only apply cluster analysis when large subgroup separation is expected, 2) aim for sample sizes of N=20 to N=30 per expected subgroup, 3) use multidimensional scaling to improve cluster separation, and 4) use fuzzy clustering or finite mixture modelling approaches that are more powerful and more parsimonious with partially overlapping multivariate normal distributions.

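Motivation and objectives

  • Address the lack of an established a priori power analysis method for cluster analysis in biomedical research.
  • Assess how subgroup size, number of subgroups, effect size (separation), and covariance structure affect statistical power.
  • Compare discrete clustering (k-means), fuzzy clustering (c-means), and finite mixture modelling (latent class/profile analysis).
  • Provide evidence-based recommendations for sample size and algorithm choice in cluster analysis.
  • Evaluate whether dimensionality reduction (MDS, UMAP) improves cluster separation and power.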

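Proposed method

  • Generate simulated multivariate normal datasets with controlled subgroup size, separation (Δ), and covariance structure.
  • Apply three dimensionality reduction settings: none, multidimensional scaling (MDS), and UMAP.
  • Evaluate the clustering algorithms: k-means, agglomerative hierarchical clustering (Ward or average linkage, with Euclidean or cosine distance), and HDBSCAN.
  • Extend the analysis to fuzzy c-means and finite mixture models (latent profile analysis and latent class analysis).
  • Define statistical power as the proportion of simulations in which the true number of subgroups is correctly identified.
  • Assess clustering accuracy with receiver operating characteristic (ROC) curves and the adjusted Rand index (ARI).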
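The simulation loop described above can be sketched as follows. This is a minimal illustration using scikit-learn, not the authors' pipeline. It makes several assumptions: two subgroups with uncorrelated unit-variance features, all separation placed on the first feature, and the number of clusters chosen by the highest silhouette score among k = 2, 3, 4. The function names `simulate_two_groups` and `estimate_power` are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, silhouette_score


def simulate_two_groups(n_per_group, delta, n_features, rng):
    """Draw two unit-variance normal subgroups whose centroids are `delta` apart."""
    shift = np.zeros(n_features)
    shift[0] = delta  # assumption: the whole separation sits on the first feature
    a = rng.standard_normal((n_per_group, n_features))
    b = rng.standard_normal((n_per_group, n_features)) + shift
    return np.vstack([a, b]), np.repeat([0, 1], n_per_group)


def estimate_power(n_per_group=20, delta=4.0, n_features=2, n_sims=50, seed=0):
    """Return (power, mean ARI) over `n_sims` simulated datasets.

    Power here is the share of simulations in which the silhouette score
    picks the true number of subgroups (2). The silhouette score is undefined
    for k=1, so "no subgroups" is not a candidate in this sketch.
    """
    rng = np.random.default_rng(seed)
    hits, aris = 0, []
    for _ in range(n_sims):
        X, y = simulate_two_groups(n_per_group, delta, n_features, rng)
        fits = {k: KMeans(n_clusters=k, n_init=5, random_state=0).fit(X)
                for k in (2, 3, 4)}
        best_k = max(fits, key=lambda k: silhouette_score(X, fits[k].labels_))
        hits += int(best_k == 2)
        # Accuracy of the k=2 solution against the known subgroup labels.
        aris.append(adjusted_rand_score(y, fits[2].labels_))
    return hits / n_sims, float(np.mean(aris))


power_hi, ari_hi = estimate_power(delta=4.0)
power_lo, ari_lo = estimate_power(delta=1.0)
print(f"delta=4: power={power_hi:.2f}, mean ARI={ari_hi:.2f}")
print(f"delta=1: power={power_lo:.2f}, mean ARI={ari_lo:.2f}")
```

Varying `n_per_group`, `delta`, and `n_features` in a grid gives a rough power curve for a planned study. Because k=1 cannot be selected here, the low-separation power value should be read with caution.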

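Experimental results

Research questions

  • RQ1: What sample size is needed to reach sufficient statistical power in cluster analysis of biomedical data?
  • RQ2: How strongly does cluster separation (effect size Δ) affect the ability to detect true subgroups?
  • RQ3: How do different clustering approaches (k-means, c-means, finite mixture models) compare in power and accuracy?
  • RQ4: To what extent does dimensionality reduction (MDS or UMAP) improve cluster detection power?
  • RQ5: How do different covariance structures affect the performance of cluster analysis?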

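Key findings

  • Statistical power is driven mainly by large effect sizes (Δ=4) or by the accumulation of many smaller effects across features.
  • With large cluster separation (Δ=4), N=20 per subgroup is enough to reach sufficient power.
  • For multivariate normal distributions with moderate separation (Δ=3), fuzzy clustering (c-means) outperforms k-means in power and parsimony.
  • Finite mixture modelling (latent profile analysis and latent class analysis) is more powerful and more efficient than k-means when distributions partially overlap.
  • Covariance structure had no notable effect on clustering power or accuracy in any simulated condition.
  • Dimensionality reduction with MDS improves cluster separation and increases power, especially when combined with fuzzy or mixture model approaches.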
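The contrast between hard and soft assignment at moderate separation can be sketched as below. scikit-learn does not ship fuzzy c-means, so a Gaussian mixture model stands in for the soft approaches here; that substitution, the tied covariance, the BIC-based choice of component count, and the sample size of 100 per subgroup (larger than the paper's recommendation, chosen to keep the demo stable) are all assumptions of this sketch.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
n_per_group, delta = 100, 3.0  # illustrative values, not taken from the paper

# Two partially overlapping unit-variance normal subgroups in two dimensions.
X = np.vstack([
    rng.standard_normal((n_per_group, 2)),
    rng.standard_normal((n_per_group, 2)) + [delta, 0.0],
])
y = np.repeat([0, 1], n_per_group)

# Fit mixtures with 1 to 4 components; unlike the silhouette score,
# BIC (lower is better) can also favour a single component.
models = {
    k: GaussianMixture(n_components=k, covariance_type="tied",
                       n_init=5, random_state=0).fit(X)
    for k in range(1, 5)
}
best_k = min(models, key=lambda k: models[k].bic(X))

# Soft membership: one probability per component, each row sums to 1.
gmm = models[2]
membership = gmm.predict_proba(X)

ari_gmm = adjusted_rand_score(y, gmm.predict(X))
ari_kmeans = adjusted_rand_score(
    y, KMeans(n_clusters=2, n_init=5, random_state=0).fit_predict(X)
)
print(f"components chosen by BIC: {best_k}")
print(f"ARI mixture={ari_gmm:.2f}, ARI k-means={ari_kmeans:.2f}")
```

Observations near the boundary receive membership probabilities close to 0.5 instead of a forced label, which is the practical advantage of soft methods on overlapping subgroups.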


This review was generated by AI and checked by a human editor.