QUICK REVIEW

[Paper Review] Adjusting for Chance Clustering Comparison Measures

Simone Romano, Nguyễn Xuân Vinh|arXiv (Cornell University)|Dec 3, 2015

Statistical Mechanics and Entropy | References: 39 | Cited by: 135

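One-Sentence Summary

This paper proposes a unified framework for generalized information-theoretic measures based on the Tsallis entropy and uses it to adjust clustering comparison measures for chance. The framework yields analytical expressions for the expected value and variance of these measures, which produces generalized adjusted indices that include well-known measures such as ARI and AMI as special cases, and it supports evidence-based guidelines for choosing between them according to the structure of the clusterings.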

ABSTRACT

Adjusted for chance measures are widely used to compare partitions/clusterings of the same data set. In particular, the Adjusted Rand Index (ARI) based on pair-counting, and the Adjusted Mutual Information (AMI) based on Shannon information theory are very popular in the clustering community. Nonetheless it is an open problem as to what are the best application scenarios for each measure and guidelines in the literature for their usage are sparse, with the result that users often resort to using both. Generalized Information Theoretic (IT) measures based on the Tsallis entropy have been shown to link pair-counting and Shannon IT measures. In this paper, we aim to bridge the gap between adjustment of measures based on pair-counting and measures based on information theory. We solve the key technical challenge of analytically computing the expected value and variance of generalized IT measures. This allows us to propose adjustments of generalized IT measures, which reduce to well known adjusted clustering comparison measures as special cases. Using the theory of generalized IT measures, we are able to propose the following guidelines for using ARI and AMI as external validation indices: ARI should be used when the reference clustering has large equal sized clusters; AMI should be used when the reference clustering is unbalanced and there exist small clusters.

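Motivation and Objectives

Bridge the gap between chance adjustment for pair-counting measures (such as ARI) and chance adjustment for information-theoretic measures (such as AMI).
Solve the technical challenge of analytically computing the expected value and variance of generalized information-theoretic measures under random clusterings.
Develop a family of generalized adjusted measures that reduce to known indices (such as ARI and AMI) as special cases.
Provide data-driven guidelines, based on the structure of the reference clustering, for choosing between ARI and AMI.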
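
To make the "special cases" in these objectives concrete, the definitions involved can be sketched as follows. The notation is an assumption of this note, not something the summary states: $U$ and $V$ are two clusterings of $N$ objects, $a_i$ and $b_j$ are their cluster sizes, and $H_q(U,V)$ is the joint entropy computed from the contingency counts $n_{ij}$. The normalizer $\tfrac{1}{2}\bigl(H_q(U)+H_q(V)\bigr)$ is likewise assumed here.

$$
\begin{aligned}
H_q(U) &= \frac{1}{q-1}\Bigl(1 - \sum_i \bigl(\tfrac{a_i}{N}\bigr)^q\Bigr),
\qquad \lim_{q \to 1} H_q(U) = -\sum_i \tfrac{a_i}{N}\ln\tfrac{a_i}{N}, \\
MI_q(U,V) &= H_q(U) + H_q(V) - H_q(U,V), \\
AMI_q(U,V) &= \frac{MI_q(U,V) - E[MI_q(U,V)]}{\tfrac{1}{2}\bigl(H_q(U) + H_q(V)\bigr) - E[MI_q(U,V)]}.
\end{aligned}
$$

At $q = 2$ the entropy becomes $1 - \sum_i a_i^2 / N^2$, and because $\sum_i \binom{a_i}{2} = \bigl(\sum_i a_i^2 - N\bigr)/2$, every term is an affine function of pair counts, which is how a pair-counting index such as ARI can arise. As $q \to 1$ the Shannon entropy is recovered, which corresponds to AMI.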

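Proposed Method

Introduces a general family of measures $\mathcal{L}_{\phi}$ that contains the generalized information-theoretic measures based on the Tsallis $q$-entropy as special cases.
Derives analytical expressions for the expected value and variance of the measures in $\mathcal{L}_{\phi}$ under the null hypothesis of random, independent clusterings.
Proposes standardized adjustments via z-scores (such as SMI$_q$ and SVI$_q$) to correct baseline bias and selection bias in clustering comparisons.
Bounds the variance from above using a Taylor expansion and the Cauchy-Schwarz inequality, and shows that it converges to zero asymptotically as the sample size grows.
Defines a broader family $\mathcal{N}_{\phi}$ whose measures admit an approximation of their asymptotic expected value when the number of objects is large.
Applies Cantelli's inequality to derive conservative p-values for the adjusted measures, which enables tests of statistical significance.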
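
The analytical-expectation step listed above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the permutation null model in which each contingency cell $n_{ij}$ is hypergeometric given the cluster sizes, it assumes $MI_q = H_q(U) + H_q(V) - H_q(U,V)$ with the normalizer $\tfrac{1}{2}(H_q(U)+H_q(V))$, and all function names (`g`, `entropy`, `expected_joint_entropy`, `ami_q`, `ari`) are hypothetical.

```python
from collections import Counter
from math import comb, log


def g(p, q):
    """Per-cell contribution to the Tsallis q-entropy: H_q = sum of g(p_i, q). Assumes q > 0."""
    if p == 0.0:
        return 0.0
    if q == 1:
        return -p * log(p)  # Shannon limit (natural logarithm)
    return (p - p ** q) / (q - 1)


def entropy(counts, n_objects, q):
    """Tsallis q-entropy of a distribution given as raw counts."""
    return sum(g(c / n_objects, q) for c in counts)


def expected_joint_entropy(a_sizes, b_sizes, n_objects, q):
    """Exact E[H_q(U, V)] when every cell count is hypergeometric (fixed cluster sizes)."""
    total = 0.0
    for a in a_sizes:
        for b in b_sizes:
            lo, hi = max(0, a + b - n_objects), min(a, b)
            for n in range(lo, hi + 1):
                prob = comb(a, n) * comb(n_objects - a, b - n) / comb(n_objects, b)
                total += prob * g(n / n_objects, q)
    return total


def ami_q(u, v, q):
    """Chance-adjusted generalized mutual information (sketch; undefined if the denominator is 0)."""
    n_objects = len(u)
    a_sizes = list(Counter(u).values())
    b_sizes = list(Counter(v).values())
    cells = list(Counter(zip(u, v)).values())
    h_u = entropy(a_sizes, n_objects, q)
    h_v = entropy(b_sizes, n_objects, q)
    mi = h_u + h_v - entropy(cells, n_objects, q)
    e_mi = h_u + h_v - expected_joint_entropy(a_sizes, b_sizes, n_objects, q)
    return (mi - e_mi) / (0.5 * (h_u + h_v) - e_mi)


def ari(u, v):
    """Adjusted Rand Index from pair counts, used here only as a cross-check."""
    s_n = sum(comb(c, 2) for c in Counter(zip(u, v)).values())
    s_a = sum(comb(c, 2) for c in Counter(u).values())
    s_b = sum(comb(c, 2) for c in Counter(v).values())
    expected = s_a * s_b / comb(len(u), 2)
    return (s_n - expected) / (0.5 * (s_a + s_b) - expected)


# Illustrative labels (made up for this sketch).
u = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
v = [0, 0, 1, 1, 1, 2, 2, 2, 2, 0]
print(ami_q(u, v, 2), ari(u, v))  # the q = 2 case and the pair-counting ARI
print(ami_q(u, v, 1))             # the Shannon (AMI-like) case, natural log
```

Under these assumptions the $q = 2$ value matches the pair-counting ARI up to floating-point error, because the second factorial moment of the hypergeometric distribution gives $E\bigl[\binom{n_{ij}}{2}\bigr] = \binom{a_i}{2}\binom{b_j}{2}/\binom{N}{2}$. The triple loop is the simplest exact form and is meant for small examples only.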

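Experimental Results

Research Questions

RQ1: What are the best application scenarios for the Adjusted Rand Index (ARI) and the Adjusted Mutual Information (AMI)?
RQ2: Can a unified analytical framework be developed for adjusting both pair-counting and information-theoretic measures for chance?
RQ3: Can the expected value and variance of generalized information-theoretic measures be computed analytically under random clusterings?
RQ4: To what extent does standardization reduce selection bias in clustering comparison measures?
RQ5: Can generalized adjusted measures be derived that include ARI and AMI as special cases?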

Key Findings

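The expected value and variance of generalized information-theoretic measures based on the Tsallis $q$-entropy can be computed analytically under the null hypothesis of random, independent clusterings.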
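The proposed standardized measures (such as SMI$_q$ and SVI$_q$) are standard z-scores, and the generalized adjusted measures reduce to ARI at $q = 2$ and to AMI as $q \to 1$.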
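The framework makes statistical standardization of pair-counting measures possible for the first time, correcting selection bias in clustering comparisons.
AMI is recommended when the reference clustering is unbalanced and contains small clusters; ARI is recommended when the clusters are large and of equal size.
As the number of objects $N$ grows, the variance of the generalized measures tends to zero, which ensures the asymptotic stability of the adjusted indices.
Conservative p-values can be computed with Cantelli's inequality, providing a test for the statistical significance of clustering similarity.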
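
The standardization and p-value findings can be illustrated with a short Python sketch. Two assumptions are worth flagging: the paper computes the null mean and variance analytically, whereas this sketch substitutes a Monte Carlo permutation estimate to stay short, and the function names (`tsallis_entropy`, `mi_q`, `standardized_mi_q`, `cantelli_p_value`) are hypothetical. The p-value step uses the one-sided Cantelli bound $P(X - \mu \ge \lambda) \le \sigma^2 / (\sigma^2 + \lambda^2)$, which for a z-score $z > 0$ gives $p \le 1/(1 + z^2)$.

```python
import random
from collections import Counter
from math import log, sqrt


def tsallis_entropy(labels, q):
    """Tsallis q-entropy of a label sequence (Shannon entropy, natural log, at q = 1)."""
    n_objects = len(labels)
    probs = [c / n_objects for c in Counter(labels).values()]
    if q == 1:
        return -sum(p * log(p) for p in probs)
    return (1.0 - sum(p ** q for p in probs)) / (q - 1)


def mi_q(u, v, q):
    """Generalized mutual information, assumed here to be H_q(U) + H_q(V) - H_q(U, V)."""
    joint = list(zip(u, v))
    return tsallis_entropy(u, q) + tsallis_entropy(v, q) - tsallis_entropy(joint, q)


def standardized_mi_q(u, v, q, n_perm=2000, seed=0):
    """z-score of MI_q under a permutation null, estimated by Monte Carlo.

    This replaces the analytical mean and variance with sample estimates.
    It raises ZeroDivisionError if the null variance is zero (degenerate input).
    """
    rng = random.Random(seed)
    observed = mi_q(u, v, q)
    shuffled = list(v)
    samples = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # random relabeling that keeps cluster sizes fixed
        samples.append(mi_q(u, shuffled, q))
    mean = sum(samples) / n_perm
    var = sum((s - mean) ** 2 for s in samples) / (n_perm - 1)
    return (observed - mean) / sqrt(var)


def cantelli_p_value(z):
    """Conservative one-sided p-value bound from Cantelli's inequality."""
    return 1.0 if z <= 0 else 1.0 / (1.0 + z * z)


# Illustrative labels (made up): two identical, balanced clusterings of 20 objects.
u = [0] * 10 + [1] * 10
v = list(u)
z = standardized_mi_q(u, v, q=2)
print(z, cantelli_p_value(z))
```

The bound is distribution-free, so it is looser than a p-value obtained from an assumed null distribution; that looseness is what makes it conservative.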

This review was generated by AI and reviewed by human editors.