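[Paper Explained] From Subspaces to Metrics and Beyond: Toward Multi-Diversified Ensemble Clustering of High-Dimensional Data
This paper proposes a multi-diversified ensemble clustering framework that handles high-dimensional data by jointly exploiting diversity in similarity metrics and in subspaces. It randomizes a scaled exponential kernel to generate diversified metrics and pairs them with random subspaces to build a rich set of base clusterings. The authors report state-of-the-art performance on 30 high-dimensional datasets, including gene expression and image/speech data.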
The rapid emergence of high-dimensional data in various areas has brought new challenges to current ensemble clustering research. To deal with the curse of dimensionality, recently considerable efforts in ensemble clustering have been made by means of different subspace-based techniques. However, besides the emphasis on subspaces, rather limited attention has been paid to the potential diversity in similarity/dissimilarity metrics. It remains a surprisingly open problem in ensemble clustering how to create and aggregate a large population of diversified metrics, and furthermore, how to jointly investigate the multi-level diversity in the large populations of metrics, subspaces, and clusters in a unified framework. To tackle this problem, this paper proposes a novel multi-diversified ensemble clustering approach. In particular, we create a large number of diversified metrics by randomizing a scaled exponential similarity kernel, which are then coupled with random subspaces to form a large set of metric-subspace pairs. Based on the similarity matrices derived from these metric-subspace pairs, an ensemble of diversified base clusterings can thereby be constructed. Further, an entropy-based criterion is utilized to explore the cluster-wise diversity in ensembles, based on which three specific ensemble clustering algorithms are presented by incorporating three types of consensus functions. Extensive experiments are conducted on 30 high-dimensional datasets, including 18 cancer gene expression datasets and 12 image/speech datasets, which demonstrate the superiority of our algorithms over the state-of-the-art. The source code is available at this https URL.
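Research Motivation and Objectives
- Address the limited attention that existing ensemble clustering methods pay to metric diversity in high-dimensional data.
- Mitigate the curse of dimensionality by jointly exploiting diversity in metrics, subspaces, and clusters.
- Develop a unified framework for exploring the multi-level diversity across metrics, subspaces, and clusters.
- Improve clustering performance by combining an entropy-based diversity evaluation with consensus functions.
- Demonstrate the method's effectiveness on varied high-dimensional datasets, including cancer gene expression and image/speech data.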
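Proposed Method
- Randomize a scaled exponential similarity kernel to generate a large number of diversified similarity metrics.
- Pair each randomized metric with a randomly selected subspace to form a metric-subspace pair.
- Build a similarity matrix from each metric-subspace pair and derive a base clustering from it.
- Apply an entropy-based criterion to measure and exploit the diversity of each cluster in the ensemble.
- Integrate three types of consensus functions into the framework, yielding three ensemble clustering algorithms that produce the final clustering.
- Use the resulting ensemble to improve robustness and accuracy on high-dimensional data.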
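The first three steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the kernel follows one common form of the scaled exponential similarity, W(i, j) = exp(-d(i, j)^2 / (mu * eps_ij)), where eps_ij averages the two points' mean k-nearest-neighbour distances and d(i, j). The sampling ranges for `k` and `mu`, the `subspace_ratio` parameter, and the small spectral partitioner used to produce each base clustering are all hypothetical choices made for the example.

```python
import numpy as np


def scaled_exp_similarity(X, k, mu):
    """Scaled exponential kernel (assumed form, not taken from the source)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    np.fill_diagonal(d2, 0.0)
    d = np.sqrt(d2)
    # Mean distance from each point to its k nearest neighbours (self excluded).
    knn_mean = np.sort(d, axis=1)[:, 1:k + 1].mean(axis=1)
    # Local scale eps_ij: average of both neighbourhood scales and d(i, j).
    eps = (knn_mean[:, None] + knn_mean[None, :] + d) / 3.0
    return np.exp(-d2 / (mu * eps + 1e-12))


def _kmeans(Z, n_clusters, rng, n_iter=50):
    # Plain Lloyd iterations with random initial centres (kept short on purpose).
    centers = Z[rng.choice(len(Z), size=n_clusters, replace=False)]
    for _ in range(n_iter):
        dist = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for c in range(n_clusters):
            if np.any(labels == c):
                centers[c] = Z[labels == c].mean(axis=0)
    return labels


def spectral_partition(S, n_clusters, rng):
    # Stand-in base clusterer: top eigenvectors of D^-1/2 S D^-1/2, then k-means.
    d_inv_sqrt = 1.0 / np.sqrt(S.sum(axis=1))
    _, vecs = np.linalg.eigh(d_inv_sqrt[:, None] * S * d_inv_sqrt[None, :])
    Z = vecs[:, -n_clusters:]
    Z = Z / (np.linalg.norm(Z, axis=1, keepdims=True) + 1e-12)
    return _kmeans(Z, n_clusters, rng)


def generate_base_clusterings(X, n_clusters, n_base=10, subspace_ratio=0.5, seed=0):
    """One base clustering per random (metric, subspace) pair; shape (n_base, n)."""
    rng = np.random.default_rng(seed)
    n, dim = X.shape
    base = []
    for _ in range(n_base):
        # Random subspace: a random subset of the features.
        feats = rng.choice(dim, size=max(1, int(subspace_ratio * dim)), replace=False)
        # Random metric: kernel parameters drawn from hypothetical ranges.
        k = int(rng.integers(2, min(20, n - 1) + 1))
        mu = float(rng.uniform(0.2, 0.8))
        S = scaled_exp_similarity(X[:, feats], k, mu)
        base.append(spectral_partition(S, n_clusters, rng))
    return np.array(base)
```

Calling `generate_base_clusterings(X, n_clusters=2, n_base=20)` on an `(n, d)` array returns a `(20, n)` label matrix, one row per metric-subspace pair. Each row comes from a different feature subset and a different kernel setting, which is where the ensemble's diversity comes from in this sketch.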
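Experimental Results
Research Questions
- RQ1: How can diversified similarity metrics be generated effectively to strengthen ensemble clustering in high-dimensional spaces?
- RQ2: To what extent does the joint diversity of metrics and subspaces improve clustering performance?
- RQ3: Can an entropy-based diversity measure effectively guide the selection and combination of base clusterings?
- RQ4: How do different consensus functions perform when combined with multi-diversified metric-subspace pairs?
- RQ5: Does the proposed framework consistently outperform state-of-the-art methods across varied high-dimensional datasets?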
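Key Findings
- The proposed method is reported to outperform state-of-the-art approaches on 30 high-dimensional datasets: 18 cancer gene expression datasets and 12 image/speech datasets.
- Combining randomized metrics with random subspaces substantially increases ensemble diversity and clustering accuracy.
- The entropy-based criterion effectively captures and exploits cluster-level diversity within the ensemble.
- All three consensus functions, when combined with the proposed framework, consistently outperform the baseline methods.
- The method shows strong robustness and generalization across varied data types and high-dimensional settings.
- The source code is publicly available, supporting reproducibility and further research.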
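To make the entropy-based, cluster-level criterion more concrete, here is a small sketch of one way such a weighting could work. It is an assumption, not the formula from the paper, which this summary does not spell out. Each cluster receives an uncertainty score equal to the summed entropy of how its members are spread across the clusters of every base clustering. That score is mapped to a weight `exp(-H / (theta * M))`, where `theta` is a hypothetical smoothing parameter and `M` is the number of base clusterings. The weights then scale a co-association matrix, to which any consensus function could be applied.

```python
import numpy as np


def cluster_uncertainty_weights(base, theta=0.4):
    """Per base clustering, map each cluster label to a weight in (0, 1]."""
    M = base.shape[0]
    weights = []
    for m in range(M):
        w_m = {}
        for c in np.unique(base[m]):
            members = base[m] == c
            H = 0.0
            for other in range(M):
                # How this cluster's objects are distributed in clustering `other`.
                _, counts = np.unique(base[other][members], return_counts=True)
                p = counts / members.sum()
                H -= float(np.sum(p * np.log2(p)))
            # Clusters that the ensemble splits up get higher H, hence lower weight.
            w_m[int(c)] = float(np.exp(-H / (theta * M)))
        weights.append(w_m)
    return weights


def weighted_coassociation(base, weights):
    """Average of per-clustering co-membership indicators, scaled by cluster weight."""
    M, n = base.shape
    A = np.zeros((n, n))
    for m in range(M):
        w = np.array([weights[m][int(c)] for c in base[m]])
        same = base[m][:, None] == base[m][None, :]
        A += same * w[:, None]
    return A / M


# Toy ensemble: three base clusterings of four objects.
base = np.array([[0, 0, 1, 1],
                 [0, 0, 1, 1],
                 [0, 1, 1, 1]])
weights = cluster_uncertainty_weights(base)
A = weighted_coassociation(base, weights)
```

In the toy ensemble, objects 2 and 3 are grouped together by every base clustering, so their entry in `A` ends up larger than that of objects 0 and 1, which the third clustering separates.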
This summary was generated by AI and reviewed by human editors.