QUICK REVIEW

[Paper Review] Enhancing Stability and Assessing Uncertainty in Community Detection through a Consensus-based Approach

Fabio Morea, Domenico De Stefano|arXiv (Cornell University)|Aug 6, 2024

Data-Driven Disease Surveillance | Cited by 5

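One-Sentence Summary

This paper proposes Consensus Community Detection (CCD), a framework that improves stability, quantifies node-level uncertainty, detects outliers, and reduces input-order bias for any community detection algorithm.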

ABSTRACT

Complex data in social and natural sciences find effective representation through networks, wherein quantitative and categorical information can be associated with nodes and connecting edges. The internal structure of networks can be explored using unsupervised machine learning methods known as community detection algorithms. The process of community detection is inherently subject to uncertainty as algorithms utilize heuristic approaches and randomised procedures to explore vast solution spaces, resulting in non-deterministic outcomes and variability in detected communities across multiple runs. Moreover, many algorithms are not designed to identify outliers and may fail to take into account that a network is an unordered mathematical entity. The main aim of our work is to address these issues through a consensus-based approach by introducing a new framework called Consensus Community Detection (CCD). Our method can be applied to different community detection algorithms, allowing the quantification of uncertainty for the whole network as well as for each node, and providing three strategies for dealing with outliers: incorporate, highlight, or group. The effectiveness of our approach is evaluated on artificial benchmark networks.

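Research Motivation and Objectives

Motivate the need for stable, interpretable community detection results in the face of algorithmic randomness and ambiguity in the network.
Propose a general CCD framework that can be applied to any existing community detection algorithm to quantify uncertainty and improve reliability.
Address the key challenges: validity of results, variability across runs, handling of outliers, and input-order bias.
Provide a representation of results that expresses uncertainty at the node level and aids interpretation of the community structure.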

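Proposed Method

Run the target algorithm multiple times on permuted versions of the network to obtain a set of randomized partitions.
Prune partitions that deviate from the majority, based on a similarity score and a quantile threshold.
Build a co-occurrence matrix from the remaining partitions and recursively identify communities as blocks, with an uncertainty coefficient γ assigned.
Output a partition with community labels and node-level uncertainty γ in [0,1], where γ=0 indicates stable co-occurrence and higher γ indicates residual variability.
Introduce a quantile threshold q for selecting partitions and a threshold p for defining blocks in the co-occurrence matrix.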
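The steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the Rand index stands in for the unspecified similarity score, the quantile is a crude empirical one, blocks are formed greedily rather than recursively, and γ is taken to be one minus a node's mean co-occurrence with the other members of its block. The function names and the toy partitions are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def rand_index(a, b):
    """Share of node pairs on which two labelings agree (stand-in similarity)."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return Fraction(agree, len(pairs))


def prune(partitions, q):
    """Drop partitions whose mean similarity to the rest is below the q-quantile."""
    k = len(partitions)
    scores = [
        sum(rand_index(partitions[i], partitions[j]) for j in range(k) if j != i) / (k - 1)
        for i in range(k)
    ]
    threshold = sorted(scores)[int(q * (k - 1))]  # crude empirical quantile (assumption)
    return [part for part, s in zip(partitions, scores) if s >= threshold]


def cooccurrence(partitions):
    """C[i][j] = fraction of partitions in which nodes i and j share a community."""
    n, k = len(partitions[0]), len(partitions)
    return [
        [Fraction(sum(part[i] == part[j] for part in partitions), k) for j in range(n)]
        for i in range(n)
    ]


def consensus(C, p):
    """Greedy blocks: each unassigned seed collects unassigned nodes with C >= p."""
    labels = [None] * len(C)
    next_label = 0
    for seed in range(len(C)):
        if labels[seed] is None:
            for j in range(len(C)):
                if labels[j] is None and C[seed][j] >= p:
                    labels[j] = next_label
            next_label += 1
    return labels


def uncertainty(C, labels):
    """gamma_i = 1 - mean co-occurrence of node i with the rest of its block."""
    gamma = []
    for i, lab in enumerate(labels):
        peers = [j for j, other in enumerate(labels) if other == lab and j != i]
        if peers:
            gamma.append(float(1 - sum(C[i][j] for j in peers) / len(peers)))
        else:
            gamma.append(1.0)  # singleton block: treated as fully uncertain (assumption)
    return gamma


# Five toy runs on six nodes; the last run disagrees with the majority.
runs = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 1, 0, 1, 0, 1],
]
kept = prune(runs, q=0.25)
C = cooccurrence(kept)
labels = consensus(C, p=0.5)
gamma = uncertainty(C, labels)
print(labels)  # [0, 0, 0, 1, 1, 1]
print(gamma)   # [0.125, 0.125, 0.25, 0.0, 0.0, 0.0]
```

In this toy input the fifth run is pruned, and node 2, which switches community in one of the four remaining runs, receives the highest γ.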

Figure 1: Variability of results of selected community detection algorithms on an LFR benchmark network with a nominal mixing parameter $\mu=0.40$. Top: distribution of the number of communities. Middle: similarity between pairs of partitions. Bottom: scatterplot of modularity versus similarity.

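Experimental Results

Research Questions

RQ1: How can uncertainty be quantified and incorporated into community detection results?
RQ2: Can a consensus-based procedure improve the stability of partitions produced by different algorithms or runs?
RQ3: How should outliers be identified and handled in the context of community detection?
RQ4: How does input-order bias affect results, and how can it be reduced?
RQ5: What is the relationship between node-level uncertainty γ and network topology, such as centrality or core structure?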

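Key Findings

As the number of iterations t increases, CCD becomes markedly more stable than a single trial and approaches an algorithm-specific plateau.
CCD provides a node-level uncertainty coefficient γ that identifies nodes with inconsistent community assignments, such as potential outliers.
CCD reduces input-order bias for most algorithms and offers a framework that is compatible with existing methods and improves reliability.
The method yields an interpretable representation of community structure with an explicit uncertainty measure, demonstrated on benchmarks including the Karate, RC, and LFR networks.
In the LFR benchmark, the uncertainty γ grows nonlinearly as the mixing parameter μ increases, and different algorithms show distinct uncertainty patterns.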
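The input-order bias mentioned in the findings arises because many algorithms visit nodes in the order they are stored. One way to expose an algorithm to different orders is to shuffle node ids and storage order before each run and then map the labels back. The sketch below assumes a graph stored as an adjacency dictionary; `detect_components` is a toy stand-in (connected components) for a real community detection algorithm, and both function names are hypothetical.

```python
import random


def detect_components(adj):
    """Toy stand-in for a community detection algorithm: connected components."""
    labels, next_label = {}, 0
    for start in adj:  # iteration follows the order in which nodes are stored
        if start in labels:
            continue
        stack = [start]
        while stack:
            u = stack.pop()
            if u not in labels:
                labels[u] = next_label
                stack.extend(adj[u])
        next_label += 1
    return labels


def run_on_permuted(adj, detect, rng):
    """Shuffle node ids and storage order, run detect, map labels back."""
    nodes = list(adj)
    shuffled = nodes[:]
    rng.shuffle(shuffled)
    new_id = dict(zip(nodes, shuffled))  # original id -> permuted id
    order = nodes[:]
    rng.shuffle(order)                   # order in which nodes are stored
    permuted = {new_id[u]: [new_id[v] for v in adj[u]] for u in order}
    result = detect(permuted)
    return {u: result[new_id[u]] for u in nodes}


# Two disconnected groups: a triangle {0, 1, 2} and an edge {3, 4}.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4], 4: [3]}
rng = random.Random(0)
labels = run_on_permuted(adj, detect_components, rng)

# Compare partitions as sets of node groups, since label values may swap.
groups = {frozenset(u for u in labels if labels[u] == lab) for lab in set(labels.values())}
print(groups == {frozenset({0, 1, 2}), frozenset({3, 4})})
```

Connected components do not depend on node order, so the recovered groups are the same for every shuffle. With an order-sensitive algorithm the groups may differ between shuffles, and that variation is what the consensus step is meant to absorb.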

Figure 2: Three alternative strategies to manage outliers: incorporate (left), highlight as single-node communities (center), or group into an outliers’ community (right). The top row shows the network; the bottom row shows a graph of the communities, labeled with the number of nodes in each community.
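The three strategies in Figure 2 amount to different ways of relabeling nodes that have already been flagged as outliers. The sketch below assumes the outliers are given as a set of node indices and that the input labels already place each outlier in some community, so "incorporate" leaves them unchanged. The function name and the use of -1 for the outliers' community are illustrative choices, not taken from the paper.

```python
def apply_outlier_strategy(labels, outliers, strategy):
    """Relabel flagged outliers according to one of three strategies."""
    relabelled = list(labels)
    if strategy == "incorporate":
        # Outliers stay in the community they were assigned to.
        return relabelled
    if strategy == "highlight":
        # Each outlier becomes its own single-node community.
        next_label = max(relabelled) + 1
        for i in sorted(outliers):
            relabelled[i] = next_label
            next_label += 1
        return relabelled
    if strategy == "group":
        # All outliers share one dedicated community, labeled -1 here.
        for i in outliers:
            relabelled[i] = -1
        return relabelled
    raise ValueError(f"unknown strategy: {strategy!r}")


labels = [0, 0, 0, 1, 1, 1]
outliers = {2, 5}
print(apply_outlier_strategy(labels, outliers, "incorporate"))  # [0, 0, 0, 1, 1, 1]
print(apply_outlier_strategy(labels, outliers, "highlight"))    # [0, 0, 2, 1, 1, 3]
print(apply_outlier_strategy(labels, outliers, "group"))        # [0, 0, -1, 1, 1, -1]
```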

This review was generated by AI and reviewed by human editors.