QUICK REVIEW

[Paper Review] Distributed Graph Clustering and Sparsification

He Sun, Luca Zanetti|arXiv (Cornell University)|Nov 3, 2017

Complex Network Analysis Techniques | References: 13 | Cited by: 3

One-Sentence Summary

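This paper proposes a simple distributed graph clustering algorithm built on a new sparsification technique that preserves cluster structure in large graphs. By sampling edges according to local conductance and spectral properties, the method reduces the number of edges to nearly linear in the number of nodes while keeping low-conductance clusters low-conductance, so clustering finishes in a poly-logarithmic number of rounds with very low communication overhead.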

ABSTRACT

Graph clustering is a fundamental computational problem with a number of applications in algorithm design, machine learning, data mining, and analysis of social networks. Over the past decades, researchers have proposed a number of algorithmic design methods for graph clustering. Most of these methods, however, are based on complicated spectral techniques or convex optimisation, and cannot be directly applied for clustering many networks that occur in practice, whose information is often collected on different sites. Designing a simple and distributed clustering algorithm is of great interest, and has wide applications for processing big datasets. In this paper we present a simple and distributed algorithm for graph clustering: for a wide class of graphs that are characterised by a strong cluster-structure, our algorithm finishes in a poly-logarithmic number of rounds, and recovers a partition of the graph close to optimal. One of the main components behind our algorithm is a sampling scheme that, given a dense graph as input, produces a sparse subgraph that provably preserves the cluster-structure of the input. Compared with previous sparsification algorithms that require Laplacian solvers or involve combinatorial constructions, this component is easy to implement in a distributed way and runs fast in practice.

Motivation and Objectives

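Design a simple, distributed graph clustering algorithm suited to large-scale networks and decentralized data.
Develop a sparsification method that sharply reduces the number of edges while preserving the cluster structure of dense graphs.
Achieve efficient clustering in distributed systems by minimizing per-round communication and computation.
Provide theoretical guarantees on cluster preservation and convergence time within a poly-logarithmic number of rounds.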

Proposed Method

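Proposes a sampling-based sparsification scheme that selects edges according to local conductance and the spectral gap (λ_{k+1}) so that cluster structure is preserved.
Uses a doubling technique to find a suitable sampling parameter τ ≥ C/λ_{k+1}, ensuring structural fidelity.
Runs in a distributed setting in which each node independently samples its incident edges based on local edge weights and degrees.
Applies spectral clustering to the sparsified graph to recover clusters close to the optimal partition of the original graph.
Introduces a label propagation mechanism for the distributed setting that assigns cluster labels with low misclassified volume.
Theoretical analysis shows that the sparsifier H preserves a gap of Υ_H(k) = Ω(Υ_G(k)/k) and a conductance of φ_H(S_i) = O(k·φ_G(S_i)) for every cluster S_i.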
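
The per-node sampling step listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the proposal probability min(1, τ·log n·w(u,v)/d_u), the rule that an edge is kept if either endpoint proposes it, and the inverse-probability reweighting are assumed forms chosen to match the description "local weights and degrees". The names `sparsify` and `two_cliques` are hypothetical.

```python
import math
import random


def sparsify(adj, tau, seed=0):
    """Degree-based edge sampling sketch.

    adj:  dict mapping node -> {neighbour: positive weight}, symmetric.
    tau:  sampling parameter (the summary's tau >= C / lambda_{k+1}).
    Returns {(u, v): new_weight} with u < v.
    """
    rng = random.Random(seed)
    log_n = math.log(len(adj))
    # Weighted degree of every node; each node only needs local information.
    degree = {u: sum(nbrs.values()) for u, nbrs in adj.items()}

    def p(u, v):
        # Assumed form: probability that endpoint u proposes edge {u, v}.
        return min(1.0, tau * log_n * adj[u][v] / degree[u])

    # Each node independently proposes a subset of its incident edges.
    proposed = set()
    for u in adj:
        for v in adj[u]:
            if rng.random() < p(u, v):
                proposed.add((min(u, v), max(u, v)))

    # An edge survives if either endpoint proposed it. Dividing by the
    # keep-probability preserves each edge weight in expectation.
    sparse = {}
    for u, v in proposed:
        keep = p(u, v) + p(v, u) - p(u, v) * p(v, u)
        sparse[(u, v)] = adj[u][v] / keep
    return sparse


def two_cliques(m):
    """Toy input: two unit-weight m-cliques joined by one bridge edge."""
    adj = {u: {} for u in range(2 * m)}
    for block in (range(m), range(m, 2 * m)):
        for u in block:
            for v in block:
                if u != v:
                    adj[u][v] = 1.0
    adj[m - 1][m] = adj[m][m - 1] = 1.0
    return adj


if __name__ == "__main__":
    g = two_cliques(30)
    h = sparsify(g, tau=1.6, seed=1)
    total = sum(len(nbrs) for nbrs in g.values()) // 2
    print(f"kept {len(h)} of {total} edges")
```

Because every proposal uses only a node's own degree and incident weights, the step needs no coordination beyond telling neighbours which edges were kept. Low-degree nodes keep most of their edges, while nodes inside dense clusters drop most of theirs.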

Experimental Results

Research Questions

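RQ1: Can a simple, distributed algorithm achieve near-optimal graph clustering in a poly-logarithmic number of rounds?
RQ2: How can a dense graph be sparsified while preserving its underlying cluster structure?
RQ3: Which sampling strategy ensures that clusters with low conductance in the original graph still have low conductance in the sparsified graph?
RQ4: What are the communication complexity and round complexity of such distributed clustering algorithms?
RQ5: How closely does spectral clustering on the sparsified graph approximate the clustering of the original graph?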

Key Findings

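The sparsification step runs in O(1) rounds with total communication O(nτ·log n), and τ = 1.6 was sufficient on all tested datasets; the full clustering algorithm takes a poly-logarithmic number of rounds.
The sparsified graphs keep between 0.14% and 3.13% of the original edges, while the clustering quality error stays within 0.1%.
For the Sculpture dataset (11,680 nodes, 68 million edges), only 0.37% of the edges (320,000) were sampled, giving a normalized cut value of 0.0935, very close to the 0.0938 of the original graph.
The conductance of each cluster in the sparsified graph is within an O(k) factor of its conductance in the original graph, ensuring structural fidelity.
The algorithm maintains Υ_H(k) = Ω(Υ_G(k)/k), preserving the spectral gap needed for the clusters to be well defined.
Visualizations and error ratios show that the clusterings of the original and sparsified graphs are almost identical across all datasets.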
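
The conductance guarantee in the findings above can be checked on concrete graphs with a small helper. The sketch below assumes the standard definition φ(S) = w(S, V∖S) / vol(S), which this review does not spell out. The function names and the constant `c` (a stand-in for the unspecified constant hidden in the O(k) bound) are hypothetical.

```python
def conductance(edges, cluster):
    """phi(S) = (weight leaving S) / vol(S), for edges {(u, v): weight}."""
    cluster = set(cluster)
    cut = 0.0
    volume = 0.0
    for (u, v), w in edges.items():
        in_u, in_v = u in cluster, v in cluster
        if in_u != in_v:
            cut += w  # the edge crosses the boundary of S
        # Each endpoint inside S contributes w to vol(S).
        volume += w * (in_u + in_v)
    return cut / volume if volume > 0 else 0.0


def within_factor(phi_h, phi_g, k, c=1.0):
    """Check phi_H <= c * k * phi_G; c is an assumed placeholder constant."""
    return phi_h <= c * k * phi_g


# Two unit-weight triangles joined by a single bridge edge.
edges = {(0, 1): 1.0, (0, 2): 1.0, (1, 2): 1.0,
         (3, 4): 1.0, (3, 5): 1.0, (4, 5): 1.0,
         (2, 3): 1.0}
phi = conductance(edges, {0, 1, 2})  # cut = 1, vol = 7
```

Evaluating `conductance` for the same cluster on the original and the sparsified edge sets, then passing both values to `within_factor`, gives a quick empirical check of the O(k) claim for a chosen constant.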


This review was generated by AI and checked by human editors.