QUICK REVIEW

[Paper Review] Fast Detection of Overlapping Communities via Online Tensor Methods

Furong Huang, U. N. Niranjan|arXiv (Cornell University)|Sep 3, 2013

Topic: Tensor decomposition and applications | References: 16 | Citations: 31

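One-Sentence Summary

This paper proposes a fast, scalable tensor-based method that detects overlapping communities in large-scale networks by applying stochastic gradient descent to multilinear spectral optimization. On real-world datasets (Facebook, Yelp, and DBLP) it recovers community memberships with high accuracy while running orders of magnitude faster than state-of-the-art methods such as the variational method.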

ABSTRACT

We present a fast tensor-based approach for detecting hidden overlapping communities under the Mixed Membership Stochastic Blockmodel (MMSB). We present two implementations, viz., a GPU-based implementation which exploits the parallelism of SIMD architectures and a CPU-based implementation for larger datasets, wherein the GPU memory does not suffice. Our GPU-based implementation involves a careful optimization of storage, data transfer and matrix computations. Our CPU-based implementation involves sparse linear algebraic operations which exploit the data sparsity. We use stochastic gradient descent for multilinear spectral optimization and this allows for flexibility in the tradeoff between node sub-sampling and accuracy of the results. We validate our results on datasets from Facebook, Yelp and DBLP where ground truth is available, using notions of p-values and false discovery rates, and obtain high accuracy for membership recovery. We compare our results, both in terms of execution time and accuracy, to the state-of-the-art algorithms such as the variational method, and report many orders of magnitude gain in the execution time. The tensor method is also applicable for unsupervised learning of a wide range of latent variable models, and we also demonstrate efficient recovery of topics from the New York Times dataset.

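Motivation and Goals

Address the challenge of efficiently detecting overlapping communities in large-scale networks with hidden community structure.
Overcome the computational limits of existing variational and tensor methods on large datasets.
Achieve scalable, accurate community detection by exploiting data sparsity and the parallelism of modern architectures.
Provide a flexible framework in which stochastic optimization trades node sub-sampling against estimation accuracy.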

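Proposed Method

Adopt the Mixed Membership Stochastic Blockmodel (MMSB) as the underlying generative model for overlapping communities.
Carry out multilinear spectral optimization with stochastic gradient descent, enabling online, incremental learning on large datasets.
Implement a GPU-based version that optimizes storage, data transfer, and matrix computations to exploit SIMD parallelism.
Develop a CPU-based variant that uses sparse linear algebra to handle datasets too large for GPU memory.
Use tensor decomposition to uncover the latent overlapping community structure from network data.
Allow a flexible trade-off between node sub-sampling and estimation accuracy through online learning.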
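
To make the generative model above concrete, here is a minimal NumPy sketch of sampling a small network from an MMSB-style model. It is an illustration under stated assumptions, not the paper's code: the function name `sample_mmsb`, the Dirichlet parameters, the block matrix values, and the choice of an undirected graph are all hypothetical. The per-pair community draws of the full MMSB are marginalized out, so each edge is drawn directly with probability pi_i^T P pi_j.

```python
import numpy as np

def sample_mmsb(n_nodes, alpha, block_matrix, rng):
    """Sample mixed memberships and an undirected graph (illustrative only)."""
    # Each node gets a membership vector on the simplex: it can belong
    # partially to several communities at once (this is the "overlap").
    memberships = rng.dirichlet(alpha, size=n_nodes)          # shape (n, k)
    # Marginal edge probability between nodes i and j: pi_i^T P pi_j.
    edge_prob = memberships @ block_matrix @ memberships.T    # shape (n, n)
    # Draw each unordered pair once (strict upper triangle), then mirror it.
    upper = np.triu(rng.random((n_nodes, n_nodes)) < edge_prob, k=1)
    adjacency = (upper | upper.T).astype(np.int8)
    return memberships, adjacency

rng = np.random.default_rng(0)
alpha = np.array([0.3, 0.3, 0.3])             # assumed concentration values
block_matrix = np.full((3, 3), 0.02)          # assumed cross-community density
np.fill_diagonal(block_matrix, 0.3)           # assumed within-community density
memberships, adjacency = sample_mmsb(200, alpha, block_matrix, rng)
print("edge density:", adjacency.mean())
```

Community detection in this setting means estimating `memberships` (and the block matrix) from `adjacency` alone. The small Dirichlet parameters make most nodes lean toward one community while still allowing mixed membership.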

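Experimental Results

Research Questions

RQ1: Can the tensor-based method detect overlapping communities significantly faster than state-of-the-art algorithms?
RQ2: How well does the proposed method recover true community memberships in real-world networks with known community structure?
RQ3: How far does the method scale on large datasets that exceed GPU memory?
RQ4: How does the use of stochastic gradient descent affect the accuracy and convergence of community detection?
RQ5: Can the method be extended effectively to unsupervised topic modeling, beyond community detection?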

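Key Findings

The GPU-based implementation runs many orders of magnitude faster than the state-of-the-art variational method.
On the Facebook, Yelp, and DBLP datasets, the method recovers community memberships with high accuracy, validated using p-values and false discovery rates.
The CPU-based implementation handles larger datasets by exploiting data sparsity through sparse linear algebraic operations.
Stochastic gradient descent provides a flexible trade-off between node sub-sampling and accuracy.
The method also applies beyond community detection, efficiently recovering topics from the New York Times dataset.
Overall, the tensor-based method combines strong scalability with high accuracy; the advantage the paper reports over existing methods is in execution time.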
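
The sub-sampling idea behind the stochastic gradient finding can be sketched on a toy problem. The code below is a simplified stand-in, not the paper's objective or implementation: it runs projected stochastic gradient ascent on the third-moment objective E[(x^T v)^3] / 3 over the unit sphere, assuming the data are already whitened so that the hidden components are orthonormal. All names (`make_stream`, `sgd_top_component`), the noise level, batch size, and step size are hypothetical choices for illustration.

```python
import numpy as np

def make_stream(components, weights, noise, rng):
    """Return a function that draws fresh mini-batches of noisy samples."""
    dim, k = components.shape
    def draw(batch_size):
        labels = rng.choice(k, size=batch_size, p=weights)
        clean = components[:, labels].T                   # (batch, dim)
        return clean + noise * rng.standard_normal((batch_size, dim))
    return draw

def sgd_top_component(draw, steps=300, batch_size=128, lr=0.5):
    """Projected stochastic gradient ascent on E[(x^T v)^3] / 3."""
    # Start from the normalized mean of one mini-batch.
    v = draw(batch_size).mean(axis=0)
    v /= np.linalg.norm(v)
    for _ in range(steps):
        x = draw(batch_size)              # only a sub-sample is touched per step
        proj = x @ v                      # (batch,)
        grad = (proj ** 2) @ x / batch_size   # mini-batch estimate of E[(x^T v)^2 x]
        v = v + lr * grad                 # ascent step
        v /= np.linalg.norm(v)            # project back onto the unit sphere
    return v

rng = np.random.default_rng(0)
# Assumed toy setup: 3 orthonormal components in 10 dimensions.
components, _ = np.linalg.qr(rng.standard_normal((10, 3)))
weights = np.array([0.5, 0.3, 0.2])
draw = make_stream(components, weights, noise=0.05, rng=rng)
v = sgd_top_component(draw)
alignment = np.abs(components.T @ v)      # |cosine| with each true component
print("alignment with each component:", np.round(alignment, 3))
```

Each step uses only one mini-batch, so the batch size controls the trade-off between per-step cost and gradient noise. In this toy setup the iterate is expected to align with the most heavily weighted component; recovering the remaining components would need an additional step such as deflation, which is omitted here.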


This review was generated by AI and reviewed by human editors.