QUICK REVIEW

[Paper Review] Impact of regularization on Spectral Clustering

Antony Joseph, Bin Yu | arXiv (Cornell University) | Dec 5, 2013

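Complex Network Analysis Techniques | References: 22 | Cited by: 18

One-Sentence Summary

This paper gives a theoretical analysis of regularization in spectral clustering under the stochastic block model, showing that with regularization cluster recovery can rest on the growth of the maximum degree rather than on a minimum-degree assumption. It also proposes DKest, a data-driven method that chooses the regularization parameter τ by minimizing an estimated Davis-Kahan bound, and reports improved performance on both simulated and real networks.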

ABSTRACT

The performance of spectral clustering can be considerably improved via regularization, as demonstrated empirically in Amini et al. (2012). Here, we provide an attempt at quantifying this improvement through theoretical analysis. Under the stochastic block model (SBM), and its extensions, previous results on spectral clustering relied on the minimum degree of the graph being sufficiently large for its good performance. By examining the scenario where the regularization parameter $\tau$ is large, we show that the minimum degree assumption can potentially be removed. As a special case, for an SBM with two blocks, the results require the maximum degree to be large (grow faster than $\log n$) as opposed to the minimum degree. More importantly, we show the usefulness of regularization in situations where not all nodes belong to well-defined clusters. Our results rely on a 'bias-variance'-like trade-off that arises from understanding the concentration of the sample Laplacian and the eigen gap as a function of the regularization parameter. As a byproduct of our bounds, we propose a data-driven technique \textit{DKest} (standing for estimated Davis-Kahan bounds) for choosing the regularization parameter. This technique is shown to work well through simulations and on a real data set.

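Motivation and Objectives

To understand theoretically how regularization improves the performance of spectral clustering in community detection.
To use regularization to remove the minimum-degree assumption required by earlier analyses of spectral clustering.
To address cluster recovery in networks where low-degree nodes do not belong to well-defined communities.
To develop a data-driven method, based on the theoretical bounds, for choosing the regularization parameter τ.
To demonstrate the effectiveness of the proposed method on simulated and real network data.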

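Proposed Method

Analyzes regularized spectral clustering (RSC) under the stochastic block model (SBM) and an extension of it with weakly connected communities.
Characterizes a bias-variance-like trade-off between the eigengap and the concentration of the sample Laplacian, both viewed as functions of the regularization parameter τ.
Derives a high-probability bound on the spectral norm of the Laplacian difference that decays at rate 1/τ as τ grows, improving on the earlier 1/√τ bound.
Proposes DKest, a data-dependent procedure that estimates the Davis-Kahan bound over a grid of τ values and selects the τ that minimizes it.
Extends DKest to the degree-corrected SBM by estimating edge probabilities from cluster memberships and node degrees.
Builds the regularized population Laplacian from the estimated edge probabilities and node degrees in order to compute the τ-dependent bound.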
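The regularized spectral clustering step that the method list starts from can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: it assumes the Amini et al. style regularization A_τ = A + (τ/n)·11ᵀ with L_τ = D_τ^{-1/2} A_τ D_τ^{-1/2}, and the function names, the small farthest-point k-means, the SBM parameters, and the choice τ = average degree are all hypothetical choices made for the example.

```python
import numpy as np


def regularized_laplacian(A, tau):
    """L_tau = D_tau^{-1/2} A_tau D_tau^{-1/2}, with A_tau = A + (tau/n) * 11^T.

    Assumes tau > 0, so every regularized degree d_i + tau is positive.
    """
    n = A.shape[0]
    A_tau = A + tau / n                 # adds tau/n to every entry
    d_tau = A_tau.sum(axis=1)           # regularized degrees d_i + tau
    inv_sqrt = 1.0 / np.sqrt(d_tau)
    return inv_sqrt[:, None] * A_tau * inv_sqrt[None, :]


def top_eigenvectors(L, k):
    """Eigenvectors of symmetric L for the k largest-magnitude eigenvalues."""
    vals, vecs = np.linalg.eigh(L)
    order = np.argsort(-np.abs(vals))[:k]
    return vecs[:, order]


def kmeans(X, k, n_iter=50):
    """Plain Lloyd iterations with a deterministic farthest-point start."""
    centers = [X[0]]
    for _ in range(1, k):
        d = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):     # keep the old center if a cluster empties
                centers[j] = X[labels == j].mean(axis=0)
    return labels


def regularized_spectral_clustering(A, k, tau):
    """Cluster the rows of the top-k eigenvector matrix of L_tau."""
    V = top_eigenvectors(regularized_laplacian(A, tau), k)
    return kmeans(V, k)


def sample_sbm(n, p, q, rng):
    """Two equal blocks: edge probability p within a block, q between blocks."""
    z = np.repeat([0, 1], n // 2)
    P = np.where(z[:, None] == z[None, :], p, q)
    upper = np.triu((rng.random((n, n)) < P).astype(float), k=1)
    return upper + upper.T, z


rng = np.random.default_rng(0)
A, z = sample_sbm(200, 0.4, 0.05, rng)
tau = A.sum(axis=1).mean()              # illustrative choice: average degree
labels = regularized_spectral_clustering(A, k=2, tau=tau)
agreement = np.mean(labels == z)
accuracy = max(agreement, 1 - agreement)  # labels are only defined up to a swap
print(f"accuracy: {accuracy:.2f}")
```

In this toy setting the blocks are dense and well separated, so the example only shows the mechanics; the regime the paper targets (low-degree nodes, sparse graphs) would need a different simulation design.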

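Experimental Results

Research Questions

RQ1: Can regularization in spectral clustering remove the need for a minimum-degree assumption in community detection?
RQ2: How does regularization affect cluster recovery when low-degree nodes do not belong to well-defined communities?
RQ3: What is the theoretical relationship between the regularization parameter τ, the concentration of the sample Laplacian, and the eigengap?
RQ4: Can a data-driven method for choosing τ be developed by estimating theoretical bounds on the eigenvector error?
RQ5: How does the proposed DKest method compare, in clustering accuracy, with empirically chosen values of τ?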

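Key Findings

For an SBM with two blocks, cluster recovery is achievable when the maximum degree grows faster than log n; the minimum degree need not satisfy such a condition.
Regularizing with a large τ effectively removes the influence of low-degree nodes that do not belong to well-defined clusters, improving the separation between communities in the eigenvectors.
The theoretical bound on the spectral norm of the Laplacian difference decays at rate 1/τ for large τ, improving on the earlier 1/√τ rate.
The eigengap also decays at rate 1/τ for large τ, which indicates a balance between bias and variance in eigenvector estimation.
DKest selects τ by minimizing the estimated Davis-Kahan bound and outperforms fixed choices of τ on both simulated and real data.
Extended to the degree-corrected SBM, DKest gives robust parameter selection in networks with strongly heterogeneous node degrees.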
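The trade-off behind these findings can be written schematically as a Davis-Kahan-type ratio. The notation here is assumed for illustration and constants are omitted: L_τ is the sample regularized Laplacian, 𝓛_τ its population counterpart, μ_{K,τ} the K-th largest-magnitude eigenvalue of 𝓛_τ, hats denote plug-in estimates, and 𝒯 is a grid of candidate τ values. It is the generic form of the bound, not a restatement of the paper's theorems.

```latex
\[
\lVert \sin\Theta(\hat{V}_\tau, V_\tau) \rVert
\;\lesssim\;
\frac{\lVert L_\tau - \mathcal{L}_\tau \rVert}{\mu_{K,\tau}},
\qquad
\hat{\tau}_{\mathrm{DKest}} \in \arg\min_{\tau \in \mathcal{T}}
\frac{\lVert L_\tau - \hat{\mathcal{L}}_\tau \rVert}{\hat{\mu}_{K,\tau}} .
\]
```

The numerator plays the variance-like role (how far the sample Laplacian is from its population version) and the denominator the bias-like role (how much eigengap is left after regularizing). Since both shrink as τ grows, neither a very small nor a very large τ is automatically best, which is the motivation for minimizing the estimated ratio over a grid.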

This review was generated by AI and reviewed by a human editor.