QUICK REVIEW

[Paper Review] Query Complexity of Clustering with Side Information

Arya Mazumdar, Barna Saha | arXiv (Cornell University) | Jun 23, 2017

Facility Location and Emergency Management | Cited by 32

One-Sentence Summary

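This paper studies the query complexity of clustering with side information and shows that a similarity matrix reduces the number of pairwise queries from Θ(nk) to O(k² log n / H²(f₊∥f₋)), where H² is the squared Hellinger divergence. The bound is information-theoretically optimal within a logarithmic factor, and the algorithms require no prior knowledge of k, f₊, or f₋.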

ABSTRACT

Suppose we are given a set of $n$ elements to be clustered into $k$ (unknown) clusters, and an oracle/expert labeler that can interactively answer pair-wise queries of the form, ``do two elements $u$ and $v$ belong to the same cluster?''. The goal is to recover the optimum clustering by asking the minimum number of queries. In this paper, we provide a rigorous theoretical study of this basic problem of query complexity of interactive clustering, and give strong information-theoretic lower bounds, as well as nearly matching upper bounds. Most clustering problems come with a similarity matrix, which is used by an automated process to cluster similar points together. To improve the accuracy of clustering, a fruitful approach in recent years has been to ask a domain expert or crowd to obtain labeled data interactively. Many heuristics have been proposed, and all of these use a similarity function to come up with a querying strategy. Even so, there is a lack of systematic theoretical study. Our main contribution in this paper is to show the dramatic power of side information, i.e., a similarity matrix, in reducing the query complexity of clustering. A similarity matrix represents noisy pair-wise relationships, such as those computed by some function on attributes of the elements. A natural noisy model is where similarity values are drawn independently from some arbitrary probability distribution $f_+$ when the underlying pair of elements belong to the same cluster, and from some $f_-$ otherwise. We show that given such a similarity matrix, the query complexity reduces drastically from $\Theta(nk)$ (no similarity matrix) to $O(\frac{k^2\log{n}}{\mathcal{H}^2(f_+\|f_-)})$, where $\mathcal{H}^2$ denotes the squared Hellinger divergence. Moreover, this is also information-theoretically optimal within an $O(\log{n})$ factor. Our algorithms are all efficient and parameter free, i.e., they work without any knowledge of $k$, $f_+$ and $f_-$, and depend only logarithmically on $n$.
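The noisy model described in the abstract can be illustrated with a small simulation. The sketch below is not taken from the paper: the function name `sample_similarity_matrix`, the sampler arguments, and the Gaussian choices for f₊ and f₋ are all assumptions made purely for illustration (the abstract allows arbitrary distributions).

```python
import random

def sample_similarity_matrix(labels, sample_plus, sample_minus, seed=0):
    """Draw a symmetric similarity matrix under the abstract's noisy model.

    labels[i] is the hidden cluster of element i. sample_plus and
    sample_minus are hypothetical callables that take a random.Random
    and return one draw from f_+ and f_- respectively.
    """
    rng = random.Random(seed)
    n = len(labels)
    W = [[0.0] * n for _ in range(n)]  # diagonal is left at 0.0 (unused)
    for u in range(n):
        for v in range(u + 1, n):
            # Same-cluster pairs draw from f_+, all other pairs from f_-.
            draw = sample_plus if labels[u] == labels[v] else sample_minus
            W[u][v] = W[v][u] = draw(rng)
    return W

# Illustrative choice only: f_+ = N(1, 1) and f_- = N(0, 1).
labels = [0, 0, 0, 1, 1, 2]
W = sample_similarity_matrix(
    labels,
    sample_plus=lambda rng: rng.gauss(1.0, 1.0),
    sample_minus=lambda rng: rng.gauss(0.0, 1.0),
)
```

Each off-diagonal entry is an independent draw, so a single entry says little about a pair; the side information becomes useful only in aggregate, which is why the bound depends on how far apart f₊ and f₋ are.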

Motivation and Objectives

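Rigorously analyze the query complexity of interactive clustering with a pairwise-query oracle.
Study how side information, in the form of a similarity matrix, reduces the number of queries required.
Establish tight information-theoretic lower bounds and nearly matching upper bounds on the query complexity.
Design efficient, parameter-free algorithms that need no prior knowledge of k, f₊, or f₋.
Show that the proposed approach is theoretically optimal within a logarithmic factor.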

Proposed Method

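The paper models similarity values as draws from one of two distributions: f₊ for same-cluster pairs and f₋ for different-cluster pairs.
The squared Hellinger divergence H²(f₊∥f₋) measures how statistically distinguishable same-cluster pairs are from different-cluster pairs.
The proposed algorithms use the similarity matrix to guide query selection, focusing on pairs with high discriminative power.
A recursive clustering strategy adaptively queries pairs based on similarity scores and the confidence of current cluster assignments.
The method is parameter free: it relies on no prior knowledge of k, f₊, or f₋, and its query count grows only logarithmically in n.
The theoretical analysis combines information-theoretic lower bounds with constructive upper bounds to establish near-optimality.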
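To make the role of H²(f₊∥f₋) concrete, here is a minimal sketch that computes the squared Hellinger divergence for discrete distributions. It assumes the normalization H² = 1 − Σᵢ √(pᵢ qᵢ), which lies in [0, 1]; other texts use a different constant factor, and the paper's exact convention is not restated in this review. The function name and the Bernoulli example are illustrative assumptions.

```python
import math

def squared_hellinger(p, q):
    """Squared Hellinger divergence between two discrete distributions.

    Assumed normalization: H^2 = 1 - sum_i sqrt(p_i * q_i), so the value
    is 0 for identical distributions and 1 for disjoint supports.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same support size")
    bhattacharyya = sum(math.sqrt(a * b) for a, b in zip(p, q))
    return 1.0 - bhattacharyya

# Illustrative example: binary similarities, where a same-cluster pair is
# marked "similar" with probability 0.7 and a different-cluster pair with 0.3.
f_plus = [0.3, 0.7]   # [P(dissimilar), P(similar)] under f_+
f_minus = [0.7, 0.3]  # [P(dissimilar), P(similar)] under f_-
h2 = squared_hellinger(f_plus, f_minus)  # roughly 0.0835
```

The closer f₊ and f₋ are, the smaller H² becomes and the larger the k² log n / H² query bound grows; when the two distributions coincide, the similarity matrix carries no information at all.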

Experimental Results

Research Questions

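RQ1: How does the presence of a similarity matrix affect the query complexity of clustering with an oracle?
RQ2: What is the information-theoretic lower bound on the number of queries needed to recover the true clustering?
RQ3: Can efficient, parameter-free algorithms be designed whose query complexity approaches the information-theoretic limit?
RQ4: How does the squared Hellinger divergence H²(f₊∥f₋) quantify the benefit of side information in reducing the number of queries?
RQ5: Is the proposed query complexity optimal within a logarithmic factor?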

Key Findings

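Without a similarity matrix the query complexity is Θ(nk); with a similarity matrix it drops to O(k² log n / H²(f₊∥f₋)).
The proposed algorithms achieve this complexity without prior knowledge of k, f₊, or f₋.
The upper bound is information-theoretically optimal within an O(log n) factor.
The squared Hellinger divergence H²(f₊∥f₋) quantifies the statistical separation between same-cluster and different-cluster pairs.
The algorithms are efficient and their query count grows only logarithmically in n, which makes them suitable for large-scale clustering tasks.
The theoretical framework ties similarity quality closely to query efficiency.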
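A rough sense of the gap between the two bounds can be obtained by plugging in numbers. The sketch below compares only orders of growth: the hidden constants in Θ(·) and O(·) are dropped, the natural logarithm is an arbitrary choice, and the function names and sample values of n, k, and H² are assumptions for illustration, not figures from the paper.

```python
import math

def queries_without_side_info(n, k):
    """Order-of-growth proxy for the Theta(nk) bound (constants dropped)."""
    return n * k

def queries_with_side_info(n, k, h2):
    """Order-of-growth proxy for O(k^2 log n / H^2) (constants dropped)."""
    if not 0.0 < h2 <= 1.0:
        raise ValueError("h2 must lie in (0, 1]")
    return k * k * math.log(n) / h2

# Hypothetical instance: one million elements, ten clusters, weak side information.
n, k, h2 = 1_000_000, 10, 0.0835
baseline = queries_without_side_info(n, k)
with_matrix = queries_with_side_info(n, k, h2)
print(f"without similarity matrix: ~{baseline:.3g}")
print(f"with similarity matrix:    ~{with_matrix:.3g}")
```

The comparison is only meaningful when k is small relative to n / log n; as k grows or H² shrinks toward zero, the k² log n / H² term can exceed nk and the side information stops helping.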


This review was generated by AI and checked by a human editor.