QUICK REVIEW

[Paper Review] Almost Asymptotically Optimal Active Clustering Through Pairwise Observations

Rachel S. Y. Teo, P. N. Karthik|arXiv (Cornell University)|Feb 5, 2026

Advanced Clustering Algorithms Research | Cited by 0

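One-Sentence Summary

The paper derives an instance-dependent lower bound for active clustering with noisy pairwise queries, and proposes an asymptotically optimal algorithm together with a practical algorithm (A3CNP) that is delta-correct at stopping and whose sample complexity stays within a constant multiple of the lower bound.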

ABSTRACT

We propose a new analysis framework for clustering $M$ items into an unknown number of $K$ distinct groups using noisy and actively collected responses. At each time step, an agent is allowed to query pairs of items and observe bandit binary feedback. If the pair of items belongs to the same (resp. different) cluster, the observed feedback is $1$ with probability $p>1/2$ (resp. $q<1/2$). Leveraging the ubiquitous change-of-measure technique, we establish a fundamental lower bound on the expected number of queries needed to achieve a desired confidence in the clustering accuracy, formulated as a sup-inf optimization problem. Building on this theoretical foundation, we design an asymptotically optimal algorithm in which the stopping criterion involves an empirical version of the inner infimum -- the Generalized Likelihood Ratio (GLR) statistic -- being compared to a threshold. We develop a computationally feasible variant of the GLR statistic and show that its performance gap to the lower bound can be accurately empirically estimated and remains within a constant multiple of the lower bound.

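Motivation and Objectives

Model clustering with noisy pairwise queries as a bandit-like active learning problem.
Derive an instance-dependent lower bound on the number of queries required for reliable clustering.
Design an asymptotically optimal sampling and stopping framework grounded in information-theoretic principles.
Propose a computationally tractable variant that retains near-optimal performance.
Provide a practical algorithm (A3CNP) with provably delta-correct stopping and a quantifiable suboptimality bound.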

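Proposed Method

Model each pair of items as a Bernoulli arm whose mean is p if the two items share a cluster and q otherwise, where p > 1/2 > q and both p and q are unknown.
Derive a sup-inf lower bound D*(C) on the sample complexity via a change-of-measure argument and KL divergences.
Reduce the search over Alt(C) to a smaller set min(C) so that D*(C) can be evaluated tractably.
Propose a D-Tracking-style sampling rule driven by the estimate Ct projected onto the feasible set C.
Introduce a GLR-based stopping statistic Z(t) that is compared against a threshold beta(t, delta) to guarantee delta-correctness.
Provide a computationally feasible surrogate statistic hatZ(t) in place of Z(t), yielding a practical stopping rule that preserves delta-correctness.
Present A3CNP, which combines the sampling rule, the tractable stopping rule, and the projection step.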
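
The pairwise observation model in the first item can be made concrete with a small simulation. The following is a minimal sketch, not the authors' code: the names `make_oracle` and `bernoulli_kl`, the example labels, and the values p = 0.8 and q = 0.3 are all assumptions chosen for illustration. It shows each pair acting as a Bernoulli arm and the Bernoulli KL divergence that lower bounds of this type are built from.

```python
import math
import random


def make_oracle(labels, p, q, seed=0):
    """Return query(i, j) -> 0/1 under the noisy pairwise model.

    labels[i] is the (hidden) cluster of item i. A same-cluster pair
    returns 1 with probability p, a different-cluster pair with
    probability q. All names here are illustrative assumptions.
    """
    rng = random.Random(seed)

    def query(i, j):
        prob = p if labels[i] == labels[j] else q
        return 1 if rng.random() < prob else 0

    return query


def bernoulli_kl(a, b):
    """KL divergence between Bernoulli(a) and Bernoulli(b), in nats."""
    eps = 1e-12  # clip away from 0 and 1 to keep the logs finite
    a = min(max(a, eps), 1 - eps)
    b = min(max(b, eps), 1 - eps)
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))


# Four items in two clusters; p and q are example values only.
labels = [0, 0, 1, 1]
query = make_oracle(labels, p=0.8, q=0.3, seed=1)

n = 2000
same_rate = sum(query(0, 1) for _ in range(n)) / n  # items 0, 1 share a cluster
diff_rate = sum(query(0, 2) for _ in range(n)) / n  # items 0, 2 do not
print(same_rate, diff_rate, bernoulli_kl(0.8, 0.3))
```

The empirical rates should settle near p and q respectively. The KL value measures how distinguishable a same-cluster pair is from a different-cluster pair per query, so a larger gap between p and q should mean fewer queries are needed.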

Figure 1: The asymptotic ( $\delta\to 0$ ) sample complexity of $\mathrm{A}^{3}\mathrm{CNP}$ , with varying $\epsilon$ (first argument) and $\sigma$ (second argument) values, relative to the active clustering algorithm of [ 10 ] . Also included in the plot are the theoretical lower ( 3 ) and upper b

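Experimental Results

Research Questions

RQ1: What is the instance-dependent lower bound on the expected number of pairwise queries needed to recover the clustering with high confidence?
RQ2: How can sampling and stopping rules be designed to achieve near-optimal (or asymptotically optimal) sample complexity in active clustering with noisy observations?
RQ3: Can a computationally feasible variant approximate the GLR stopping rule while preserving delta-correctness?
RQ4: How can provable guarantees on clustering accuracy be maintained when p and q are unknown?
RQ5: How large is the suboptimality gap introduced by the practical approximation relative to the information-theoretic lower bound?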
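
For orientation on RQ1 and RQ5, it may help to see the shape that sup-inf lower bounds usually take in fixed-confidence pure exploration. The block below is that generic template written in this problem's notation, not a transcription of the paper's theorem. It assumes that D*(C) denotes the value of the sup-inf problem, that $w$ ranges over the probability simplex $\Delta$ on item pairs, and that p and q are held fixed for readability. Since the paper treats p and q as unknown, its alternative set presumably ranges over the noise parameters as well, and its normalization may differ.

```latex
% Generic fixed-confidence template (an assumption, not the paper's exact statement).
\[
  \mathbb{E}[\tau_\delta] \;\ge\; \frac{\mathrm{kl}(\delta,\, 1-\delta)}{D^*(C)},
  \qquad
  D^*(C) \;=\; \sup_{w \in \Delta}\; \inf_{C' \in \mathrm{Alt}(C)}\;
  \sum_{i<j} w_{ij}\, \mathrm{kl}\bigl(\mu_{ij}(C),\, \mu_{ij}(C')\bigr),
\]
\[
  \mu_{ij}(C) \;=\;
  \begin{cases}
    p, & \text{if } i \text{ and } j \text{ share a cluster in } C,\\
    q, & \text{otherwise,}
  \end{cases}
  \qquad
  \mathrm{kl}(x, y) \;=\; x \log\frac{x}{y} + (1-x)\log\frac{1-x}{1-y}.
\]
```

Here $\tau_\delta$ is the stopping time of a delta-correct algorithm and $\mathrm{kl}(\delta, 1-\delta)$ grows like $\log(1/\delta)$ as $\delta \to 0$. Under this reading, the gap in RQ5 is the ratio between an algorithm's expected stopping time and the right-hand side of the first inequality.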

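Key Findings

Establishes an instance-dependent lower bound on sample complexity through a sup-inf optimization over pairwise KL divergences.
Derives an asymptotically optimal algorithm whose stopping rule is based on an empirical GLR statistic, along with a tractable variant whose gap to the lower bound is controlled.
Uses a projection step to constrain the problem to the feasible instance set C, ensuring that the stopping and sampling rules are well defined.
Proposes a computationally feasible surrogate stopping statistic that preserves delta-correctness and improves practical efficiency.
A3CNP combines D-Tracking sampling with the tractable stopping rule and projection, achieving near-optimal performance within provable bounds.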
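
To illustrate what a GLR-style stopping statistic computes, here is a toy sketch that enumerates every partition of a handful of items. It is deliberately not the paper's method. It assumes p and q are known (the paper treats them as unknown), it uses brute-force enumeration that is only viable for very small M (the paper instead reduces the search to a smaller alternative set), and the threshold is a placeholder rather than the paper's beta(t, delta). The function names and the example counts are hypothetical.

```python
import math


def partitions(items):
    """Yield every set partition of `items` as a list of blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        yield [[first]] + part  # `first` forms its own block
        for k in range(len(part)):  # or joins one existing block
            yield part[:k] + [[first] + part[k]] + part[k + 1:]


def log_likelihood(part, counts, p, q):
    """Log-likelihood of the data under partition `part`.

    counts[(i, j)] = (number of 1-responses, number of queries).
    """
    block_of = {x: b for b, block in enumerate(part) for x in block}
    ll = 0.0
    for (i, j), (ones, total) in counts.items():
        prob = p if block_of[i] == block_of[j] else q
        ll += ones * math.log(prob) + (total - ones) * math.log(1 - prob)
    return ll


def glr_statistic(counts, n_items, p, q):
    """Log-likelihood gap between the best partition and the runner-up."""
    scored = sorted(
        ((log_likelihood(part, counts, p, q), part)
         for part in partitions(list(range(n_items)))),
        key=lambda s: s[0],
        reverse=True,
    )
    (best_ll, best), (second_ll, _) = scored[0], scored[1]
    return best_ll - second_ll, best


# Hypothetical data for 3 items: pair (0, 1) mostly answers 1, the rest mostly 0.
counts = {(0, 1): (9, 10), (0, 2): (1, 10), (1, 2): (2, 10)}
delta = 0.05
threshold = math.log(1 / delta)  # placeholder, NOT the paper's beta(t, delta)

z, estimate = glr_statistic(counts, 3, p=0.8, q=0.3)
stop = z > threshold
print(z, estimate, stop)
```

The statistic grows as the data increasingly favor one partition over every alternative, and the procedure stops once it clears the threshold. The number of partitions grows as the Bell numbers, which is why a tractable surrogate such as the one highlighted in the findings matters in practice.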

This review was generated by AI and reviewed by human editors.