QUICK REVIEW

[Paper Review] A Simple Sublinear-Time Algorithm for Counting Arbitrary Subgraphs via Edge Sampling

Sepehr Assadi, Michael Kapralov|arXiv (Cornell University)|Nov 19, 2018

Complexity and Algorithms in Graphs | Cited by 31

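One-Sentence Summary

The paper presents a simple sublinear-time algorithm that estimates the number of copies of an arbitrary subgraph $ H $ in a large graph $ G $ using degree, neighbor, pair, and edge-sample queries. The algorithm achieves a $ (1\pm\varepsilon) $-approximation in $ O^*\left(\frac{m^{\rho(H)}}{\#H}\right) $ time, matching the optimal bounds known for triangles and cliques, extends those results to all subgraphs via edge sampling, and settles a conjecture about avoiding terms that depend on $ n $.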

ABSTRACT

In the subgraph counting problem, we are given an input graph $G(V, E)$ and a target graph $H$; the goal is to estimate the number of occurrences of $H$ in $G$. Our focus here is on designing sublinear-time algorithms for approximately counting occurrences of $H$ in $G$ in the setting where the algorithm is given query access to $G$. This problem has been studied in several recent papers which primarily focused on specific families of graphs $H$ such as triangles, cliques, and stars. However, not much is known about approximate counting of arbitrary graphs $H$. This is in sharp contrast to the closely related subgraph enumeration problem that has received significant attention in the database community as the database join problem. The AGM bound shows that the maximum number of occurrences of any arbitrary subgraph $H$ in a graph $G$ with $m$ edges is $O(m^{\rho(H)})$, where $\rho(H)$ is the fractional edge-cover of $H$, and enumeration algorithms with matching runtime are known for any $H$. We bridge this gap between subgraph counting and subgraph enumeration by designing a sublinear-time algorithm that can estimate the number of any arbitrary subgraph $H$ in $G$, denoted by $\#H$, to within a $(1\pm \epsilon)$-approximation w.h.p. in $O(\frac{m^{\rho(H)}}{\#H}) \cdot \mathrm{poly}(\log{n},1/\epsilon)$ time. Our algorithm is allowed the standard set of queries for general graphs, namely degree queries, pair queries and neighbor queries, plus an additional edge-sample query that returns an edge chosen uniformly at random. The performance of our algorithm matches those of Eden et al. [FOCS 2015, STOC 2018] for counting triangles and cliques and extends them to all choices of subgraph $H$ under the additional assumption of edge-sample queries. We further show that our algorithm works for the more general database join size estimation problem and prove a matching lower bound for this problem.
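
For readers who want the fractional edge-cover quantity spelled out, the standard definition is the optimum of the linear program below. The variable name $x_e$ is notation chosen here and does not come from the abstract.

$$
\rho(H) \;=\; \min \sum_{e \in E(H)} x_e \quad \text{subject to} \quad \sum_{e \ni v} x_e \ge 1 \ \text{ for all } v \in V(H), \qquad x_e \ge 0 \ \text{ for all } e \in E(H).
$$

For a triangle, setting $x_e = 1/2$ on all three edges is feasible, and summing the three vertex constraints gives $2\sum_e x_e \ge 3$, so $\rho(K_3) = 3/2$ and the AGM bound specializes to $O(m^{3/2})$ triangles.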

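Research Motivation and Goals

Design a sublinear-time algorithm that estimates the number of occurrences of an arbitrary subgraph $ H $ in a large graph $ G $ under standard query access.
Bridge the gap between subgraph enumeration and subgraph counting by extending sublinear algorithms from specific families, such as triangles and cliques, to general subgraphs.
Achieve optimal query complexity that matches the known bounds for triangles and cliques, while using edge-sample queries to eliminate the additive term that depends on $ n $.
Establish a matching lower bound for the more general problem of estimating colorful subgraph counts, proving that the query complexity is optimal.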

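Proposed Method

The algorithm uses degree, neighbor, pair, and edge-sample queries: it samples edges uniformly at random and explores the local neighborhoods of their endpoints.
It exploits the fractional edge-cover number $ \rho(H) $ of the subgraph $ H $, which governs the asymptotic complexity of subgraph counting.
The core idea is to estimate the number of $ H $-copies efficiently through edge sampling, by sampling edges and estimating the probability that a random edge belongs to an $ H $-copy.
The algorithm combines these estimates with a random sampling strategy to compute a $ (1\pm\varepsilon) $-approximation of $ \#H $, the number of subgraphs of $ G $ isomorphic to $ H $.
It generalizes the method to colorful subgraphs, modeled as the database natural-join size estimation problem, and proves a matching lower bound in that setting.
The analysis relies on probabilistic arguments and proves the query-complexity lower bound through carefully constructed graph distributions, showing that $ \Omega(m) $ queries are needed without edge sampling.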
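
The edge-sampling idea described above can be illustrated for the special case where $ H $ is a triangle. The sketch below is a simplified, unbiased estimator written for this review and is not the paper's full algorithm: it omits the variance control and the choice of sample count that the analysis depends on. The function name `estimate_triangles` and its parameters are hypothetical, and the adjacency dictionaries stand in for the query oracle.

```python
import random


def estimate_triangles(edges, num_samples, seed=0):
    """Estimate the triangle count of a simple undirected graph.

    Simplified illustration only (an assumption of this sketch, not the
    paper's exact procedure): sample an edge uniformly, sample a neighbor
    of its lower-degree endpoint, and check whether the triple closes.
    """
    rng = random.Random(seed)

    # Adjacency sets simulate the query oracle of the sublinear model.
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    nbrs = {v: sorted(ns) for v, ns in adj.items()}

    m = len(edges)
    total = 0.0
    for _ in range(num_samples):
        u, v = rng.choice(edges)           # edge-sample query
        if len(adj[u]) > len(adj[v]):      # two degree queries
            u, v = v, u                    # u is the lower-degree endpoint
        w = rng.choice(nbrs[u])            # neighbor query on u
        if w in adj[v]:                    # pair query: is (v, w) an edge?
            # Each triangle is reachable from its 3 edges, hence the / 3.
            total += m * len(adj[u]) / 3
    return total / num_samples


if __name__ == "__main__":
    # Complete graph on 5 vertices, which contains 10 triangles.
    k5 = [(i, j) for i in range(5) for j in range(i + 1, 5)]
    print(estimate_triangles(k5, 20000, seed=1))
```

Conditioned on the sampled edge $ e = (u, v) $, the expected contribution is $ m \cdot t(e) / 3 $, where $ t(e) $ is the number of triangles through $ e $, so averaging over a uniform edge gives exactly the triangle count in expectation. How many samples are needed for a $ (1\pm\varepsilon) $ guarantee is what the paper's analysis addresses; this sketch makes no claim about it.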

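Results

Research Questions

RQ1: Is there a sublinear-time algorithm that can estimate the number of copies of an arbitrary subgraph $ H $ in a graph $ G $, rather than only special cases such as triangles or cliques?
RQ2: Does adding edge-sample queries enable an optimal sublinear algorithm that avoids the additive term depending on $ n $ in the query complexity?
RQ3: Is the query complexity $ O^*\left(\frac{m^{\rho(H)}}{\#H}\right) $ still tight for subgraph counting even when edge-sample queries are available?
RQ4: Can the framework be generalized to the broader problem of estimating the number of colorful subgraphs, which corresponds to natural-join size estimation in databases?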

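Key Findings

The proposed algorithm achieves a $ (1\pm\varepsilon) $-approximation of the number of $ H $-copies in $ G $ using $ O^*\left(\min\left\{m, \frac{m^{\rho(H)}}{\#H}\right\} \right) $ queries and $ O^*\left(\frac{m^{\rho(H)}}{\#H}\right) $ time.
For $ k $-cliques, the algorithm matches the best known bound of Eden et al., but uses edge-sample queries to avoid their additive $ O^*\left(\frac{n}{(\#K_k)^{1/k}}\right) $ term.
The query complexity of the algorithm is optimal up to polylogarithmic factors, as shown by the $ \Omega\left(\frac{m^{\rho(H)}}{\#H}\right) $ lower bound for the general colorful subgraph estimation problem.
The lower-bound construction uses two graph distributions, $ \mathcal{G}_0 $ and $ \mathcal{G}_1 $, that differ only in the presence of $ m^{\rho(H)-1} $ colorful $ H $-copies, to show that $ \Omega(m) $ queries are needed without edge sampling.
The work settles a conjecture of Eden and Rosenbaum by showing that edge-sample queries remove the need for $ n $-dependent terms in subgraph counting.
The framework generalizes to database natural-join size estimation, modeled through colorful subgraphs, and the lower bound continues to hold in this broader setting, which establishes optimality.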
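
To make the bounds above concrete, the following sketch computes $ \rho(H) $ for small pattern graphs and evaluates the leading term $ \min\{m, m^{\rho(H)}/\#H\} $. It assumes the known fact that the fractional edge-cover LP on a graph has a half-integral optimal solution, so a brute-force search over weights in $ \{0, 1/2, 1\} $ suffices for tiny $ H $. The function names and the example values of $ m $ and $ \#H $ are hypothetical and chosen only for illustration.

```python
from fractions import Fraction
from itertools import product


def fractional_edge_cover_number(vertices, edges):
    """Brute-force rho(H) over half-integral edge weights.

    Minimizes the total edge weight subject to every vertex receiving
    weight >= 1 from its incident edges. Exponential in len(edges), so
    only suitable for small patterns. Returns None if H has an isolated
    vertex, since no cover exists in that case.
    """
    choices = (Fraction(0), Fraction(1, 2), Fraction(1))
    best = None
    for weights in product(choices, repeat=len(edges)):
        covered = all(
            sum(w for w, e in zip(weights, edges) if v in e) >= 1
            for v in vertices
        )
        if covered:
            total = sum(weights)
            if best is None or total < best:
                best = total
    return best


def query_bound(m, rho, count_h):
    """Leading term min(m, m^rho / #H); poly(log n, 1/eps) factors ignored."""
    return min(m, m ** float(rho) / count_h)


if __name__ == "__main__":
    triangle = [(0, 1), (1, 2), (0, 2)]
    five_cycle = [(i, (i + 1) % 5) for i in range(5)]
    star3 = [(0, 1), (0, 2), (0, 3)]
    print(fractional_edge_cover_number(range(3), triangle))    # 3/2
    print(fractional_edge_cover_number(range(5), five_cycle))  # 5/2
    print(fractional_edge_cover_number(range(4), star3))       # 3
    # Hypothetical graph with 10^6 edges and 10^6 triangles.
    print(query_bound(10**6, Fraction(3, 2), 10**6))
```

For the triangle, $ \rho = 3/2 $ recovers the familiar $ m^{3/2}/\#H $ term; with the hypothetical values above the bound is about $ 10^3 $, far below $ m $. When $ \#H $ is small the minimum switches to $ m $, which corresponds to reading the whole graph.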

This review was generated by AI and checked by a human editor.