QUICK REVIEW

[Paper Review] Approximate Nearest Neighbors Search Without False Negatives For l_2 For c>sqrt{loglog{n}}

Piotr Sankowski, Piotr Wygocki|arXiv (Cornell University)|Jan 1, 2017

Computational Geometry and Mesh Generation | References: 6 | Cited by: 1

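One-Sentence Summary

This paper presents a new data structure for c-approximate near neighbor search without false negatives in high-dimensional l2 space, with poly-logarithmic query time and polynomial preprocessing time for any c = ω(√(log log n)). It simplifies the problem through dimension reduction and builds on a max-l2 near neighbor primitive, removing the earlier requirement that c = Ω(√d) and improving efficiency on large, high-entropy datasets.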

ABSTRACT

In this paper, we report progress on answering the open problem presented by Pagh [11], who considered the near neighbor search without false negatives for the Hamming distance. We show new data structures for solving the c-approximate near neighbors problem without false negatives for Euclidean high dimensional space \mathbb{R}^d. These data structures work for any c = \omega(\sqrt{\log\log n}), where n is the number of points in the input set, with poly-logarithmic query time and polynomial pre-processing time. This improves over the known algorithms, which require c to be \Omega(\sqrt{d}). This improvement is obtained by applying a sequence of reductions, which are interesting on their own. First, we reduce the problem to d instances of dimension logarithmic in n. Next, these instances are reduced to a number of c-approximate near neighbor search without false negatives instances in \big(\mathbb{R}^k\big)^L space equipped with metric m(x,y) = \max_{1 \le i \le L} \|x_i - y_i\|_2.
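
The metric in the last sentence of the abstract can be written out directly. The following is a minimal Python sketch, assuming a point of (\mathbb{R}^k)^L is stored as a sequence of L blocks, each a sequence of k floats. The function name max_l2_distance is an illustrative choice and does not come from the paper.

```python
import math


def max_l2_distance(x, y):
    """Return m(x, y) = max over blocks i of ||x_i - y_i||_2.

    x and y are sequences of L blocks; each block is a sequence of k floats.
    (This representation is an assumption made for illustration.)
    """
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("x and y must have the same positive number of blocks")
    # math.dist computes the Euclidean distance between two equal-length points.
    return max(math.dist(xi, yi) for xi, yi in zip(x, y))


# Two points with L = 2 blocks of dimension k = 2.
x = [(0.0, 0.0), (1.0, 1.0)]
y = [(3.0, 4.0), (1.0, 1.0)]
distance = max_l2_distance(x, y)  # the first block dominates the maximum
```

Two points are close under m only if every block is close in l2, which is why the metric behaves like l_infinity across blocks and like l2 within each block.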

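Motivation and Objectives

Solve c-approximate near neighbor search without false negatives in high-dimensional l2 space for approximation factors c = ω(√(log log n)).
Remove the earlier requirement that c = Ω(√d), so that the algorithm becomes practical in high dimensions.
Design a data structure with poly-logarithmic query time and polynomial preprocessing time.
Use dimension reduction to turn the high-dimensional problem into a series of low-dimensional subproblems.
Give a deterministic guarantee of no false negatives, so that an existing near neighbor is never missed.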
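
The guarantee in the objectives above can be stated as a small checker. This is a hedged sketch of the standard c-approximate near neighbor condition with radius r, not code from the paper: if some data point lies within distance r of the query, the answer must be a data point within distance c·r, and returning nothing counts as a false negative. The name satisfies_guarantee and the radius parameter r are assumptions introduced for illustration.

```python
import math


def satisfies_guarantee(points, q, r, c, answer):
    """Check one query answer against the c-approximate near neighbor
    guarantee without false negatives (Euclidean distance).

    points: list of data points (tuples of floats)
    q: query point, r: radius, c: approximation factor (c >= 1)
    answer: the point returned by some data structure, or None
    """
    near_exists = any(math.dist(p, q) <= r for p in points)
    if answer is None:
        # Returning nothing is acceptable only when no point lies within r.
        return not near_exists
    # Any returned point must come from the data set and lie within c * r.
    return answer in points and math.dist(answer, q) <= c * r


points = [(0.0, 0.0), (3.0, 0.0)]
q = (0.5, 0.0)
ok = satisfies_guarantee(points, q, r=1.0, c=2.0, answer=(0.0, 0.0))
missed = satisfies_guarantee(points, q, r=1.0, c=2.0, answer=None)
```

A randomized scheme may fail this check with some probability on a given query; a structure without false negatives has to pass it on every query.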

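Proposed Method

Use the dimension reduction of Corollary 5 to turn the original d-dimensional problem into d instances of dimension O(log n).
Convert each reduced instance into a max-l2 near neighbor problem with metric m(x,y) = max_{1≤i≤L} ||x_i - y_i||_2.
Apply a sequence of hashing-based reductions using locality-sensitive hashing (LSH), with emphasis on minimizing false positives.
Use a multi-level hashing scheme with w iterations, in which each point generates 3^{wL} hash values, to filter candidates efficiently.
Solve each subproblem with the max-l2 NN primitive, which has a bounded false-positive probability, then filter the candidates with exact distance checks.
Tune the number of iterations w to balance preprocessing time against query time, achieving sublinear dependence on n.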
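
To make the idea of hashing without false negatives concrete, here is a deliberately simple grid scheme in Python. It is an illustrative assumption, not the paper's construction: each point is stored under its own grid cell and all neighboring cells (3 keys per coordinate), so any point within distance r of the query is guaranteed to share a bucket with it, and an exact distance check then removes false positives. The names cell, build_index and query are hypothetical, and floating-point rounding at cell boundaries is ignored.

```python
import itertools
import math
from collections import defaultdict


def cell(point, r):
    """Grid cell of side r containing the point."""
    return tuple(math.floor(v / r) for v in point)


def build_index(points, r):
    """Store every point under its cell and all 3^dim neighboring cells."""
    index = defaultdict(list)
    for p in points:
        base = cell(p, r)
        for offset in itertools.product((-1, 0, 1), repeat=len(base)):
            key = tuple(b + o for b, o in zip(base, offset))
            index[key].append(p)
    return index


def query(index, q, r, c):
    """Return a point within c * r of q, or None.

    If ||p - q||_2 <= r, every coordinate differs by at most r, so the cells
    of p and q differ by at most 1 per coordinate and p is in q's bucket.
    """
    for p in index.get(cell(q, r), []):
        if math.dist(p, q) <= c * r:  # exact check filters false positives
            return p
    return None


points = [(0.0, 0.0), (5.0, 5.0)]
index = build_index(points, r=1.0)
hit = query(index, (0.9, 0.2), r=1.0, c=2.0)
miss = query(index, (3.0, 3.0), r=1.0, c=2.0)
```

The number of keys per point here is 3 raised to the dimension, which is only workable in low dimension; that cost is one reason a reduction to low-dimensional instances matters before any scheme of this kind is applied.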

Results

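Research Questions

RQ1: Can c-approximate near neighbor search without false negatives be solved efficiently in high-dimensional l2 space for c = ω(√(log log n))?
RQ2: Is it possible to lower the requirement on c from Ω(√d) to o(√d) while keeping poly-logarithmic query time?
RQ3: Which reductions turn the high-dimensional l2 problem into manageable subproblems while preserving the absence of false negatives?
RQ4: Can the max-l2 NN primitive be used effectively to solve the NNwfn (near neighbor without false negatives) problem with sublinear query time?
RQ5: Are the derived time bounds optimal, or can they be improved further?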

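Key Findings

The paper achieves query time Õ(d² + d n^{o(1)} |P|) and preprocessing time Õ(d²n + d n^{1 + ln 3 / ln(c/μ) + 1/f(n)}), where μ = D √(f(n) log log n).
The algorithm works for any c = ω(√(log log n)), a substantial improvement over earlier methods that require c = Ω(√d).
The number of subproblems after dimension reduction is bounded by Õ(n^{1/f(n)}), which is subpolynomial in n when f(n) is a slowly growing function.
Using max-l2 NN as the primitive rules out false negatives deterministically while keeping queries efficient.
For constant c, the query time is sublinear in n, and the preprocessing time stays polynomial in n and d.
The framework is general and can be extended to other metric spaces by adapting the underlying hashing and reduction techniques.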
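
The preprocessing bound above depends on n through the exponent 1 + ln 3 / ln(c/μ) + 1/f(n). The short Python sketch below evaluates only that exponent. It treats μ and f(n) as plain numbers supplied by the caller, since this summary does not define D or f, and the function name preprocessing_exponent is an illustrative choice.

```python
import math


def preprocessing_exponent(c, mu, f_n):
    """Exponent of n in the term n^{1 + ln 3 / ln(c/mu) + 1/f(n)}.

    c: approximation factor, mu: the quantity called μ in the summary,
    f_n: the value of f(n). All three are caller-supplied assumptions.
    """
    if mu <= 0 or c <= mu:
        raise ValueError("need c > mu > 0 so that ln(c / mu) is positive")
    if f_n <= 0:
        raise ValueError("f(n) must be positive")
    return 1.0 + math.log(3.0) / math.log(c / mu) + 1.0 / f_n


# A larger ratio c / mu shrinks the middle term of the exponent.
small_ratio = preprocessing_exponent(c=9.0, mu=1.0, f_n=4.0)
large_ratio = preprocessing_exponent(c=81.0, mu=1.0, f_n=4.0)
```

With c/μ = 9 and f(n) = 4 the middle term is ln 3 / ln 9 = 1/2, so the exponent is 1.75; as c/μ and f(n) both grow, the exponent approaches 1, meaning near-linear dependence on n in that term.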

This review was generated by AI and checked by a human editor.