QUICK REVIEW

[Paper Review] HyperMinHash: Jaccard index sketching in LogLog space.

Yun William Yu, Griffin M. Weber|arXiv (Cornell University)|Oct 23, 2017

Adversarial Robustness in Machine Learning | Cited by 8

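TL;DR

HyperMinHash is a streaming probabilistic sketch that estimates the Jaccard index of two sets A and B using only O(log l + log log |A ∪ B|) bits per bucket, far fewer than the O(log |A ∪ B|) bits per bucket that MinHash needs. Its error is O(1/l + √(k/δ)); with 64 KiB of memory it can estimate Jaccard indices of 0.01 with about 5% relative error for set cardinalities on the order of 10^19, whereas MinHash with the same memory is limited to cardinalities of about 10^10.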

ABSTRACT

In this extended abstract, we describe and analyse a streaming probabilistic sketch, HYPERMINHASH, to estimate the Jaccard index (or Jaccard similarity coefficient) over two sets $A$ and $B$. HyperMinHash can be thought of as a compression of standard MinHash by building off of a HyperLogLog count-distinct sketch. Given Jaccard index $\delta$, using $k$ buckets of size $O(\log(l) + \log\log(|A \cup B|))$ (in practice, typically 2 bytes) per set, HyperMinHash streams over $A$ and $B$ and generates an estimate of the Jaccard index $\delta$ with error $O(1/l + \sqrt{k/\delta})$. This improves on the best previously known sketch, MinHash, which requires the same number of storage units (buckets), but using $O(\log(|A \cup B|))$ bit per bucket. For instance, our new algorithm allows estimating Jaccard indices of 0.01 for set cardinalities on the order of $10^{19}$ with relative error of around 5% using 64KiB of memory; the previous state-of-the-art MinHash can only estimate Jaccard indices for cardinalities of $10^{10}$ with the same memory consumption. Alternately, one can think of HyperMinHash as an augmentation of b-bit MinHash that enables streaming updates, unions, and cardinality estimation (and thus intersection cardinality by way of Jaccard), while using $\log\log$ extra bits.
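For reference, the quantities named in the abstract are related by the standard definitions below (textbook identities, not quoted from the paper), with $\delta$ denoting the Jaccard index as in the abstract:

$$
\delta = J(A, B) = \frac{|A \cap B|}{|A \cup B|}, \qquad |A \cap B| = \delta \cdot |A \cup B|
$$

The second identity is what "intersection cardinality by way of Jaccard" amounts to: a sketch that yields both an estimate of $\delta$ and an estimate of $|A \cup B|$ also yields an estimate of $|A \cap B|$.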

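Motivation and Goals

Design a streaming sketch that estimates Jaccard similarity with far less memory than MinHash.
Under tight memory budgets, estimate the Jaccard index accurately for very large sets (cardinalities on the order of 10^19).
Fold cardinality estimation and union operations into one compact sketch that supports streaming.
Reduce per-bucket storage from MinHash's O(log |A ∪ B|) bits to O(log l + log log |A ∪ B|) bits, close to LogLog space.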
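To make the gap between log and log log concrete, the following sketch compares the two terms for the cardinalities mentioned in the abstract. It ignores constant factors and the log l term, and the function names are illustrative only, so treat the numbers as orders of magnitude rather than the paper's exact accounting.

```python
import math

def log_bits(n):
    """Bits needed to write down a value in the range [0, n): the O(log n) term."""
    return math.ceil(math.log2(n))

def loglog_bits(n):
    """Bits needed to write down a position in such a value: the O(log log n) term."""
    return math.ceil(math.log2(math.log2(n)))

for n in (10**10, 10**19):
    print(f"n = {n:.0e}: log term = {log_bits(n)} bits, log log term = {loglog_bits(n)} bits")
```

For a union of 10^19 elements this gives roughly 64 bits against 6 bits, so a 2-byte bucket still has about 10 bits left over for the log l part.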

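Proposed Method

HyperMinHash compresses standard MinHash by building on a HyperLogLog count-distinct sketch.
It uses k buckets of O(log l + log log |A ∪ B|) bits each, in practice typically 2 bytes per bucket, which keeps the sketch compact.
The algorithm streams over A and B, maintains hash values in the buckets, and applies probabilistic counting techniques to estimate the Jaccard similarity.
The sketch also supports HyperLogLog-style cardinality estimation; the intersection cardinality then follows from the Jaccard estimate and the union cardinality.
The sketch natively supports streaming updates, unions, and cardinality estimation, extending what b-bit MinHash can do.
The stated error bound is O(1/l + √(k/δ)), where k is the number of buckets, δ is the Jaccard index, and l is the parameter behind the log l term in the per-bucket size.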
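The steps above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the parameters `P`, `Q`, `R`, the use of `hashlib.blake2b`, and the 16-bit bucket encoding are assumptions chosen for illustration, and the estimator omits any correction for accidental bucket collisions. Each bucket keeps a leading-zero count (the log log part) plus a few bits that follow the leading one (the log l part), packed so that a larger stored value always corresponds to a smaller hash.

```python
import hashlib
from array import array

P, Q, R = 10, 6, 10      # hypothetical: 2**P buckets, Q + R = 16 bits per bucket
W = 128 - P              # hash bits left after the bucket index is taken
CAP = 2**Q - 2           # largest stored leading-zero count; code 0 means "empty"

def bucket_and_value(item):
    """Map an item to (bucket index, 16-bit code); larger code = smaller hash."""
    digest = hashlib.blake2b(repr(item).encode(), digest_size=16).digest()
    h = int.from_bytes(digest, "big")
    index = h >> W                              # top P bits pick the bucket
    rest = h & ((1 << W) - 1)                   # remaining W bits
    lz = min(W - rest.bit_length(), CAP)        # leading zeros, capped to fit Q bits
    mantissa = (rest >> (W - lz - 1 - R)) & ((1 << R) - 1)  # R bits after the leading one
    return index, ((lz + 1) << R) | ((1 << R) - 1 - mantissa)

def sketch(items):
    """Stream over items, keeping the code of the smallest hash seen per bucket."""
    buckets = array("H", [0]) * (1 << P)
    for item in items:
        i, v = bucket_and_value(item)
        if v > buckets[i]:
            buckets[i] = v
    return buckets

def union(a, b):
    """Sketch of the union: per-bucket maximum of the two codes."""
    return array("H", (max(x, y) for x, y in zip(a, b)))

def jaccard(a, b):
    """Fraction of non-empty buckets whose codes agree (no collision correction)."""
    filled = sum(1 for x, y in zip(a, b) if x or y)
    same = sum(1 for x, y in zip(a, b) if x and x == y)
    return same / filled if filled else 0.0

set_a, set_b = range(0, 3000), range(1000, 4000)   # true Jaccard index is 0.5
print(round(jaccard(sketch(set_a), sketch(set_b)), 3))
```

Because each bucket stores only the code of the smallest hash it has seen, merging two sketches is a per-bucket maximum, which is why unions and streaming updates come for free in this layout.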

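Experimental Results

Research Questions

RQ1: Can a sketch be designed whose memory use approaches that of HyperLogLog while retaining MinHash's accuracy for estimating Jaccard similarity?
RQ2: What error bound is achievable when MinHash is compressed using HyperLogLog counting principles?
RQ3: Can the sketch support streaming updates, unions, and cardinality estimation within a single compact structure?
RQ4: How does the method's memory efficiency scale as set cardinality grows, especially beyond 10^10 elements?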

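Key Findings

With only 64 KiB of memory, HyperMinHash achieves a relative error of about 5% for sets with cardinality on the order of 10^19 and a Jaccard index of 0.01.
The method reduces per-bucket storage from MinHash's O(log |A ∪ B|) bits to O(log l + log log |A ∪ B|) bits, a substantial memory saving.
Under the same 64 KiB budget, HyperMinHash handles set cardinalities on the order of 10^19, whereas MinHash handles only about 10^10.
The sketch natively supports streaming updates, unions, and cardinality estimation, extending b-bit MinHash in a compact form.
The stated error bound O(1/l + √(k/δ)) depends on the bucket count k, the parameter l, and the Jaccard index δ; the 1/l term shrinks as l grows, and the 64 KiB example shows that the estimate remains usable at a Jaccard index as low as 0.01.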
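As a rough consistency check on the 64 KiB figure, the snippet below assumes 2-byte buckets and uses the textbook binomial sampling error of a k-bucket MinHash-style estimator, √((1 − δ)/(kδ)). That formula is a standard approximation brought in for illustration, not the paper's bound, and the function names are hypothetical.

```python
import math

def buckets_for(memory_bytes, bytes_per_bucket=2):
    """Number of buckets that fit in the given memory budget."""
    return memory_bytes // bytes_per_bucket

def relative_std_error(k, jaccard):
    """Relative standard error of a binomial proportion estimated from k buckets."""
    return math.sqrt((1 - jaccard) / (k * jaccard))

k = buckets_for(64 * 1024)
print(k, round(relative_std_error(k, 0.01), 3))   # prints: 32768 0.055
```

With 32768 buckets and a Jaccard index of 0.01, the sampling error alone comes to about 5.5%, which is in line with the "around 5%" quoted in the abstract.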


This review was generated by AI and reviewed by human editors.