QUICK REVIEW

[Paper Review] Similarity Join and Self-Join Size Estimation in a Streaming Environment.

Davood Rafiei, Fan Deng|arXiv (Cornell University)|Jun 8, 2018

Data Quality and Management | Cited by 1

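ONE-SENTENCE SUMMARY

This paper proposes a one-pass, sublinear-space algorithm for estimating the sizes of similarity self-joins and similarity joins over streaming data, where the similarity between two records ranges from 1 to d. Given the same space budget, its estimation error is up to an order of magnitude lower than that of existing methods across a wide range of similarity thresholds.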

ABSTRACT

We study the problem of similarity self-join and similarity join size estimation in a streaming setting where the goal is to estimate, in one scan of the input and with sublinear space in the input size, the number of record pairs that have a similarity within a given threshold. The problem has many applications in data cleaning and query plan generation, where the cost of a similarity join may be estimated before actually doing the join. On unary input where two records either match or don't match, the problem becomes join and self-join size estimation for which one-pass algorithms are readily available. Our work addresses the problem for d-ary input, for d >= 1, where the degree of similarity can vary from 1 to d. We show that our proposed algorithm gives an accurate estimate and scales well with the input size. We provide error bounds and time and space costs, and conduct an extensive experimental evaluation of our algorithm, comparing its estimation accuracy to a few competitors, including some multi-pass algorithms. Our results show that given the same space, the proposed algorithm has an order of magnitude less error for a large range of similarity thresholds.
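
To make the quantity being estimated concrete, the following minimal sketch computes it exactly by brute force. It assumes (this is not spelled out in the abstract) that a record is a tuple of d column values and that the similarity of two records is the number of columns on which they agree; the function names are illustrative only. It takes quadratic time and stores the whole input, which is what a streaming estimator must avoid, so it is useful only as a ground-truth baseline on small inputs.

```python
from itertools import combinations


def similarity(r1, r2):
    """Number of columns on which two equal-length records agree (assumed measure)."""
    return sum(a == b for a, b in zip(r1, r2))


def self_join_size(records, s):
    """Count unordered pairs of distinct records that agree on at least s columns."""
    return sum(1 for r1, r2 in combinations(records, 2) if similarity(r1, r2) >= s)


def join_size(left, right, s):
    """Count pairs (one record from each input) that agree on at least s columns."""
    return sum(1 for r1 in left for r2 in right if similarity(r1, r2) >= s)


# Four 3-ary records (d = 3), made up for illustration.
records = [("a", "x", 1), ("a", "x", 2), ("a", "y", 2), ("b", "y", 3)]
print(self_join_size(records, 1))  # 4
print(self_join_size(records, 2))  # 2
print(self_join_size(records, 3))  # 0
```

With d = 1 every record is a single value and the count reduces to the ordinary self-join size, which is the unary case the abstract refers to.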

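MOTIVATION AND OBJECTIVES

Address the challenge of estimating similarity join and self-join sizes in a streaming setting with limited memory.
Provide a one-pass algorithm that scales efficiently as the input grows while remaining accurate.
Supply error bounds and space/time cost analysis to support practical deployment.
Achieve better estimation accuracy than existing multi-pass and one-pass algorithms under the same space constraints.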

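PROPOSED METHOD

The algorithm processes the data in a single pass, maintaining a compact summary structure from which it estimates the number of record pairs whose similarity meets a given threshold.
It uses probabilistic sampling and sketching techniques tailored to d-ary similarity values (d ≥ 1).
It adopts a similarity-sensitive sampling strategy that weights records by their potential to form similar pairs.
Concentration inequalities are applied to derive theoretical bounds on the estimation error at different similarity thresholds.
The algorithm dynamically adjusts its sampling rate according to the similarity distribution observed in the stream.
Space-efficient data structures keep space usage sublinear in the input size.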
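
As an illustration of what a one-pass sketch looks like, here is a minimal version of the well-known AMS "tug-of-war" sketch, which handles the unary case (d = 1) mentioned in the abstract. It is not the paper's algorithm for d-ary input. It estimates the self-join size F2, the sum of squared key frequencies, using a fixed number of counters regardless of stream length. The class name, parameter defaults, and the choice of hash family are assumptions made for this example, and keys are assumed to be non-negative integers.

```python
import random
import statistics

P = (1 << 61) - 1  # a Mersenne prime, used as the size of the hash field


class AMSSketch:
    """Tug-of-war sketch for the second frequency moment F2 = sum(f_i ** 2)."""

    def __init__(self, groups=7, per_group=64, seed=0):
        rng = random.Random(seed)
        self.groups = groups
        self.per_group = per_group
        total = groups * per_group
        # One random cubic polynomial per counter yields (approximately)
        # 4-wise independent +1/-1 signs, as the variance analysis requires.
        self.coeffs = [tuple(rng.randrange(P) for _ in range(4)) for _ in range(total)]
        self.counters = [0] * total
        self.n = 0  # number of stream items seen so far

    @staticmethod
    def _sign(coeffs, x):
        a, b, c, d = coeffs
        h = (((a * x + b) * x + c) * x + d) % P
        return 1 if h & 1 else -1

    def update(self, key):
        """Process one stream item: add its sign to every counter."""
        x = key % P
        self.n += 1
        for i, cf in enumerate(self.coeffs):
            self.counters[i] += self._sign(cf, x)

    def estimate_f2(self):
        """Median of group means of squared counters."""
        pg = self.per_group
        means = [
            sum(z * z for z in self.counters[g * pg:(g + 1) * pg]) / pg
            for g in range(self.groups)
        ]
        return statistics.median(means)

    def estimate_matching_pairs(self):
        """Unordered pairs of distinct records with equal keys: (F2 - n) / 2."""
        return (self.estimate_f2() - self.n) / 2


# Synthetic stream: 50 keys, where key k occurs (k % 5 + 1) * 4 times.
stream = [k for k in range(50) for _ in range((k % 5 + 1) * 4)]
sketch = AMSSketch()
for key in stream:
    sketch.update(key)

exact_f2 = sum(stream.count(k) ** 2 for k in set(stream))  # 8800
print("exact F2:", exact_f2, "estimated F2:", sketch.estimate_f2())
```

Averaging squared counters within a group reduces variance, and taking the median across groups makes a badly wrong estimate unlikely; raising per_group or groups trades more space for lower error.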

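EXPERIMENTAL RESULTS

Research Questions

RQ1: In a streaming setting, can a one-pass algorithm estimate similarity self-join size accurately using sublinear space?
RQ2: How does the proposed method's estimation error compare with that of multi-pass and one-pass competitors at different similarity thresholds?
RQ3: Can theoretical error bounds be established for the proposed algorithm under different input distributions?
RQ4: How well does the algorithm scale as the input grows and the similarity threshold varies?
RQ5: What is the trade-off between space usage and estimation accuracy in the proposed method?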

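Key Findings

Given the same amount of main memory, the proposed algorithm's estimation error is an order of magnitude lower than that of existing methods over a large range of similarity thresholds.
The algorithm stays accurate across a wide range of similarity thresholds, outperforming both one-pass and multi-pass competitors.
Theoretical error bounds are established and validated under realistic streaming assumptions.
The algorithm scales well as the input grows, keeping space and time costs low.
The experimental evaluation confirms that the method consistently reduces error on a variety of synthetic and real datasets.
The accuracy advantage is most pronounced when similarity varies widely, a scenario in which conventional methods degrade.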

This review was generated by AI and reviewed by a human editor.