QUICK REVIEW

[Paper Review] Space-Optimal Profile Estimation in Data Streams with Applications to Symmetric Functions

Justin Y. Chen, Piotr Indyk|arXiv (Cornell University)|Nov 29, 2023

Machine Learning and Algorithms | Cited by 2

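One-Sentence Summary

This paper presents a space-optimal streaming algorithm for estimating the profile of a data stream, that is, the number of elements that appear exactly i times, for many frequencies i simultaneously. The algorithm achieves optimal space in two regimes: O(1/ε² + log n) bits when the L1 error over the first τ entries is at most εD, and O(1/ε² log(1/ε) + log n + log log m) bits when the total L1 error over all frequencies is at most εm. Matching lower bounds establish optimality in both.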

ABSTRACT

We revisit the problem of estimating the profile (also known as the rarity) in the data stream model. Given a sequence of $m$ elements from a universe of size $n$, its profile is a vector $ϕ$ whose $i$-th entry $ϕ_i$ represents the number of distinct elements that appear in the stream exactly $i$ times. A classic paper by Datar and Muthukrishnan from 2002 gave an algorithm which estimates any entry $ϕ_i$ up to an additive error of $\pm εD$ using $O(1/ε^2 (\log n + \log m))$ bits of space, where $D$ is the number of distinct elements in the stream. In this paper, we considerably improve on this result by designing an algorithm which simultaneously estimates many coordinates of the profile vector $ϕ$ up to small overall error. We give an algorithm which, with constant probability, produces an estimated profile $\hat{ϕ}$ with the following guarantees in terms of space and estimation error:
- For any constant $τ$, with $O(1 / ε^2 + \log n)$ bits of space, $\sum_{i=1}^τ |ϕ_i - \hat{ϕ}_i| \leq εD$.
- With $O(1/ ε^2\log (1/ε) + \log n + \log \log m)$ bits of space, $\sum_{i=1}^m |ϕ_i - \hat{ϕ}_i| \leq εm$.
In addition to bounding the error across multiple coordinates, our space bounds separate the terms that depend on $1/ε$ and those that depend on $n$ and $m$. We prove matching lower bounds on space in both regimes. Application of our profile estimation algorithm gives estimates within error $\pm εD$ of several symmetric functions of frequencies in $O(1/ε^2 + \log n)$ bits. This generalizes space-optimal algorithms for the distinct elements problem to other problems including estimating the Huber and Tukey losses as well as frequency cap statistics.
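To make the definition of the profile concrete, the following is a minimal offline sketch that computes the exact profile by storing every frequency. It is only a reference point for the definition, not the paper's streaming algorithm, and the names `exact_profile` and `l1_error` are hypothetical helpers introduced here.

```python
from collections import Counter


def exact_profile(stream):
    """Return {i: phi_i}, where phi_i counts distinct elements seen exactly i times.

    This stores every element's frequency, so it uses far more space than a
    streaming sketch; it only illustrates the definition of the profile.
    """
    frequencies = Counter(stream)               # element -> frequency
    return dict(Counter(frequencies.values()))  # frequency -> number of elements


def l1_error(true_profile, estimated_profile, tau):
    """Sum of |phi_i - hat_phi_i| over the first tau coordinates."""
    return sum(
        abs(true_profile.get(i, 0) - estimated_profile.get(i, 0))
        for i in range(1, tau + 1)
    )


stream = ["a", "b", "a", "c", "d", "a", "b"]
phi = exact_profile(stream)               # a: 3, b: 2, c: 1, d: 1
D = sum(phi.values())                     # number of distinct elements
m = sum(i * c for i, c in phi.items())    # stream length recovered from the profile
```

In this toy stream, `phi` is `{1: 2, 2: 1, 3: 1}`, so D = 4 and m = 7. The two guarantees in the abstract bound `l1_error` by εD for a constant τ and by εm for τ = m.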

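Motivation and Objectives

Design a streaming algorithm that estimates the profile vector φ (the frequencies of frequencies) in minimal space while keeping the L1 error small across many coordinates.
Use profile estimation to generalize space-optimal estimation to symmetric functions of the frequencies, such as the number of distinct elements, the Huber loss, and frequency cap statistics.
Separate the dependence of the space bound on 1/ε from its dependence on n and m, yielding tighter and more modular bounds.
Prove matching lower bounds in both error regimes, establishing the optimality of the proposed algorithms.
Improve on prior work for symmetric function estimation by reducing space complexity, particularly once the space needed to store randomness and hash functions is accounted for.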
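A symmetric function of the frequencies can be written as a weighted sum over the profile, Σ_i φ_i · g(i), which is why a good profile estimate yields estimates of these functions. The sketch below evaluates three such functions on an exact profile. The Huber loss uses the standard textbook form with threshold `delta`, which is an assumption here; the paper's normalization may differ. All function names are hypothetical.

```python
def symmetric_sum(profile, g):
    """Evaluate sum_i phi_i * g(i) for a profile given as {i: phi_i}."""
    return sum(count * g(i) for i, count in profile.items())


def distinct_elements(profile):
    # g(i) = 1: every distinct element contributes once.
    return symmetric_sum(profile, lambda i: 1)


def frequency_cap(profile, cap):
    # g(i) = min(i, cap): each element's frequency is clipped at the cap.
    return symmetric_sum(profile, lambda i: min(i, cap))


def huber_loss(profile, delta):
    # Assumed textbook Huber form: quadratic up to delta, linear beyond it.
    def g(i):
        return i * i / 2 if i <= delta else delta * (i - delta / 2)
    return symmetric_sum(profile, g)


profile = {1: 2, 2: 1, 3: 1}            # two singletons, one pair, one triple
d = distinct_elements(profile)          # 4
capped = frequency_cap(profile, 2)      # 2*1 + 1*2 + 1*2 = 6
huber = huber_loss(profile, 2)          # 2*0.5 + 1*2.0 + 1*4.0 = 7.0
```

Replacing `profile` with an estimated profile gives an estimate of each function; how the profile error translates into function error depends on g and is analyzed in the paper, not here.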

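Proposed Method

Establish space lower bounds through a randomized reduction from the Indexing problem (IND) to profile estimation, using profile estimation to simulate Hamming distance estimation.
Use a stratified sampling strategy over geometric frequency scales (buckets) to estimate φ_i efficiently across different frequency ranges.
Apply a public-coin randomized reduction that maps IND to profile estimation, using shared randomness to simulate the communication protocol.
Construct a data stream in which element frequencies encode the Hamming distance between two strings, so that φ_1 can be used to estimate Δ(w, z).
Take a multi-scale approach with frequency buckets B_j = [2^{j-1}+1, 2^j], ensuring that the per-coordinate estimation error is bounded in at least a constant fraction of the buckets.
Rely on the law of large numbers and error-propagation analysis to control the L1 error across all coordinates, so that the overall error is at most εD or εm.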
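The link between φ_1 and Hamming distance can be illustrated with a simplified construction: each party inserts the indices of the 1-bits of its own string, so an index appears exactly once precisely when the two strings differ there. This is a sketch of the idea only, under the assumption that this encoding is used; the paper's actual reduction goes through IND with public randomness and may be built differently. The function names are hypothetical.

```python
from collections import Counter


def stream_from_bitstrings(w, z):
    """Concatenate Alice's and Bob's insertions: index j is inserted once per 1-bit."""
    if len(w) != len(z):
        raise ValueError("bit strings must have equal length")
    alice = [j for j, bit in enumerate(w) if bit == 1]
    bob = [j for j, bit in enumerate(z) if bit == 1]
    return alice + bob


def phi_1(stream):
    """Number of distinct elements that appear exactly once."""
    return sum(1 for count in Counter(stream).values() if count == 1)


def hamming_distance(w, z):
    return sum(a != b for a, b in zip(w, z))


w = [1, 0, 1, 1, 0, 0]
z = [1, 1, 0, 1, 0, 1]
stream = stream_from_bitstrings(w, z)
```

Here index j has frequency w_j + z_j, which equals 1 exactly when w_j ≠ z_j, so `phi_1(stream)` equals `hamming_distance(w, z)` (both are 3 for this input). An estimate of φ_1 therefore gives an estimate of Δ(w, z) with the same additive error.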

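Results

Research Questions

RQ1: Can we estimate many coordinates of the profile vector φ simultaneously, with small total L1 error, in less space than estimating each coordinate independently would require?
RQ2: What is the optimal space complexity of estimating the first τ entries of φ with L1 error ≤ εD?
RQ3: What is the optimal space complexity of estimating the full profile vector φ with L1 error ≤ εm?
RQ4: Can the space dependence on 1/ε be separated from the dependence on n and m in profile estimation algorithms?
RQ5: Are the proposed space bounds tight? Can matching lower bounds be proved for both error regimes?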
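To get a feel for what separating the 1/ε term from the n and m terms buys, the following sketch plugs sample values into the three asymptotic expressions quoted in the abstract. It drops all hidden constants and reads O(1/ε² log(1/ε)) as (1/ε²)·log(1/ε); both are assumptions, so the numbers indicate scaling only and are not actual memory footprints. The function names and the sample values of ε, n, and m are hypothetical.

```python
import math


def prior_bits(eps, n, m):
    # Datar-Muthukrishnan bound for a single coordinate: (1/eps^2) * (log n + log m)
    return (1 / eps**2) * (math.log2(n) + math.log2(m))


def prefix_bits(eps, n):
    # First tau coordinates, L1 error <= eps * D: 1/eps^2 + log n
    return 1 / eps**2 + math.log2(n)


def full_bits(eps, n, m):
    # Full profile, L1 error <= eps * m: (1/eps^2) log(1/eps) + log n + log log m
    return (1 / eps**2) * math.log2(1 / eps) + math.log2(n) + math.log2(math.log2(m))


eps, n, m = 0.01, 2**32, 2**40
prior = prior_bits(eps, n, m)    # about 720,000 with constants dropped
prefix = prefix_bits(eps, n)     # about 10,032
full = full_bits(eps, n, m)      # about 66,476
```

With these sample values the multiplicative form is roughly 72 times larger than the additive O(1/ε² + log n) form, because log n + log m = 72 multiplies the 1/ε² term instead of being added to it.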

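Key Findings

In the regime of L1 error ≤ εD over the first τ entries, the algorithm uses O(1/ε² + log n) bits of space and succeeds with constant probability.
For full profile estimation with total L1 error ≤ εm, the algorithm uses O(1/ε² log(1/ε) + log n + log log m) bits of space, matching the lower bound.
The paper proves an Ω(1/ε²) space lower bound for estimating φ_1 to within additive error εD, establishing optimality.
By accounting for the space cost of randomness and hash functions, the algorithm improves on prior work and yields tighter bounds for symmetric function estimation.
The method gives space-optimal estimates of symmetric functions such as the Huber loss, the Tukey loss, and frequency cap statistics, using only O(1/ε² + log n) bits of space.
The analysis shows that in at least 9/10 of the frequency buckets, the L1 error within each bucket is bounded by O(γ/ε), which yields accurate estimates of individual φ_i values with high probability.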
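The geometric buckets B_j = [2^{j-1}+1, 2^j] partition the frequencies i ≥ 2 into ranges that double in width. The sketch below maps a frequency to its bucket and aggregates a profile per bucket. Treating frequency 1 as its own bucket 0 is an assumption made here, since the formula gives an empty range for j = 0; the helper names are hypothetical, and this shows only the partition, not the paper's sampling or error analysis.

```python
def bucket_index(i):
    """Index j such that frequency i lies in B_j = [2^(j-1) + 1, 2^j].

    Frequency 1 is assigned to bucket 0 (an assumption of this sketch).
    """
    if i < 1:
        raise ValueError("frequencies are positive integers")
    return (i - 1).bit_length()


def bucket_range(j):
    """Inclusive (low, high) frequency range covered by bucket j."""
    if j == 0:
        return (1, 1)
    return (2 ** (j - 1) + 1, 2 ** j)


def bucket_profile(profile):
    """Sum phi_i over each bucket, for a profile given as {i: phi_i}."""
    totals = {}
    for i, count in profile.items():
        j = bucket_index(i)
        totals[j] = totals.get(j, 0) + count
    return totals


profile = {1: 5, 2: 3, 3: 2, 4: 1, 7: 4, 8: 1, 9: 2}
per_bucket = bucket_profile(profile)
```

For this profile, `per_bucket` is `{0: 5, 1: 3, 2: 3, 3: 5, 4: 2}`: frequencies 3 and 4 share B_2 = [3, 4], frequencies 7 and 8 share B_3 = [5, 8], and frequency 9 falls in B_4 = [9, 16]. A stream of length m needs only about log2(m) such buckets.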

This review was generated by AI and reviewed by human editors.