
[Paper Review] PrivTree: A Differentially Private Algorithm for Hierarchical Decompositions

Jun Zhang, Xiaokui Xiao|arXiv (Cornell University)|Jan 13, 2016

Cryptography and Data Security | References: 54 | Cited by: 23

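One-Sentence Summary

PrivTree is a differentially private algorithm for hierarchical data decomposition that removes the need for a predefined recursion depth: a tight analysis of the Laplace distribution lets it inject only a constant amount of noise into each split decision, which yields better data utility than prior state-of-the-art methods for the private release of spatial and sequence data.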

ABSTRACT

Given a set D of tuples defined on a domain Omega, we study differentially private algorithms for constructing a histogram over Omega to approximate the tuple distribution in D. Existing solutions for the problem mostly adopt a hierarchical decomposition approach, which recursively splits Omega into sub-domains and computes a noisy tuple count for each sub-domain, until all noisy counts are below a certain threshold. This approach, however, requires that we (i) impose a limit h on the recursion depth in the splitting of Omega and (ii) set the noise in each count to be proportional to h. This leads to inferior data utility due to the following dilemma: if we use a small h, then the resulting histogram would be too coarse-grained to provide an accurate approximation of data distribution; meanwhile, a large h would yield a fine-grained histogram, but its quality would be severely degraded by the increased amount of noise in the tuple counts. To remedy the deficiency of existing solutions, we present PrivTree, a histogram construction algorithm that also applies hierarchical decomposition but features a crucial (and somewhat surprising) improvement: when deciding whether or not to split a sub-domain, the amount of noise required in the corresponding tuple count is independent of the recursive depth. This enables PrivTree to adaptively generate high-quality histograms without even asking for a pre-defined threshold on the depth of sub-domain splitting. As concrete examples, we demonstrate an application of PrivTree in modelling spatial data, and show that it can also be extended to handle sequence data (where the decision in sub-domain splitting is not based on tuple counts but a more sophisticated measure). Our experiments on a variety of real datasets show that PrivTree significantly outperforms the states of the art in terms of data utility.

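Motivation and Goals

Resolve the fundamental dilemma in differentially private hierarchical decomposition: deeper recursion gives finer histograms but amplifies the noise in every count.
Remove the dependence on a predefined maximum recursion depth $ h $, which forces a choice between a histogram that is too coarse (small $ h $) and counts that are too noisy (large $ h $).
Develop a mechanism that builds fine-grained, accurate histograms over private data while guaranteeing differential privacy.
Extend the approach to decompositions that are not count-based, such as Markov-model-based decompositions of sequence data.
Significantly improve data utility over existing state-of-the-art methods on real-world datasets.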
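To make the dilemma concrete: when the depth is capped at $ h $, a tuple contributes to one count per level, so a common design splits the budget evenly and gives each level $ \varepsilon / h $. With sensitivity 1, the Laplace scale of each count is then $ h / \varepsilon $, and its standard deviation is $ \sqrt{2}\, h / \varepsilon $. The short Python sketch below only tabulates that growth; the function name `fixed_depth_noise` is hypothetical, and the even budget split is an assumption about the baseline rather than a detail taken from this review.

```python
import math


def fixed_depth_noise(h, epsilon):
    """Laplace scale and standard deviation per count when the budget
    epsilon is split evenly over h levels (sensitivity 1 per level).
    ASSUMPTION: even budget split; other allocations are possible."""
    scale = h / epsilon
    return scale, math.sqrt(2) * scale


# Noise grows linearly with the depth limit h.
for h in (4, 8, 16):
    scale, std = fixed_depth_noise(h, epsilon=1.0)
    print(f"h={h:2d}  scale={scale:5.1f}  std={std:5.2f}")
```

Doubling $ h $ doubles the noise in every count, which is the cost PrivTree is designed to avoid in its split decisions.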

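Proposed Method

PrivTree uses a new privacy mechanism whose tight analysis of the Laplace distribution bounds the privacy loss independently of the recursion depth.
It decides whether to split a sub-domain using a constant amount of noise, instead of noise proportional to the recursion depth $ h $.
The algorithm recursively partitions the domain into sub-domains, using noisy counts with a fixed noise scale, and satisfies $ \varepsilon $-differential privacy.
For sequence data, PrivTree integrates a Markov model and evaluates each split using the likelihood of sequence patterns rather than raw counts.
The method supports multi-dimensional spatial histogram construction and frequent pattern mining under $ \varepsilon $-differential privacy.
Its design is compatible with lattice- or grid-based models and can be extended to other decomposition tasks.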
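The split rule described above can be sketched roughly as follows. This is a minimal one-dimensional illustration, not the authors' implementation. The function names, the unit-interval domain, the threshold `theta`, and in particular the constants used for the noise scale `lam` and the per-level bias `delta` are assumptions recalled from the paper's analysis; they should be checked against the paper before any real use. The sketch builds only the decomposition. Publishing leaf counts would need additional noise and its own share of the privacy budget.

```python
import math
import random


def laplace(scale, rng):
    """Draw Laplace(0, scale) noise as a scaled difference of two Exp(1) draws."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def privtree_1d(points, epsilon, theta=0.0, fanout=2, rng=None):
    """Return sorted leaf intervals of a noisy decomposition of [0, 1).

    Assumes every point lies in [0, 1).
    """
    if rng is None:
        rng = random.Random()
    # ASSUMPTION: constants as recalled from the paper, not from this review.
    lam = (2 * fanout - 1) / (fanout - 1) / epsilon  # noise scale, no depth term
    delta = lam * math.log(fanout)                   # bias subtracted per level

    leaves = []
    stack = [(0.0, 1.0, 0, list(points))]  # (low, high, depth, points inside)
    while stack:
        lo, hi, depth, pts = stack.pop()
        # Bias the count downward with depth, but never below theta - delta.
        biased = max(len(pts) - depth * delta, theta - delta)
        if biased + laplace(lam, rng) > theta:
            # Split into `fanout` equal-width children sharing exact edges.
            width = (hi - lo) / fanout
            edges = [lo + i * width for i in range(fanout)] + [hi]
            for a, b in zip(edges, edges[1:]):
                stack.append((a, b, depth + 1, [p for p in pts if a <= p < b]))
        else:
            leaves.append((lo, hi))
    return sorted(leaves)


data_rng = random.Random(0)
data = [data_rng.random() for _ in range(200)]
intervals = privtree_1d(data, epsilon=1.0, rng=random.Random(1))
print(len(intervals), "leaf intervals")
```

No depth limit appears anywhere in the sketch: the noise scale stays fixed, and recursion stops on its own because the subtracted bias grows with depth while the counts shrink.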

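Experimental Results

Research Questions

RQ1: Can hierarchical decomposition under differential privacy be achieved without fixing the recursion depth $ h $, thereby avoiding the utility-privacy trade-off that the depth limit creates?
RQ2: Can split decisions always use a constant amount of noise, regardless of depth, while still guaranteeing differential privacy?
RQ3: How does PrivTree compare with state-of-the-art methods in data utility on spatial and sequence data?
RQ4: Can the core mechanism be extended to decompositions that are not count-based, such as sequential pattern mining?
RQ5: How does the privacy budget $ \varepsilon $ affect the accuracy of sequence reconstruction and pattern recovery?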

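Key Findings

On spatial data, PrivTree achieves significantly better data utility than state-of-the-art methods, with lower relative error on range count queries.
For sequence data release, PrivTree is more precise than N-gram and Truncate, with the advantage most pronounced at larger privacy budgets ($ \varepsilon \geq 0.2 $).
The total variation distance of the sequence length distribution produced by PrivTree is comparable to that of Truncate and far lower than that of N-gram, indicating better distributional fidelity.
The accuracy of the EM-based method degrades as $ k $ increases, whereas PrivTree performs consistently across settings.
By using a Markov model to extend the method to sequence data, PrivTree recovers truncated sequences accurately, showing robustness in pattern reconstruction.
Even on skewed real-world datasets, the algorithm remains stable and efficient, whereas heuristic methods break down.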
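For readers who want to reproduce comparisons like these, the two error measures named above can be computed along the following lines. This is a hedged sketch: the function names are hypothetical, and the `floor` term in the relative error (a smoothing constant that avoids division by very small true counts) is an assumption about how such experiments are commonly set up, not a detail stated in this review. The total variation distance follows its standard definition.

```python
def relative_error(estimate, truth, floor):
    """Relative error of a range count query.
    ASSUMPTION: the denominator is floored at `floor` to avoid blow-up
    on near-empty query regions."""
    return abs(estimate - truth) / max(truth, floor)


def total_variation(p, q):
    """Total variation distance between two discrete distributions,
    given as dicts mapping outcome (e.g. sequence length) to probability."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


true_lengths = {1: 0.5, 2: 0.5}
synthetic_lengths = {1: 0.25, 2: 0.5, 3: 0.25}
print(relative_error(estimate=90, truth=100, floor=1))
print(total_variation(true_lengths, synthetic_lengths))
```

Lower values are better for both measures. A total variation distance of 0 means the two length distributions are identical.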


This review was generated by AI and checked by a human editor.