QUICK REVIEW

[Paper Review] Differentially Private Publication of Sparse Data

Graham Cormode, Cecilia M. Procopiuc|arXiv (Cornell University)|Mar 4, 2011

Privacy-Preserving Technologies in Data | References: 19 | Cited by: 37

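One-Sentence Summary

This paper proposes a scalable method for differentially private release of sparse datasets: it generates a compact summary of the noisy data directly, avoiding the need to materialize a huge contingency table. Using filtering, priority sampling, and consistency checks, the method reduces output size by several orders of magnitude while retaining strong privacy guarantees, with query accuracy comparable to or better than naive noise injection.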

ABSTRACT

The problem of privately releasing data is to provide a version of a dataset without revealing sensitive information about the individuals who contribute to the data. The model of differential privacy allows such private release while providing strong guarantees on the output. A basic mechanism achieves differential privacy by adding noise to the frequency counts in the contingency tables (or, a subset of the count data cube) derived from the dataset. However, when the dataset is sparse in its underlying space, as is the case for most multi-attribute relations, then the effect of adding noise is to vastly increase the size of the published data: it implicitly creates a huge number of dummy data points to mask the true data, making it almost impossible to work with. We present techniques to overcome this roadblock and allow efficient private release of sparse data, while maintaining the guarantees of differential privacy. Our approach is to release a compact summary of the noisy data. Generating the noisy data and then summarizing it would still be very costly, so we show how to shortcut this step, and instead directly generate the summary from the input data, without materializing the vast intermediate noisy data. We instantiate this outline for a variety of sampling and filtering methods, and show how to use the resulting summary for approximate, private, query answering. Our experimental study shows that this is an effective, practical solution, with comparable and occasionally improved utility over the costly materialization approach.
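The blow-up described in the abstract can be illustrated with a minimal sketch. It assumes each individual contributes to exactly one cell (sensitivity 1) and uses two-sided geometric noise; the function names, the domain size of 10,000, and the value of `epsilon` are illustrative choices, not taken from the paper.

```python
import math
import random

def two_sided_geometric(epsilon, rng):
    """Integer noise with Pr[X = x] proportional to exp(-epsilon * |x|)."""
    def geometric():
        u = 1.0 - rng.random()  # uniform in (0, 1], avoids log(0)
        # Pr[G >= k] = exp(-epsilon * k) for k = 0, 1, 2, ...
        return math.floor(-math.log(u) / epsilon)
    # The difference of two i.i.d. geometric variables is two-sided geometric.
    return geometric() - geometric()

def naive_release(sparse_counts, domain_size, epsilon, rng):
    """Add noise to every cell of the domain, including the empty ones."""
    noisy = {}
    for cell in range(domain_size):  # cost is linear in the domain size
        value = sparse_counts.get(cell, 0) + two_sided_geometric(epsilon, rng)
        if value != 0:
            noisy[cell] = value
    return noisy

rng = random.Random(0)
true_counts = {17: 40, 4242: 12, 9000: 7}  # 3 non-zero cells (hypothetical data)
noisy = naive_release(true_counts, 10_000, 0.5, rng)
print(len(true_counts), "non-zero cells before,", len(noisy), "after")
```

With these settings a cell keeps a noisy count of exactly zero only about a quarter of the time, so most of the 10,000 cells become non-zero even though the input had three.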

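Motivation and Objectives

Address the scalability bottleneck in differentially private release of sparse datasets, where naive noise injection makes the full contingency table far too large.
Enable efficient private query answering over high-dimensional, low-density data without generating the full noisy table.
Develop techniques that generate a compact, privacy-preserving summary directly from the raw data, minimizing computation and storage overhead.
Improve the utility of private data release by using filtering and sampling strategies to limit noise propagation.
Evaluate how effectively consistency checks and the dyadic range representation improve range query accuracy on sparse data.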

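Proposed Method

Propose a shortcut that generates the privacy-preserving summary directly, without generating the full noisy contingency table.
Apply filtering before noise injection to remove low-value entries, reducing the impact of noise on sparse regions.
Use priority sampling to select representative entries according to their magnitude, preserving signal while minimizing output size.
Integrate consistency checks over dyadic ranges to eliminate noise in entries that were originally zero, improving accuracy on sparse, non-uniform data.
Combine filtering with priority sampling into a hybrid "filter-priority" approach that adapts to data sparsity and query patterns.
Use geometric mechanism noise as the baseline for comparison, while optimizing summary construction so that the full table is never generated.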
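One way to read the shortcut, consistent with the abstract's description of summarizing the noisy data without materializing it, is threshold filtering: keep only cells whose noisy count reaches a threshold `theta`. The sketch below is an illustration under stated assumptions, not the authors' code. It assumes two-sided geometric noise with alpha = exp(-epsilon), an integer `theta` >= 1, and an input dictionary holding only non-zero cells. Under those assumptions an empty cell passes the filter with probability alpha^theta / (1 + alpha), and its value, given that it passes, is `theta` plus a geometric variable. The names `geometric` and `filtered_summary` are hypothetical.

```python
import math
import random

def geometric(success_prob, rng):
    """Number of failures before the first success (support 0, 1, 2, ...)."""
    u = 1.0 - rng.random()  # uniform in (0, 1], avoids log(0)
    # log1p keeps precision when success_prob is tiny.
    return math.floor(math.log(u) / math.log1p(-success_prob))

def filtered_summary(sparse_counts, domain_size, epsilon, theta, rng):
    """Cells whose noisy count is >= theta, without touching every cell."""
    alpha = math.exp(-epsilon)
    summary = {}

    # Non-zero cells: add noise explicitly, then apply the filter.
    for cell, count in sparse_counts.items():
        noisy = count + geometric(1 - alpha, rng) - geometric(1 - alpha, rng)
        if noisy >= theta:
            summary[cell] = noisy

    # Empty cells: each one independently passes with probability pass_prob,
    # so jump between passing cells with geometric gaps instead of scanning.
    pass_prob = alpha ** theta / (1 + alpha)
    cell = geometric(pass_prob, rng)
    while cell < domain_size:
        if cell not in sparse_counts:  # non-zero cells were handled above
            # Conditioned on passing, the noise is theta + Geometric(1 - alpha).
            summary[cell] = theta + geometric(1 - alpha, rng)
        cell += 1 + geometric(pass_prob, rng)
    return summary

rng = random.Random(1)
summary = filtered_summary({17: 40, 4242: 12, 9000: 7}, 1_000_000, 0.5, 10, rng)
print(len(summary), "cells kept out of 1,000,000")
```

The running time here depends on the number of non-zero cells plus the number of cells that survive the filter, rather than on the size of the domain.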

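Experimental Results

Research Questions

RQ1: Can we achieve differential privacy for sparse, high-dimensional datasets without generating the full noisy contingency table?
RQ2: How does filtering low-value entries before noise injection affect the utility and accuracy of differentially private query results?
RQ3: To what extent does priority sampling improve utility compared with uniform sampling or naive noise injection?
RQ4: How effective are consistency checks at reducing noise on originally zero entries in sparse data?
RQ5: Can the dyadic range representation improve the accuracy of range queries over the compactly summarized private data?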
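For readers unfamiliar with the priority sampling that RQ3 refers to, here is a minimal sketch of the standard technique on an explicit list of positive weights. It does not reproduce the paper's shortcut for handling empty cells without materializing them, and the function name and example weights are hypothetical.

```python
import heapq
import random

def priority_sample(weights, k, rng):
    """Return {item: adjusted_weight} for a priority sample of size k.

    Each item gets priority w / u with u uniform in (0, 1]. The k items with
    the highest priorities are kept, and each kept weight is raised to at
    least tau, the (k+1)-th highest priority, so that subset sums of the
    adjusted weights are unbiased estimates of the true subset sums.
    """
    if k >= len(weights):
        return dict(weights)
    priorities = []
    for item, w in weights.items():
        u = 1.0 - rng.random()  # uniform in (0, 1], avoids division by zero
        priorities.append((w / u, item))
    top = heapq.nlargest(k + 1, priorities)
    tau = top[k][0]  # (k+1)-th largest priority
    return {item: max(weights[item], tau) for _, item in top[:k]}

rng = random.Random(2)
weights = {i: 1.0 for i in range(1000)}
weights[0] = 500.0  # one heavy item among many light ones
sample = priority_sample(weights, 50, rng)
print(len(sample), "items kept; estimated total:", round(sum(sample.values())))
```

Heavy items are kept almost surely with their exact weight, while light items are kept rarely and scaled up, which is why the technique suits skewed data.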

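Key Findings

Compared with naive noise injection, the proposed shortcut reduces output size by up to 1000x, making private release of large sparse datasets feasible.
Filter-priority sampling achieves a relative query error below 0.8% for queries covering 5% or more of the data space, with accuracy comparable to or better than the full noisy table.
On highly sparse, non-uniform data, consistency checks reduce error by 30% to 60%; on more uniform datasets the reduction is about 10%.
In query accuracy, the method outperforms probabilistic differential privacy techniques (such as the approach of Machanavajjhala et al.), with absolute error more than three times lower at similar privacy parameters.
For large range queries (covering 20% or more of the data space), priority sampling significantly outperforms the other methods, with error comparable to the full geometric mechanism but a much smaller output.
The compact summary approach keeps strong privacy guarantees while using consistency checks to eliminate spurious noise at entries whose true count is zero, which improves utility.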
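The dyadic range representation mentioned above can be sketched as follows. The example uses exact counts, with no noise, to show only the data structure: a range query is answered from a small number of aligned power-of-two blocks, and blocks missing from the summary count as zero. The consistency rule itself is not reproduced here because the review does not specify it; the function names and sample data are hypothetical.

```python
def dyadic_cover(lo, hi):
    """Split the inclusive integer range [lo, hi] into maximal dyadic blocks.

    Each block is (start, size) where size is a power of two and start is a
    multiple of size.
    """
    blocks = []
    while lo <= hi:
        # Largest power of two that divides lo (unbounded when lo == 0).
        size = lo & -lo if lo > 0 else 1 << (hi + 1).bit_length()
        # Shrink the block until it fits inside the remaining range.
        while size > hi - lo + 1:
            size >>= 1
        blocks.append((lo, size))
        lo += size
    return blocks

def build_dyadic_counts(sparse_counts, domain_bits):
    """Aggregate non-zero cells into every dyadic block that contains them."""
    counts = {}
    for cell, count in sparse_counts.items():
        for level in range(domain_bits + 1):
            block = ((cell >> level) << level, 1 << level)
            counts[block] = counts.get(block, 0) + count
    return counts

def range_count(block_counts, lo, hi):
    """Answer a range query by summing the blocks that cover [lo, hi]."""
    return sum(block_counts.get(b, 0) for b in dyadic_cover(lo, hi))

counts = build_dyadic_counts({3: 5, 9: 2, 200: 7}, domain_bits=8)
print(dyadic_cover(3, 10))        # [(3, 1), (4, 4), (8, 2), (10, 1)]
print(range_count(counts, 3, 10)) # 7
```

In a private release each block count would carry its own noise, so a query that touches few blocks accumulates less noise than one summing many individual cells.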

This review was generated by AI and reviewed by a human editor.