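[Paper Summary] Collecting and Analyzing Data from Smart Device Users with Local Differential Privacy
Harmony is a practical local differential privacy (LDP) system for collecting and analyzing multi-attribute data (numerical and categorical) from smart devices. It supports mean and frequency estimation as well as machine learning tasks (linear regression, logistic regression, SVM) with strong privacy guarantees.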
Organizations with a large user base, such as Samsung and Google, can potentially benefit from collecting and mining users' data. However, doing so raises privacy concerns, and risks accidental privacy breaches with serious consequences. Local differential privacy (LDP) techniques address this problem by only collecting randomized answers from each user, with guarantees of plausible deniability; meanwhile, the aggregator can still build accurate models and predictors by analyzing large amounts of such randomized data. So far, existing LDP solutions either have severely restricted functionality, or focus mainly on theoretical aspects such as asymptotical bounds rather than practical usability and performance. Motivated by this, we propose Harmony, a practical, accurate and efficient system for collecting and analyzing data from smart device users, while satisfying LDP. Harmony applies to multi-dimensional data containing both numerical and categorical attributes, and supports both basic statistics (e.g., mean and frequency estimates), and complex machine learning tasks (e.g., linear regression, logistic regression and SVM classification). Experiments using real data confirm Harmony's effectiveness.
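Research Motivation and Goals
- Use local differential privacy (LDP) as the foundation for privacy-preserving data collection from large user bases (e.g., Samsung).
- Develop Harmony to handle a mix of numerical and categorical attributes and to support both basic statistics and empirical risk minimization tasks.
- Provide a practical, accurate, and scalable LDP mechanism that comes with theoretical guarantees and is validated empirically on real data.
Proposed Method
- Propose Harmony, an LDP-based system that perturbs each user's tuple of numerical and categorical attributes before sending it to the aggregator.
- Develop a perturbation mechanism for numerical attributes that yields unbiased mean estimates with controlled error, addressing flaws that arise in certain cases with methods such as that of Duchi et al.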
- Introduce a simple, efficient 1-bit-per-user perturbation scheme for numeric attributes that achieves epsilon-LDP and unbiased mean estimation with improved empirical accuracy.
- Apply Bassily and Smith’s randomized projection approach for categorical attributes to estimate value frequencies (histograms) under epsilon-LDP, with adaptations to improve stability in practice.
- Extend the approach to multiple attributes by randomizing which attribute to report per user, combining numeric means and categorical frequencies in a single privacy-preserving framework.
- Demonstrate how Harmony supports empirical risk minimization tasks (linear regression, logistic regression, SVM) under LDP via stochastic gradient-based methods.
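The numeric part of the method above can be sketched in a few lines. This is a minimal illustration, not the paper's reference implementation: the function names are hypothetical, every attribute is assumed to be pre-scaled to [-1, 1], and the reporting probability is an assumption (this summary does not state it), chosen so that the single bit satisfies epsilon-LDP and the aggregate is unbiased. Each user reports one uniformly chosen attribute as a signed bit, and the aggregator rescales.

```python
import math
import random


def keep_sign_probability(t: float, eps: float) -> float:
    """Probability of reporting +1 for an input t in [-1, 1] (assumed form)."""
    e = math.exp(eps)
    return (t * (e - 1.0) + e + 1.0) / (2.0 * e + 2.0)


def perturb_tuple(values, eps, rng):
    """User side: report one randomly chosen attribute as a single signed bit.

    values: list of d numbers, each already scaled to [-1, 1].
    Returns (attribute index, +1 or -1).
    """
    j = rng.randrange(len(values))  # the choice of attribute ignores the data
    bit = 1 if rng.random() < keep_sign_probability(values[j], eps) else -1
    return j, bit


def estimate_means(reports, d, eps):
    """Aggregator side: turn collected bits into unbiased per-attribute means."""
    e = math.exp(eps)
    # Rescaling undoes both the 1/d attribute sampling and the bit randomization.
    scale = d * (e + 1.0) / (e - 1.0)
    sums = [0.0] * d
    for j, bit in reports:
        sums[j] += scale * bit
    n = len(reports)
    return [s / n for s in sums]


if __name__ == "__main__":
    rng = random.Random(0)
    users = [[0.5, -0.2, 0.0] for _ in range(100_000)]
    reports = [perturb_tuple(u, 1.0, rng) for u in users]
    # Estimates should land near the true means, up to sampling noise.
    print(estimate_means(reports, d=3, eps=1.0))
```

Under these assumptions the ratio between the probabilities of reporting +1 for inputs 1 and -1 is exactly e^epsilon, which is what bounds the privacy loss, and the expected rescaled report for an attribute equals its true value.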
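Experimental Results
Research Questions
- RQ1: Can Harmony provide accurate mean estimates for numerical attributes and reliable frequency estimates for categorical attributes in the multi-attribute setting under epsilon-LDP?
- RQ2: How can practical machine learning (linear/logistic regression, SVM) on combined numerical and categorical data be made possible under LDP?
- RQ3: What theoretical error guarantees do Harmony's perturbation mechanisms offer, and how do they compare with existing LDP methods?
- RQ4: When there are many attributes, or several categorical attributes, how does randomizing which attribute is reported affect accuracy?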
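Key Findings
- A simple 1-bit-per-user perturbation mechanism for numerical attributes achieves epsilon-LDP and unbiased mean estimation, with a provable error bound that grows with √d and √(log(d/β)).
- For categorical attributes, Harmony uses Bassily and Smith's projection method, achieving a per-value error of O(√(log(k/β))/(ε√n)), and is more stable than prior methods when k is moderate.
- With multiple attributes, Harmony's per-attribute error is roughly O(√(d log(d/β))/(ε√n)) for numerical means and O(√(d log(k/β))/(ε√n)) for categorical frequencies, holding with high probability (at least 1−β).
- With suitable perturbation and a learning pipeline, Harmony supports empirical risk minimization tasks (linear regression, logistic regression, SVM) under local differential privacy, and its practical performance is validated on real data.
- The paper identifies and corrects problems in earlier local DP mean estimation methods (notably that of Duchi et al.) and offers a robust, efficient alternative with minimal communication (1 bit per user for numerical data).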
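To make the scaling in the bounds above concrete, the following sketch evaluates the expressions inside the O(·) with the hidden constant dropped. The function names are hypothetical and the returned numbers are not actual error values; they are only useful for comparing how the bound moves as n, d, k, ε, or β change.

```python
import math


def numeric_error_scale(n, d, eps, beta):
    """Constant-free scale sqrt(d * log(d / beta)) / (eps * sqrt(n))."""
    return math.sqrt(d * math.log(d / beta)) / (eps * math.sqrt(n))


def categorical_error_scale(n, d, k, eps, beta):
    """Constant-free scale sqrt(d * log(k / beta)) / (eps * sqrt(n))."""
    return math.sqrt(d * math.log(k / beta)) / (eps * math.sqrt(n))


if __name__ == "__main__":
    base = numeric_error_scale(n=1_000_000, d=10, eps=1.0, beta=0.05)
    more_users = numeric_error_scale(n=4_000_000, d=10, eps=1.0, beta=0.05)
    # Quadrupling the number of users halves the bound.
    print(base / more_users)
```

Read this way, the bounds say that halving the error requires four times as many users or twice the privacy budget ε, and that adding attributes costs a little more than a factor of √d because of the logarithmic term.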
This summary was generated by AI and reviewed by human editors.