QUICK REVIEW

[Paper Review] Google COVID-19 Community Mobility Reports: Anonymization Process Description (version 1.1)

Ahmet Aktay, Shailesh Bavadekar|arXiv (Cornell University)|Apr 8, 2020

Data-Driven Disease Surveillance | Cited by 158

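One-Sentence Summary

This document describes the anonymization and differential privacy methods behind Google's COVID-19 Community Mobility Reports, including noise addition, contribution bounding, baseline computation, and data-reliability filters.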

ABSTRACT

This document describes the aggregation and anonymization process applied to the initial version of Google COVID-19 Community Mobility Reports (published at http://google.com/covid19/mobility on April 2, 2020), a publicly available resource intended to help public health authorities understand what has changed in response to work-from-home, shelter-in-place, and other recommended policies aimed at flattening the curve of the COVID-19 pandemic. Our anonymization process is designed to ensure that no personal data, including an individual's location, movement, or contacts, can be derived from the resulting metrics. The high-level description of the procedure is as follows: we first generate a set of anonymized metrics from the data of Google users who opted in to Location History. Then, we compute percentage changes of these metrics from a baseline based on the historical part of the anonymized metrics. We then discard a subset which does not meet our bar for statistical reliability, and release the rest publicly in a format that compares the result to the private baseline.

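Motivation and Objectives

Explain how de-identified metrics are generated from Location History data.
Describe the differential privacy mechanism and the noise scales used.
Define the data-reliability criteria and region-size constraints applied before metrics are published.
Explain how baselines are computed and how percentage changes are reported.
Discuss updates made over time to improve accuracy and manage the privacy budget.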

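Proposed Method

Add Laplace noise to the counts and durations of each metric using an open-source differential privacy library.
Bound each user's contribution to at most four (category, location) pairs per day at each geographic level.
Compute daily metrics and baseline metrics with differential privacy, then publish the percentage change relative to the baseline.
Discard metrics for regions smaller than 3 km^2 or with a noisy user count below 100.
Compute a fixed 5-week baseline from days matching the day of the week, and publish ratio metrics with an ε-based privacy guarantee.
Apply an unreliable-metrics filter that suppresses changes at risk of an error greater than ±10 percentage points.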
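The bounding, noise, and discard steps listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure from the paper: the function names (`bound_contributions`, `noisy_user_counts`, `releasable`), the random subsampling used to enforce the bound, the choice of ε, and the assumption that the set of (category, region) pairs is fixed and public are all hypothetical. A production system would rely on a vetted differential privacy library rather than hand-rolled floating-point sampling.

```python
import random
from collections import defaultdict

MAX_PAIRS_PER_DAY = 4   # per-user bound stated in the summary
MIN_NOISY_USERS = 100   # discard threshold stated in the summary
MIN_AREA_KM2 = 3.0      # discard threshold stated in the summary


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def bound_contributions(visits, rng, max_pairs=MAX_PAIRS_PER_DAY):
    """Keep at most max_pairs distinct (category, region) pairs per user.

    visits: iterable of (user, category, region) tuples for one day.
    Random subsampling is an assumption; the summary only states the bound.
    """
    per_user = defaultdict(set)
    for user, category, region in visits:
        per_user[user].add((category, region))
    bounded = {}
    for user, pairs in per_user.items():
        ordered = sorted(pairs)
        if len(ordered) > max_pairs:
            ordered = rng.sample(ordered, max_pairs)
        bounded[user] = ordered
    return bounded


def noisy_user_counts(bounded, all_pairs, epsilon, rng,
                      max_pairs=MAX_PAIRS_PER_DAY):
    """Count users per pair and add Laplace noise to every public pair.

    One user changes at most max_pairs counts by 1 each, so the L1
    sensitivity is max_pairs and the noise scale is max_pairs / epsilon.
    """
    counts = {pair: 0 for pair in all_pairs}
    for pairs in bounded.values():
        for pair in pairs:
            if pair in counts:
                counts[pair] += 1
    scale = max_pairs / epsilon
    return {pair: c + laplace_noise(scale, rng) for pair, c in counts.items()}


def releasable(noisy_count, area_km2):
    """Apply the region-size and noisy-user-count thresholds."""
    return area_km2 >= MIN_AREA_KM2 and noisy_count >= MIN_NOISY_USERS
```

To try it, pass a seeded `random.Random` instance, a list of `(user, category, region)` visits, and a list of public pairs, then check each noisy count with `releasable` together with the region's area.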

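Results

Research Questions

RQ1: How does Google publish aggregated mobility metrics while protecting individual privacy?
RQ2: What noise scales, privacy parameters, and per-user contribution bounds are used at each geographic granularity?
RQ3: How is the baseline for the reported percentage changes constructed and applied?
RQ4: What criteria determine whether a metric is reliable enough to publish?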

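Key Findings

Metrics are generated with differential privacy, using Laplace noise, at several levels of granularity: country/region, top-level subdivision, and higher-resolution regions.
Per-user contributions are bounded to reduce privacy risk, to at most 4 (category, location) pairs per day at each geographic level.
Regions smaller than 3 km^2, or with a noisy user count below 100, are discarded to protect privacy and data quality.
The baseline uses a fixed 5-week window: for each day of the week, it is the median of the DP metrics over the matching days in that window.
An unreliable-metrics filter uses 97.5% confidence intervals to suppress changes that carry a high risk of error (an overall probability above 5%).
The method is ε-differentially private for the metrics described, with δ = 0.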
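The baseline and reliability findings above can be made concrete with a short Python sketch. It is an illustration under assumptions rather than the paper's exact procedure: the function names are hypothetical, and the baseline is treated as if it carried Laplace noise of the same scale as a single daily value, which is a simplification for a median of five noisy values. The one fact the sketch relies on is that for X ~ Laplace(0, b), P(|X| > t) = exp(-t/b); with a 97.5% interval on each of the two values, the union bound gives at least 95% joint coverage, which is one way to read the "above 5% overall" wording.

```python
import math
from statistics import median


def baseline(noisy_values_same_weekday):
    """Median of the noisy values for one day of the week in the window."""
    return median(noisy_values_same_weekday)


def percent_change(noisy_today, noisy_baseline):
    """Percentage change of today's noisy value relative to the baseline."""
    return 100.0 * (noisy_today / noisy_baseline - 1.0)


def laplace_half_width(scale, confidence=0.975):
    """Half-width t such that P(|X| <= t) = confidence for Laplace(0, scale)."""
    return scale * math.log(1.0 / (1.0 - confidence))


def is_reliable(noisy_today, noisy_baseline, scale, max_error_pp=10.0):
    """Return False when the worst-case change within both intervals
    differs from the point estimate by more than max_error_pp points.

    Assumption: both values carry Laplace noise of the same scale.
    """
    t = laplace_half_width(scale)
    if noisy_baseline - t <= 0:
        return False  # the baseline interval reaches zero: ratio is unbounded
    point = percent_change(noisy_today, noisy_baseline)
    low = percent_change(noisy_today - t, noisy_baseline + t)
    high = percent_change(noisy_today + t, noisy_baseline - t)
    return max(point - low, high - point) <= max_error_pp
```

With large counts the noise barely moves the ratio and the change passes the filter; with small counts the same noise scale widens the interval well past ±10 percentage points and the change is suppressed.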

This review was generated by AI and checked by a human editor.