QUICK REVIEW

[Paper Review] Semi-supervised Knowledge Transfer for Deep Learning from Private Training Data

Nicolas Papernot, Martín Abadi|arXiv (Cornell University)|Oct 18, 2016

Privacy-Preserving Technologies in Data|Cited by 320

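One-Sentence Summary

This paper proposes PATE, a private knowledge-transfer framework in which an ensemble of teachers trained on disjoint partitions of sensitive data labels unlabeled public data to train a student model. It provides strong differential privacy guarantees while maintaining high utility through semi-supervised learning.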

ABSTRACT

Some machine learning applications involve training data that is sensitive, such as the medical histories of patients in a clinical trial. A model may inadvertently and implicitly store some of its training data; careful analysis of the model may therefore reveal sensitive information. To address this problem, we demonstrate a generally applicable approach to providing strong privacy guarantees for training data: Private Aggregation of Teacher Ensembles (PATE). The approach combines, in a black-box fashion, multiple models trained with disjoint datasets, such as records from different subsets of users. Because they rely directly on sensitive data, these models are not published, but instead used as "teachers" for a "student" model. The student learns to predict an output chosen by noisy voting among all of the teachers, and cannot directly access an individual teacher or the underlying data or parameters. The student's privacy properties can be understood both intuitively (since no single teacher and thus no single dataset dictates the student's training) and formally, in terms of differential privacy. These properties hold even if an adversary can not only query the student but also inspect its internal workings. Compared with previous work, the approach imposes only weak assumptions on how teachers are trained: it applies to any model, including non-convex models like DNNs. We achieve state-of-the-art privacy/utility trade-offs on MNIST and SVHN thanks to an improved privacy analysis and semi-supervised learning.

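Research Motivation and Goals

Provide strong privacy guarantees for the training data while still training high-utility models on sensitive data.
Develop a black-box knowledge-transfer framework that is agnostic to the underlying learning algorithm.
Reduce privacy loss by limiting the student's access to teacher knowledge and by using semi-supervised learning.
Explore a GAN-based semi-supervised variant (PATE-G) to further improve the privacy-utility trade-off.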

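Proposed Method

Partition the sensitive data into n disjoint subsets and train an independent teacher on each subset, forming the teacher ensemble.
Aggregate the teachers' predictions on unlabeled public data by adding Laplace noise to the per-class vote counts and outputting the class with the highest noisy count (a noisy plurality vote).
Train the student on the public data labeled by this noisy aggregation, together with the remaining unlabeled public data, so that knowledge is transferred while privacy is preserved.
Use a generative adversarial network (GAN) for semi-supervised learning to improve student performance when labels are scarce (PATE-G).
Use the moments accountant framework to analyze and bound the overall (ε, δ) differential privacy guarantee of the procedure.
Include a data-dependent privacy analysis that tightens the bound when the teachers' quorum is strong, that is, when the top vote count clearly exceeds the others.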
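The partition and noisy-vote steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the authors' implementation: the function names (`partition_disjoint`, `laplace_noise`, `noisy_aggregate`) and the parameter `gamma` are hypothetical, and the sketch assumes the Laplace noise added to each vote count has scale 1/gamma.

```python
import random
from collections import Counter


def partition_disjoint(records, num_teachers, rng):
    """Shuffle records and split them into disjoint shards, one per teacher."""
    shuffled = list(records)
    rng.shuffle(shuffled)
    # Striding guarantees that every record lands in exactly one shard.
    return [shuffled[i::num_teachers] for i in range(num_teachers)]


def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) sample.

    The difference of two i.i.d. exponentials with rate 1/scale
    follows a Laplace distribution with that scale.
    """
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def noisy_aggregate(votes, num_classes, gamma, rng):
    """Return the class whose noisy vote count is largest.

    votes: one predicted class label per teacher for a single query.
    gamma: assumed inverse noise scale (noise scale = 1 / gamma).
    """
    counts = Counter(votes)
    noisy_counts = [
        counts[c] + laplace_noise(1.0 / gamma, rng) for c in range(num_classes)
    ]
    return max(range(num_classes), key=noisy_counts.__getitem__)


if __name__ == "__main__":
    rng = random.Random(0)
    shards = partition_disjoint(range(1000), 250, rng)
    print(len(shards), "disjoint shards")
    # Hypothetical query: 200 of 250 teachers vote for class 3.
    teacher_votes = [3] * 200 + [1] * 50
    print(noisy_aggregate(teacher_votes, num_classes=10, gamma=0.05, rng=rng))
```

When most teachers agree, as in the hypothetical query above, the noise rarely changes the winning class; this is the intuition behind the data-dependent analysis giving tighter bounds for strong quorums.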

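Experimental Results

Research Questions

RQ1: Can a black-box ensemble of teachers trained on private data provide differential privacy guarantees for learning from sensitive data?
RQ2: How can semi-supervised learning and GANs be integrated into PATE to maximize utility while preserving privacy?
RQ3: What is the practical privacy-utility trade-off on MNIST and SVHN under PATE and PATE-G?
RQ4: How do the number of teachers and the vote gap within the quorum affect privacy loss and accuracy?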

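Key Findings

PATE provides meaningful differential privacy guarantees while maintaining high accuracy on MNIST (ε=2.04, δ=1e-5, accuracy 98.00%) and SVHN (ε=8.19, δ=1e-6, accuracy 90.66%).
With 250 teachers, the aggregated teacher predictions reach 93.18% accuracy on MNIST and 87.79% on SVHN, at a privacy cost of ε=0.05 per query.
GAN-based semi-supervised training (PATE-G) reduces the number of labeled queries required and improves the privacy-utility trade-off relative to prior methods.
PATE achieves accuracy competitive with the non-private baselines (MNIST: 99.18% non-private vs. 98.00% private; SVHN: 92.80% non-private vs. 90.66% private).
The framework is architecture-agnostic and applies to non-convex models, making it a broadly applicable strategy for privacy-preserving learning.
Results in the appendix show that PATE also preserves privacy on other data types, including medical data with random forests.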
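To see why the number of labeled queries matters for the total privacy cost, a naive baseline is basic sequential composition, where the per-query ε values simply add up. The sketch below illustrates only that baseline; it is not the moments accountant used in the paper. The function name `basic_composition` and the query count of 100 are hypothetical and chosen only for illustration.

```python
def basic_composition(eps_per_query, num_queries):
    """Worst-case total epsilon when each query is (eps, 0)-DP and costs add up."""
    if eps_per_query < 0 or num_queries < 0:
        raise ValueError("epsilon and query count must be non-negative")
    return eps_per_query * num_queries


if __name__ == "__main__":
    # Hypothetical example: 100 label queries at the per-query cost quoted above.
    print(basic_composition(0.05, 100))
```

Because this bound grows linearly with the number of queries, answering fewer queries (as PATE-G does) directly lowers the cost, and tighter accounting such as the moments accountant with a data-dependent analysis is designed to yield smaller totals than this worst case.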

This review was generated by AI and reviewed by human editors.