QUICK REVIEW

[Paper Review] Private federated learning on vertically partitioned data via entity resolution and additively homomorphic encryption

Stephen Hardy, Wilko Henecka | arXiv (Cornell University) | Nov 29, 2017

Privacy-Preserving Technologies in Data | References: 29 | Cited by: 429

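One-sentence summary

This paper presents a three-party, privacy-preserving solution for vertically partitioned data: two data providers jointly learn a logistic regression model using privacy-preserving entity resolution and additively homomorphic encryption, and the authors analyze how entity resolution errors affect learning.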

ABSTRACT

Consider two data providers, each maintaining private records of different feature sets about common entities. They aim to learn a linear model jointly in a federated setting, namely, data is local and a shared model is trained from locally computed updates. In contrast with most work on distributed learning, in this scenario (i) data is split vertically, i.e. by features, (ii) only one data provider knows the target variable and (iii) entities are not linked across the data providers. Hence, to the challenge of private learning, we add the potentially negative consequences of mistakes in entity resolution. Our contribution is twofold. First, we describe a three-party end-to-end solution in two phases ---privacy-preserving entity resolution and federated logistic regression over messages encrypted with an additively homomorphic scheme---, secure against a honest-but-curious adversary. The system allows learning without either exposing data in the clear or sharing which entities the data providers have in common. Our implementation is as accurate as a naive non-private solution that brings all data in one place, and scales to problems with millions of entities with hundreds of features. Second, we provide what is to our knowledge the first formal analysis of the impact of entity resolution's mistakes on learning, with results on how optimal classifiers, empirical losses, margins and generalisation abilities are affected. Our results bring a clear and strong support for federated learning: under reasonable assumptions on the number and magnitude of entity resolution's mistakes, it can be extremely beneficial to carry out federated learning in the setting where each peer's data provides a significant uplift to the other.

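Motivation and objectives

Enable learning on vertically partitioned data held by two parties without exposing raw data or revealing which entities the parties have in common.
Develop a privacy-preserving entity resolution protocol that aligns records across providers while keeping identifiers secret.
Achieve secure federated logistic regression using additively homomorphic encryption and a third-party coordinator.
Provide a formal analysis of how entity resolution errors affect optimal classifiers, losses, margins, and generalization.
Demonstrate scalability to datasets with millions of entities and hundreds of features while keeping accuracy close to that of a centralized, non-private solution.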

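Proposed method

Propose an end-to-end three-party pipeline, including a coordinator (C), that performs privacy-preserving entity resolution followed by secure logistic regression.
Use cryptographic longterm keys (CLKs), a Bloom-filter-based encoding of identifiers compared by Dice similarity, to link entities privately across parties.
Encrypt the learning exchanges with an additively homomorphic scheme (for example, Paillier) so that gradients and updates are computed without exposing raw data.
Adopt a Taylor-series approximation of the logistic loss (the Taylor loss) so that gradients, and the holdout loss used for early stopping, can be computed under encryption.
Apply an encrypted mask during learning so that the entity resolution result is used without disclosing private join information.
Implement secure federated SGD over vertically partitioned features (with particular attention to SAG, the stochastic average gradient method), ensuring that only encrypted exchanges are sent to the coordinator.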
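
Two of the steps above can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the bigram tokenization, the HMAC-SHA256 keyed hash, and the filter parameters (`num_bits`, `num_hashes`) are all assumptions chosen for illustration. The first part shows a Bloom-filter encoding of identifiers compared with the Dice coefficient; the second shows the second-order Taylor expansion of the logistic loss, whose gradient is linear in the score.

```python
import hashlib
import hmac
import math

# --- Part 1: Bloom-filter encoding of identifiers, compared by Dice coefficient ---

def bigrams(text):
    """Split a padded, lower-cased string into its set of character bigrams."""
    padded = f" {text.lower().strip()} "
    return {padded[i:i + 2] for i in range(len(padded) - 1)}

def encode_clk(text, secret, num_bits=1024, num_hashes=4):
    """Return the set of Bloom-filter bit positions set for `text`.

    Each bigram is hashed num_hashes times with a keyed hash (HMAC-SHA256 here,
    an illustrative choice), so only holders of `secret` produce comparable encodings.
    """
    positions = set()
    for gram in bigrams(text):
        for k in range(num_hashes):
            digest = hmac.new(secret, f"{k}:{gram}".encode(), hashlib.sha256).digest()
            positions.add(int.from_bytes(digest[:8], "big") % num_bits)
    return positions

def dice(a, b):
    """Dice coefficient between two sets of bit positions: 2|A & B| / (|A| + |B|)."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))

# --- Part 2: Taylor approximation of the logistic loss ---

def logistic_loss(margin):
    """Exact logistic loss log(1 + exp(-margin)), where margin = y * (theta . x)."""
    return math.log1p(math.exp(-margin))

def taylor_loss(margin):
    """Second-order Taylor expansion of the logistic loss around margin = 0."""
    return math.log(2.0) - 0.5 * margin + 0.125 * margin ** 2

def taylor_gradient(theta, x, y):
    """Gradient of the Taylor loss for one example with label y in {-1, +1}.

    The result is linear in theta . x, so it needs only additions and
    multiplications by plaintext values.
    """
    score = sum(t * v for t, v in zip(theta, x))
    return [(0.25 * score - 0.5 * y) * v for v in x]

if __name__ == "__main__":
    key = b"shared-secret"  # hypothetical key agreed on by the two data providers
    a = encode_clk("john smith", key)
    print(dice(a, encode_clk("jon smith", key)))   # near-duplicate: high score
    print(dice(a, encode_clk("mary jones", key)))  # different person: low score
    print(logistic_loss(0.5), taylor_loss(0.5))    # close for small margins
```

The polynomial form is what makes the Taylor loss a natural fit for an additively homomorphic scheme such as Paillier, which supports adding ciphertexts and multiplying a ciphertext by a plaintext number but not evaluating an exponential. The approximation is only tight for small margins, so the sketch should not be read as a statement about accuracy on real data.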

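Experimental results

Research questions

RQ1: How does privacy-preserving entity resolution affect the accuracy of the jointly learned model relative to a centralized, non-private solution?
RQ2: What formal bounds can be established on the deviation of the optimal classifier when entity resolution errors occur?
RQ3: Under what conditions is the learned classifier robust to entity resolution errors, particularly for large-margin examples?
RQ4: What convergence and generalization properties does secure federated logistic regression have under entity resolution errors?
RQ5: How well does the proposed system scale to datasets with millions of entities and hundreds of features while maintaining its privacy guarantees?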

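Key findings

The end-to-end system learns as accurately as a naive non-private solution that pools all the data, and it scales to large problems.
The work gives what the authors describe as, to their knowledge, the first formal analysis of how entity resolution errors affect learning, including bounds on classifier deviation and effects on empirical loss and generalization.
Under reasonable assumptions, large-margin examples remain correctly classified in the presence of entity resolution errors, which indicates robustness.
When the number of entity resolution errors is small, generalization is not significantly affected, and the empirical loss converges at a rate governed by three penalty terms (the optimal classifier, the resolution errors, and class statistics).
The results support federated learning that brings a significant gain in classification accuracy when the partners' features are complementary, which justifies privacy-preserving collaboration.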


This review was generated by AI and reviewed by a human editor.