QUICK REVIEW

[Paper Review] SecureBoost: A Lossless Federated Learning Framework

Kewei Cheng, Tao Fan|arXiv (Cornell University)|Jan 25, 2019

Privacy-Preserving Technologies in Data | References: 37 | Cited by: 178

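One-Sentence Summary

SecureBoost is a lossless, privacy-preserving gradient boosting framework for vertically partitioned data. In federated learning it reaches accuracy on par with centralized training without exposing any party's private data.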

ABSTRACT

The protection of user privacy is an important concern in machine learning, as evidenced by the rolling out of the General Data Protection Regulation (GDPR) in the European Union (EU) in May 2018. The GDPR is designed to give users more control over their personal data, which motivates us to explore machine learning frameworks for data sharing that do not violate user privacy. To meet this goal, in this paper, we propose a novel lossless privacy-preserving tree-boosting system known as SecureBoost in the setting of federated learning. SecureBoost first conducts entity alignment under a privacy-preserving protocol and then constructs boosting trees across multiple parties with a carefully designed encryption strategy. This federated learning system allows the learning process to be jointly conducted over multiple parties with common user samples but different feature sets, which corresponds to a vertically partitioned data set. An advantage of SecureBoost is that it provides the same level of accuracy as the non-privacy-preserving approach while at the same time, reveals no information of each private data provider. We show that the SecureBoost framework is as accurate as other non-federated gradient tree-boosting algorithms that require centralized data and thus it is highly scalable and practical for industrial applications such as credit risk analysis. To this end, we discuss information leakage during the protocol execution and propose ways to provably reduce it.

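Research Motivation and Goals

Define privacy-preserving machine learning over vertically partitioned data in the federated setting.
Develop a lossless gradient boosting framework that runs across multiple parties sharing common samples but holding different features.
Propose secure data alignment and encrypted gradient aggregation so that tree models can be trained without exposing private data.
Analyze information leakage and discuss ways to provably reduce it while preserving accuracy.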

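Proposed Method

Formally define the vertical federated learning problem, with an active party that holds the labels and passive parties that hold features.
Align the parties' data samples using a privacy-preserving entity alignment protocol.
Encrypt the gradient statistics (g_i, h_i) with Paillier encryption and aggregate them in encrypted form to train a shared gradient boosting model and find the optimal splits.
The active party decrypts the aggregated statistics to determine the global split, while the passive parties compute locally on encrypted data.
Store split decision information and lookup tables at the passive and active parties to enable secure prediction (inference).
Establish losslessness by showing that, under the same initialization and hyperparameters, the loss of the federated model equals that of the centralized, non-privacy-preserving model.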
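
The encrypted aggregation and split-finding steps above can be illustrated with a minimal sketch. It is not the paper's or FATE's implementation. It assumes a toy Paillier scheme built from two small, publicly known primes (insecure, for demonstration only), a single passive party holding one feature, logistic-loss gradients at an initial prediction of 0.5, and the standard XGBoost-style gain without the complexity penalty. All function and variable names are hypothetical.

```python
import math
import random

# Toy Paillier keypair from two known Mersenne primes (insecure size, demo only).
P, Q = 2**31 - 1, 2**61 - 1
N = P * Q                                           # public key
N2 = N * N
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)   # private key: lcm(p-1, q-1)
MU = pow(LAM, -1, N)                                # valid because generator g = n + 1
SCALE = 10**6                                       # fixed-point scale for real values

def encrypt(x):
    """Encrypt a real number using only the public key N."""
    m = round(x * SCALE) % N                        # negatives wrap around modulo N
    r = random.randrange(1, N)
    while math.gcd(r, N) != 1:                      # r must be a unit modulo N
        r = random.randrange(1, N)
    return (1 + m * N) * pow(r, N, N2) % N2

def decrypt(c):
    """Decrypt with the private key and undo the fixed-point encoding."""
    m = (pow(c, LAM, N2) - 1) // N * MU % N
    if m > N // 2:                                  # upper half encodes negatives
        m -= N
    return m / SCALE

def add(c1, c2):
    """Homomorphic addition: multiplying ciphertexts adds the plaintexts."""
    return c1 * c2 % N2

def passive_party_aggregate(feature, enc_g, enc_h, thresholds):
    """Passive party: sum encrypted g and h over instances left of each threshold."""
    results = []
    for t in thresholds:
        sum_g, sum_h = encrypt(0.0), encrypt(0.0)
        for x, cg, ch in zip(feature, enc_g, enc_h):
            if x <= t:
                sum_g, sum_h = add(sum_g, cg), add(sum_h, ch)
        results.append((sum_g, sum_h))
    return results

def split_gain(gl, hl, g, h, lam=1.0):
    """XGBoost-style split gain (complexity penalty omitted)."""
    gr, hr = g - gl, h - hl
    return 0.5 * (gl**2 / (hl + lam) + gr**2 / (hr + lam) - g**2 / (h + lam))

# Active party: holds the labels, computes and encrypts the gradient statistics.
labels = [1, 1, 0, 0, 1, 0]
grads = [0.5 - y for y in labels]                   # g_i = p_i - y_i with p_i = 0.5
hess = [0.25] * len(labels)                         # h_i = p_i * (1 - p_i)
enc_g = [encrypt(g) for g in grads]
enc_h = [encrypt(h) for h in hess]

# Passive party: holds one feature and sees only ciphertexts.
feature = [0.2, 0.4, 1.5, 1.9, 0.3, 1.1]
thresholds = [0.35, 1.0, 1.7]
enc_sums = passive_party_aggregate(feature, enc_g, enc_h, thresholds)

# Active party: decrypts the aggregates and picks the best split.
G, H = sum(grads), sum(hess)
federated_gains = [split_gain(decrypt(cg), decrypt(ch), G, H) for cg, ch in enc_sums]
best_threshold = thresholds[federated_gains.index(max(federated_gains))]

# Plaintext reference: what a centralized learner would compute for the same splits.
plain_gains = []
for t in thresholds:
    gl = sum(g for x, g in zip(feature, grads) if x <= t)
    hl = sum(h for x, h in zip(feature, hess) if x <= t)
    plain_gains.append(split_gain(gl, hl, G, H))
```

In this sketch the gains computed from decrypted aggregates match the plaintext gains up to fixed-point rounding, which is the intuition behind the lossless claim. In a real deployment the passive party would also hide which feature and threshold each aggregate corresponds to, and a vetted Paillier library with a proper key size would replace the toy scheme.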

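Experimental Results

Research Questions

RQ1: How can vertically partitioned data be aligned across multiple parties in a privacy-preserving way in federated learning?
RQ2: Can a gradient boosting model be trained across multiple parties in a privacy-preserving, lossless manner using encrypted gradient statistics?
RQ3: What are the intrusion and leakage characteristics during training and inference, and how can leakage be reduced without sacrificing accuracy?
RQ4: Does SecureBoost achieve accuracy comparable to centralized, non-federated gradient boosting methods?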

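Key Findings

The framework is lossless: under the same initialization and hyperparameters, SecureBoost achieves the same accuracy as the centralized, non-privacy-preserving model.
The security analysis reveals potential leakage: the active party can learn additional information about the instance space and split candidates. The reduced-leakage variant (RL-SecureBoost) lowers this leakage.
Experiments on two credit datasets (Credit 1 and Credit 2) show performance comparable to non-federated methods, and RL-SecureBoost maintains accuracy while reducing leakage.
The scalability analysis shows convergence curves similar to those of GBDT and XGBoost, with running time scaling roughly linearly with tree depth and data size.
The framework is practical for industrial tasks such as credit risk analysis and has been implemented in the FATE project for federated learning.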

This review was generated by AI and reviewed by human editors.