QUICK REVIEW

[Paper Review] The Tradeoff Between Privacy and Accuracy in Anomaly Detection Using Federated XGBoost

Mengwei Yang, Linqi Song | arXiv (Cornell University) | Jul 16, 2019

Privacy-Preserving Technologies in Data | References: 28 | Cited by: 23

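One-Sentence Summary

This paper proposes a horizontal federated XGBoost framework for anomaly detection that balances privacy protection against model accuracy through data aggregation and sparse federated updates. By grouping user data into virtual samples and focusing updates on misclassified samples, the method improves F1-score by up to 5% and AUC by 3.4% over state-of-the-art methods, and the cluster size can be tuned to trade privacy (larger clusters give more privacy) against performance.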

ABSTRACT

Privacy has raised considerable concerns recently, especially with the advent of information explosion and numerous data mining techniques to explore the information inside large volumes of data. In this context, a new distributed learning paradigm termed federated learning becomes prominent recently to tackle the privacy issues in distributed learning, where only learning models will be transmitted from the distributed nodes to servers without revealing users' own data and hence protecting the privacy of users. In this paper, we propose a horizontal federated XGBoost algorithm to solve the federated anomaly detection problem, where the anomaly detection aims to identify abnormalities from extremely unbalanced datasets and can be considered as a special classification problem. Our proposed federated XGBoost algorithm incorporates data aggregation and sparse federated update processes to balance the tradeoff between privacy and learning performance. In particular, we introduce the virtual data sample by aggregating a group of users' data together at a single distributed node. We compute parameters based on these virtual data samples in the local nodes and aggregate the learning model in the central server. In the learning model upgrading process, we focus more on the wrongly classified data before in the virtual sample and hence to generate sparse learning model parameters. By carefully controlling the size of these groups of samples, we can achieve a tradeoff between privacy and learning performance. Our experimental results show the effectiveness of our proposed scheme by comparing with existing state-of-the-arts.

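Research Motivation and Objectives

Address the privacy risks of distributed anomaly detection on sensitive, imbalanced datasets.
Design a federated learning framework that protects user data privacy while maintaining high detection accuracy.
Explore the tradeoff in federated XGBoost between the privacy gained from data clustering and model performance.
Improve learning efficiency and reduce communication cost through sparse model updates that focus on misclassified samples.
Validate the proposed framework in real-world fraud detection scenarios, particularly under highly imbalanced data.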

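Proposed Method

User data are grouped under a modified k-anonymity mechanism to perform data aggregation, creating virtual data samples that preserve privacy while still supporting split-gain computation.
The framework computes split gains from the aggregated feature sequences of the virtual samples, avoiding direct transmission of raw user data.
A two-step process, data aggregation followed by federated model updates, enables training without exposing individual data.
Sparse federated updates prioritize the gradients of misclassified samples, reducing communication overhead and speeding up convergence.
Model parameters are updated centrally from the aggregated gradients of the virtual samples, keeping the model consistent across nodes.
The cluster size used in virtual data aggregation is adjusted to control the tradeoff between privacy (larger clusters give more privacy) and accuracy (smaller clusters give higher accuracy).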
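
The steps listed above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the mean-based aggregation of consecutive rows, the rule that a virtual sample is anomalous if any member is, and the choice to zero out gradients of correctly classified virtual samples are all assumptions made for illustration. Only the logistic-loss gradient and Hessian and the regularized split-gain formula follow the standard XGBoost formulation.

```python
import math

def make_virtual_samples(rows, labels, cluster_size):
    """Average each consecutive group of `cluster_size` rows into one virtual sample."""
    virtual_x, virtual_y = [], []
    for start in range(0, len(rows), cluster_size):
        group = rows[start:start + cluster_size]
        group_labels = labels[start:start + cluster_size]
        virtual_x.append([sum(col) / len(group) for col in zip(*group)])
        # Assumption: a virtual sample counts as anomalous if any member is.
        virtual_y.append(1 if any(group_labels) else 0)
    return virtual_x, virtual_y

def grad_hess(scores, labels):
    """Gradient and Hessian of the logistic loss with respect to the raw score."""
    out = []
    for s, y in zip(scores, labels):
        p = 1.0 / (1.0 + math.exp(-s))
        out.append((p - y, p * (1.0 - p)))
    return out

def sparsify(scores, labels, gh):
    """Keep (g, h) only for virtual samples the current model misclassifies."""
    sparse = []
    for s, y, pair in zip(scores, labels, gh):
        predicted = 1 if s > 0 else 0  # equivalent to sigmoid(s) > 0.5
        sparse.append(pair if predicted != y else (0.0, 0.0))
    return sparse

def node_split_stats(gh, go_left):
    """Local node: sum gradients and Hessians on each side of a candidate split."""
    gl = sum(g for (g, _), left in zip(gh, go_left) if left)
    hl = sum(h for (_, h), left in zip(gh, go_left) if left)
    gr = sum(g for (g, _), left in zip(gh, go_left) if not left)
    hr = sum(h for (_, h), left in zip(gh, go_left) if not left)
    return gl, hl, gr, hr

def server_split_gain(node_stats, lam=1.0, gamma=0.0):
    """Central server: add the per-node sums, then apply the XGBoost gain formula."""
    gl, hl, gr, hr = (sum(v) for v in zip(*node_stats))
    def score(g, h):
        return g * g / (h + lam)
    return 0.5 * (score(gl, hl) + score(gr, hr) - score(gl + gr, hl + hr)) - gamma

# Toy data (hypothetical): four users, two features, one anomaly.
rows = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
labels = [0, 0, 0, 1]
vx, vy = make_virtual_samples(rows, labels, cluster_size=2)
scores = [0.0] * len(vx)  # initial model: raw score 0 for every virtual sample
gh = sparsify(scores, vy, grad_hess(scores, vy))
stats = node_split_stats(gh, [x[0] < 4.0 for x in vx])
print(vx, vy, server_split_gain([stats]))
```

In this sketch only the summed statistics leave a node, and a larger `cluster_size` blends more users into each virtual sample, which is the knob the digest describes for trading privacy against accuracy.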

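Experimental Results

Research Questions

RQ1: How can user data privacy be protected in federated anomaly detection without compromising model accuracy?
RQ2: How does the virtual data cluster size affect the tradeoff between privacy and detection performance?
RQ3: Do sparse federated updates that focus on misclassified samples improve learning efficiency and model accuracy?
RQ4: How does the proposed federated XGBoost framework compare with existing state-of-the-art methods for anomaly detection on imbalanced datasets?
RQ5: To what extent does the two-step process of data aggregation and model updating improve privacy protection while maintaining model performance?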

Key Findings

The proposed federated XGBoost framework achieves an F1-score of 0.9014 at the original data dimension, up to 5% higher than GBDT and random forest.
At a virtual cluster size of 405, the F1-score falls to 0.8951, indicating a quantifiable tradeoff between privacy and accuracy.
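The updated federated XGBoost model reaches an AUC of 0.9748 at the original dimension, a 3.4% improvement over the baseline.
AUPRC results show consistent improvement after the update, with higher precision and recall on both the training and test sets.
The training loss of the federated XGBoost framework decreases faster than that of GBDT, indicating faster convergence.
The model maintains high accuracy (0.9997) across all configurations, confirming that accuracy alone is insufficient for evaluating performance on imbalanced datasets.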
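
The last finding, that accuracy alone says little on imbalanced data, can be checked with a small Python sketch. The class counts and the two hypothetical classifiers below are illustrative assumptions and are not taken from the paper's dataset or results.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 marks an anomaly."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)

def f1_score(y_true, y_pred):
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Hypothetical data: 17 anomalies among 10,000 samples.
y_true = [1] * 17 + [0] * 9983

# Classifier A never flags anything.
always_normal = [0] * 10000

# Classifier B finds 10 of the 17 anomalies and raises 2 false alarms.
partial = [1] * 10 + [0] * 7 + [1] * 2 + [0] * 9981

for name, pred in [("always normal", always_normal), ("partial detector", partial)]:
    print(name, round(accuracy(y_true, pred), 4), round(f1_score(y_true, pred), 4))
```

Both classifiers score above 0.99 on accuracy, yet the first has an F1-score of zero because it never detects an anomaly, which is why the digest reports F1, AUC, and AUPRC alongside accuracy.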

This review was generated by AI and checked by human editors.