QUICK REVIEW

[Paper Review] Multi-hop Federated Private Data Augmentation with Sample Compression

Eunjeong Jeong, Seungeun Oh | arXiv (Cornell University) | Jul 15, 2019

Privacy-Preserving Technologies in Data | References: 13 | Cited by: 21

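One-Sentence Summary

This paper proposes multi-hop federated augmentation with sample compression (MultFAug), a privacy-preserving data augmentation framework for on-device machine learning. The framework uses multi-hop relaying and sample compression to reduce communication latency and strengthen data privacy. By compressing seed samples and relaying them through intermediate devices, MultFAug improves label privacy and communication efficiency while maintaining high model accuracy, provided the number of hops and the compression rate are tuned.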

ABSTRACT

On-device machine learning (ML) has brought about the accessibility to a tremendous amount of data from the users while keeping their local data private instead of storing it in a central entity. However, for privacy guarantee, it is inevitable at each device to compensate for the quality of data or learning performance, especially when it has a non-IID training dataset. In this paper, we propose a data augmentation framework using a generative model: multi-hop federated augmentation with sample compression (MultFAug). A multi-hop protocol speeds up the end-to-end over-the-air transmission of seed samples by enhancing the transport capacity. The relaying devices guarantee stronger privacy preservation as well since the origin of each seed sample is hidden in those participants. For further privatization on the individual sample level, the devices compress their data samples. The devices sparsify their data samples prior to transmissions to reduce the sample size, which impacts the communication payload. This preprocessing also strengthens the privacy of each sample, which corresponds to the input perturbation for preserving sample privacy. The numerical evaluations show that the proposed framework significantly improves privacy guarantee, transmission delay, and local training performance with adjustment to the number of hops and compression rate.

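Research Motivation and Objectives

Address the challenges of on-device machine learning when local data are non-IID, scarce, and sensitive.
Reduce communication overhead and uplink latency in federated data augmentation while keeping strong privacy guarantees.
Strengthen label privacy by relaying seed samples over multiple hops, which hides each device's data distribution.
Provide sample-level privacy by removing random bits from each sample before transmission, that is, through data compression.
Jointly optimize the number of hops (M) and the compression rate (ρ) to balance communication efficiency, privacy, and model performance.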

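Proposed Method

Devices use a multi-hop protocol to relay seed samples through intermediate devices, which shortens the distance of each hop and thereby lowers end-to-end transmission latency.
Each device compresses its seed samples by randomly dropping bits (at compression rate ρ), which reduces the communication payload and strengthens sample-level privacy through input perturbation.
To protect label privacy, each device inserts fake label indications into its public data distribution indicator (SDI), hiding its true private SDI from direct observation.
The edge server collects the oversampled seed samples from multiple devices and uses these compressed, multi-hop-relayed samples to train the generator of a conditional generative adversarial network (cGAN).
Each device downloads the trained cGAN generator and uses it locally for data augmentation to improve on-device model training.
The system jointly optimizes the number of hops (M) and the compression rate (ρ) to balance latency, privacy, and model accuracy.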
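
The two device-side privacy steps above (sample compression and fake label indications) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names compress_sample and public_sdi are hypothetical, compression is modeled as zeroing each entry independently with probability ρ (a stand-in for dropping bits), the SDI is assumed to be a 0/1 list with one entry per class, and fake labels are assumed to be drawn uniformly from the classes not already indicated.

```python
import random

def compress_sample(pixels, rho, rng):
    """Zero out each entry independently with probability rho.

    Assumption: this sparsification rule stands in for the random
    bit removal described in the summary.
    """
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must be in [0, 1]")
    return [0 if rng.random() < rho else p for p in pixels]

def public_sdi(private_labels, num_classes, num_dummy, rng):
    """Build a 0/1 indicator per class: true labels plus fake ones.

    Assumption: the fake labels are sampled uniformly from the classes
    that the private SDI does not already indicate.
    """
    true_set = set(private_labels)
    others = [c for c in range(num_classes) if c not in true_set]
    fake = rng.sample(others, min(num_dummy, len(others)))
    shown = true_set | set(fake)
    return [1 if c in shown else 0 for c in range(num_classes)]

rng = random.Random(0)                      # fixed seed for repeatability
seed = [255, 128, 64, 32, 16, 8, 4, 2]      # made-up seed sample values
compressed = compress_sample(seed, rho=0.5, rng=rng)
sdi = public_sdi(private_labels=[3, 7], num_classes=10, num_dummy=2, rng=rng)
```

In this sketch an observer of sdi sees four indicated classes and cannot tell which two are real, while compressed keeps the original length with a random subset of entries zeroed.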

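Experimental Results

Research Questions

RQ1: How does multi-hop communication affect end-to-end latency and privacy in federated data augmentation?
RQ2: What effect does sample compression have on communication overhead and sample-level privacy in on-device learning?
RQ3: How does the number of hops affect label privacy and the quality of the trained generator?
RQ4: In terms of F1 score and sample quality, what is the best trade-off between the compression rate (ρ) and generator performance?
RQ5: How do latency and label privacy constraints jointly affect the test accuracy of the local model in the proposed framework?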

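Key Findings

With a latency deadline of τ=25, test accuracy peaks at 2–3 hops, where uplink latency is minimized; larger hop counts reduce accuracy under this strict deadline.
Label privacy improves as the number of hops increases and peaks at M=4 under the strict deadline (τ=25), indicating that an optimal hop count exists for maximizing privacy.
At a compression rate of ρ=0.15, the cGAN generator fails to produce augmented samples for the digits 0, 1, 2, and 6, indicating that heavy compression degrades generator performance.
As the compression rate ρ increases, the F1 score of the trained generator falls because the training samples become noisier, while sample privacy improves.
With a longer latency deadline (τ), a larger number of hops and more collected seed samples yield better test accuracy regardless of the protocol used.
The framework matches single-hop FAug in test accuracy while offering lower latency and stronger privacy, particularly once the number of hops and the compression rate are optimized.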
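
Why a small hop count can minimize uplink latency may be easier to see with a toy calculation. The sketch below is not the paper's channel model: it assumes equal-length hops sent one after another, a simple power-law path loss, and the Shannon rate per hop, and every default value (raw_bits, total_dist, bandwidth, snr_at_1m, alpha) is made up for illustration. The function names end_to_end_delay and best_hops are hypothetical.

```python
import math

def end_to_end_delay(hops, rho, raw_bits=1e6, total_dist=100.0,
                     bandwidth=1e6, snr_at_1m=1e6, alpha=3.0):
    """Toy delay (seconds) of relaying one compressed sample over equal hops."""
    if hops < 1:
        raise ValueError("hops must be >= 1")
    payload = (1.0 - rho) * raw_bits          # bits left after compression
    hop_dist = total_dist / hops              # shorter hops give higher SNR
    snr = snr_at_1m * hop_dist ** (-alpha)    # assumed power-law path loss
    rate = bandwidth * math.log2(1.0 + snr)   # Shannon rate per hop (bit/s)
    return hops * payload / rate              # hops transmit sequentially

def best_hops(max_hops, rho=0.0):
    """Hop count in 1..max_hops with the smallest toy delay."""
    return min(range(1, max_hops + 1),
               key=lambda m: end_to_end_delay(m, rho))

if __name__ == "__main__":
    for m in range(1, 6):
        print(m, round(end_to_end_delay(m, rho=0.15), 4))
    print("best:", best_hops(8, rho=0.15))
```

Under these assumptions, adding hops raises the per-hop rate only logarithmically while the number of sequential transmissions grows linearly, so the toy delay has an interior minimum at a small hop count. The compression rate ρ scales the payload, and hence the delay, by (1 - ρ) without moving that minimum in this simplified model.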

This review was generated by AI and reviewed by human editors.