QUICK REVIEW

[Paper Review] CodedPrivateML: A Fast and Privacy-Preserving Framework for Distributed Machine Learning

Jinhyun So, Başak Güler|arXiv (Cornell University)|Feb 2, 2019

Privacy-Preserving Technologies in Data|44 citations

TL;DR

CodedPrivateML provides information-theoretic privacy for training data in distributed ML while enabling efficient parallelization; it uses quantization and Lagrange coding with polynomial approximation to achieve convergence and privacy against colluding workers.

ABSTRACT

How to train a machine learning model while keeping the data private and secure? We present CodedPrivateML, a fast and scalable approach to this critical problem. CodedPrivateML keeps both the data and the model information-theoretically private, while allowing efficient parallelization of training across distributed workers. We characterize CodedPrivateML's privacy threshold and prove its convergence for logistic (and linear) regression. Furthermore, via extensive experiments on Amazon EC2, we demonstrate that CodedPrivateML provides significant speedup over cryptographic approaches based on multi-party computing (MPC).

Motivation & Objective

Protect the privacy of the training dataset against up to T colluding workers using information-theoretic guarantees.
Enable fast distributed training by effectively parallelizing gradient computations across N workers.
Develop an encoding/quantization scheme based on Lagrange coding to reduce communication and computation overhead.
Ensure convergence of logistic (and linear) regression despite non-polynomial sigmoid operations via polynomial approximation.
Provide a theoretical trade-off analysis between privacy level (T) and parallelization gains.

Proposed method

Quantize the dataset and weights to a finite field via stochastic quantization and two-step secret sharing.
Encode quantized data and weights with Lagrange coding to enable privacy against T colluding workers and distribute workload.
Approximate the sigmoid with a degree-r polynomial to fit into polynomial-based computations.
Compute gradients using an unbiased estimator \bar{s} built from r independent quantizations, which ensures convergence.
Decode the aggregated gradient at the master using polynomial interpolation and convert back to the real domain for weight updates.
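The quantization and sigmoid-approximation steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the prime `P`, the scaling by `2**scale_bits`, the wrap-around mapping of negative integers into the field, the degree-1 fit, and the interval `[-4, 4]` are all assumptions chosen for the example, and every function name is hypothetical.

```python
import math
import random

P = 2_147_483_647  # illustrative prime field size (assumption, not from the paper)

def stochastic_round(x, rng):
    """Round x down or up at random so that the expected value equals x."""
    lo = math.floor(x)
    return lo + (1 if rng.random() < x - lo else 0)

def to_field(z, p=P):
    """Map a signed integer into F_p; negative values wrap to the upper half."""
    return z % p

def from_field(v, p=P):
    """Inverse of to_field, valid when the original integer satisfied |z| < p/2."""
    return v if v <= (p - 1) // 2 else v - p

def quantize(x, scale_bits, rng, p=P):
    """Scale a real number, round it stochastically, and embed it in F_p."""
    return to_field(stochastic_round(x * 2**scale_bits, rng), p)

def dequantize(v, scale_bits, p=P):
    """Convert a field element back to the real domain."""
    return from_field(v, p) / 2**scale_bits

def fit_linear_sigmoid(half_width=4.0, samples=201):
    """Least-squares fit of c0 + c1*z to the sigmoid on a symmetric grid."""
    zs = [-half_width + 2 * half_width * i / (samples - 1) for i in range(samples)]
    ss = [1.0 / (1.0 + math.exp(-z)) for z in zs]
    z_mean = sum(zs) / samples
    s_mean = sum(ss) / samples
    c1 = (sum((z - z_mean) * (s - s_mean) for z, s in zip(zs, ss))
          / sum((z - z_mean) ** 2 for z in zs))
    c0 = s_mean - c1 * z_mean
    return c0, c1

rng = random.Random(0)
v = quantize(-1.37, 10, rng)   # a field element in [0, P)
x_back = dequantize(v, 10)     # within 2**-10 of -1.37
c0, c1 = fit_linear_sigmoid()  # degree-1 stand-in for the sigmoid
```

Stochastic rounding is used here instead of deterministic rounding because its error averages to zero, which is what makes an unbiased gradient estimate possible after quantization.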

Experimental results

Research questions

RQ1: How can we train ML models on private data in a distributed setting with information-theoretic privacy against colluding workers?
RQ2: Can the training procedure converge to the optimum for logistic and linear regression under quantization and polynomial approximation?
RQ3: What is the trade-off between privacy (T) and parallelization (N, K) in CodedPrivateML?
RQ4: How does CodedPrivateML compare to MPC-based privacy-preserving approaches in terms of speed and accuracy?
RQ5: What are the necessary conditions (e.g., recovery threshold) for successful gradient decoding in the presence of straggler workers?

Key findings

CodedPrivateML guarantees convergence to the optimal loss for logistic regression under the proposed quantization and polynomial approximation scheme.
It provides information-theoretic privacy against up to T colluding workers while enabling parallelization across N workers.
The method achieves significant speedup over MPC baselines in experiments on Amazon EC2 with up to 50 workers.
Experiments on CIFAR-10 and GISETTE demonstrate comparable accuracy with substantially faster training times than MPC-based approaches.
There is a trade-off between privacy level (T) and parallelization gain: additional workers can be used either to tolerate more colluding workers or to reduce per-worker computation.
The approach encodes data and weights so that coded computations mirror the same structure as uncoded computations, preserving correctness of gradient evaluation.
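To make the last two findings concrete, here is a toy sketch of Lagrange coding over a prime field. It assumes scalar data blocks, a degree-2 worker computation (squaring), and evaluation points chosen as consecutive integers; none of these choices come from the paper, and the helper names `lagrange_eval`, `encode`, and `decode` are hypothetical. The point is only that workers apply the same function to coded shares as they would to raw data, and that the master recovers the true results by interpolation.

```python
import random

P = 2_147_483_647  # prime modulus 2**31 - 1, chosen for illustration

def lagrange_eval(xs, ys, x, p=P):
    """Evaluate at x the unique polynomial through the points (xs[i], ys[i]) over F_p."""
    total = 0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        num, den = 1, 1
        for j, xj in enumerate(xs):
            if j != i:
                num = num * (x - xj) % p
                den = den * (xi - xj) % p
        total = (total + yi * num * pow(den, -1, p)) % p
    return total

def encode(blocks, T, N, rng, p=P):
    """Hide K blocks behind T uniformly random masks and produce N coded shares."""
    K = len(blocks)
    betas = list(range(1, K + T + 1))               # points carrying data and masks
    alphas = list(range(K + T + 1, K + T + N + 1))  # worker points, disjoint from betas
    masks = [rng.randrange(p) for _ in range(T)]
    values = list(blocks) + masks
    shares = [lagrange_eval(betas, values, a, p) for a in alphas]
    return betas, alphas, shares

def decode(alphas, results, betas, K, p=P):
    """Interpolate the workers' results and read off f(block_k) at each data point."""
    return [lagrange_eval(alphas, results, b, p) for b in betas[:K]]

rng = random.Random(0)
blocks = [3, 5, 7]       # K = 3 toy data blocks
T, N, degree = 1, 9, 2   # one mask, nine workers, f(x) = x^2
betas, alphas, shares = encode(blocks, T, N, rng)
results = [s * s % P for s in shares]         # each worker squares its own share
needed = degree * (len(blocks) + T - 1) + 1   # results required to interpolate
decoded = decode(alphas[:needed], results[:needed], betas, len(blocks))
print(decoded)  # [9, 25, 49]
```

In this sketch 7 of the 9 results suffice, so two stragglers can be ignored. The count `degree * (K + T - 1) + 1` also shows the trade-off: for a fixed N, raising T leaves room for fewer data blocks K, so each worker must process a larger share of the dataset.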


This review was created by AI and reviewed by human editors.