QUICK REVIEW

[Paper Review] Encrypted statistical machine learning: new privacy preserving methods

Louis J. M. Aslett, Pedro M. Esperança | arXiv (Cornell University) | Aug 27, 2015

Privacy-Preserving Technologies in Data | 15 references | 52 citations

TL;DR

This paper proposes two privacy-preserving machine learning methods, encrypted extremely random forests and encrypted naïve Bayes, that use fully homomorphic encryption (FHE) to learn and predict on encrypted data without decrypting it. The authors introduce a cryptographic stochastic fraction estimator for the random forests and a semi-parametric naïve Bayes model that uses logistic regression for the class decision boundary. On UCI datasets the methods perform competitively, and decrypted results match the same computations run on unencrypted data. A 100-tree forest was trained in 1h36m on 1,152 cores at a cost of $23.86.

ABSTRACT

We present two new statistical machine learning methods designed to learn on fully homomorphic encrypted (FHE) data. The introduction of FHE schemes following Gentry (2009) opens up the prospect of privacy preserving statistical machine learning analysis and modelling of encrypted data without compromising security constraints. We propose tailored algorithms for applying extremely random forests, involving a new cryptographic stochastic fraction estimator, and naïve Bayes, involving a semi-parametric model for the class decision boundary, and show how they can be used to learn and predict from encrypted data. We demonstrate that these techniques perform competitively on a variety of classification data sets and provide detailed information about the computational practicalities of these and other FHE methods.

Motivation & Objective

To enable end-to-end encrypted machine learning for statistical models without multi-party computation.
To address the practical limitations of fully homomorphic encryption (FHE) in real-world machine learning applications.
To develop tailored algorithms that preserve model accuracy while operating entirely on encrypted data.
To demonstrate computational feasibility and performance of FHE-based learning on large-scale data using cloud infrastructure.
To provide an open-source R implementation for reproducible and accessible privacy-preserving machine learning.

Proposed method

Proposes a cryptographic stochastic fraction estimator to approximate voting in extremely random forests under FHE, enabling secure tree construction.
Develops a semi-parametric naïve Bayes model that uses logistic regression to define class decision boundaries, compatible with homomorphic operations.
Adapts the original random forest and naïve Bayes algorithms to operate solely on encrypted data using homomorphic encryption primitives.
Employs homomorphic encryption to perform all operations—training, prediction, and model combination—without decryption.
Utilizes a distributed, embarrassingly parallel architecture on Amazon EC2 using spot instances to scale training across 1,152 CPU cores.
Designs a job dispatch system using Amazon SQS and S3 to coordinate encrypted computation across geographically dispersed nodes without inter-node communication.
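
The restriction that shapes these methods is that FHE schemes evaluate only additions and multiplications, so comparisons and divisions on encrypted values are unavailable. The following plaintext Python sketch illustrates how a randomly split tree can be fitted under that restriction. It assumes that features are pre-binned and one-hot encoded and that split features and split points are drawn without looking at the data; the function names and the encoding are hypothetical, not taken from the authors' R implementation. No encryption is performed here, and the stochastic fraction estimator is not covered. The point is only that every operation on data values is a +, - or *, which a homomorphic scheme could in principle evaluate on ciphertexts.

```python
import random

def one_hot(index, size):
    """Encode an integer in [0, size) as a 0/1 indicator vector."""
    return [1 if k == index else 0 for k in range(size)]

def grow(X, Y, member, depth, n_bins, n_classes, rng):
    """Grow one randomly split tree using only +, -, * on data values.

    X[i][j] is the one-hot bin vector of feature j for observation i,
    Y[i] is the one-hot class vector of observation i,
    member[i] is a 0/1 indicator that observation i reaches this node.
    """
    if depth == 0:
        # Leaf: class counts are sums of products of indicators.
        return [sum(m * y[c] for m, y in zip(member, Y))
                for c in range(n_classes)]
    j = rng.randrange(len(X[0]))   # split feature, chosen independently of the data
    s = rng.randrange(1, n_bins)   # split point, also chosen independently of the data
    # 1 if the observation's bin is below s; computed with additions only.
    goes_left = [sum(x[j][:s]) for x in X]
    left = [m * g for m, g in zip(member, goes_left)]
    right = [m * (1 - g) for m, g in zip(member, goes_left)]
    return {
        "feature": j,
        "split": s,
        "left": grow(X, Y, left, depth - 1, n_bins, n_classes, rng),
        "right": grow(X, Y, right, depth - 1, n_bins, n_classes, rng),
    }

def leaves(node):
    """Collect the class-count vectors of all leaves, left to right."""
    if isinstance(node, list):
        return [node]
    return leaves(node["left"]) + leaves(node["right"])
```

Because the tree structure never depends on the data values, each tree can be fitted independently, which is consistent with the embarrassingly parallel dispatch described above.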

Experimental results

Research questions

RQ1: Can extremely random forests be adapted to operate entirely on encrypted data using fully homomorphic encryption?
RQ2: Can a semi-parametric naïve Bayes model be constructed to support homomorphic computation of decision boundaries?
RQ3: How does the performance of encrypted machine learning models compare to their unencrypted counterparts on standard benchmark datasets?
RQ4: What are the practical computational costs and scalability characteristics of FHE-based machine learning on cloud infrastructure?
RQ5: Can encrypted models be combined homomorphically to produce a single, unified model without decryption?

Key findings

The encrypted extremely random forest and naïve Bayes models achieved classification performance competitive with their unencrypted counterparts on multiple UCI datasets.
The encrypted model results were bit-for-bit identical to unencrypted computations when decrypted, confirming correctness of the homomorphic implementation.
A 100-tree random forest was trained in 1 hour and 36 minutes using 1,152 CPU cores across two cloud regions, costing $23.86 via Amazon EC2 spot instances.
The final encrypted forest of 100 trees required only 868 MB of storage, compared to 15.6 GB for 36 separate 50-tree forests, which substantially reduces long-term storage requirements.
The method supports full model fitting and prediction in encrypted form, eliminating the need for multi-party computation or secure communication channels.
The approach scales well because it relies only on homomorphic addition and multiplication, the operations FHE schemes support natively, and because the work is embarrassingly parallel across CPU cores.
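
To put the reported figures in perspective, the short Python calculation below derives per-core-hour and per-tree quantities from the numbers above. It assumes that all 1,152 cores were busy for the full 1 hour 36 minutes, which the review does not state, so the derived values are rough estimates rather than results from the paper.

```python
# Figures quoted in the key findings above.
CORES = 1152
WALL_HOURS = 1 + 36 / 60      # 1 hour 36 minutes
COST_USD = 23.86
TREES = 100
FOREST_MB = 868

# Assumption: every core was in use for the whole wall-clock time.
core_hours = CORES * WALL_HOURS
usd_per_core_hour = COST_USD / core_hours
core_hours_per_tree = core_hours / TREES
usd_per_tree = COST_USD / TREES
mb_per_tree = FOREST_MB / TREES

print(f"core-hours:          {core_hours:.1f}")
print(f"USD per core-hour:   {usd_per_core_hour:.4f}")
print(f"core-hours per tree: {core_hours_per_tree:.2f}")
print(f"USD per tree:        {usd_per_tree:.4f}")
print(f"MB per tree:         {mb_per_tree:.2f}")
```

Under that assumption, each encrypted tree costs on the order of 18 core-hours to fit, which illustrates why the parallel spot-instance setup matters for practicality.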


This review was created by AI and reviewed by human editors.