QUICK REVIEW

[Paper Review] Certified Defenses for Data Poisoning Attacks

Jacob Steinhardt, Pang Wei Koh | arXiv (Cornell University) | Jun 9, 2017

Adversarial Robustness in Machine Learning | 50 references | 83 citations

TL;DR

The paper introduces a framework for certifying defenses against data poisoning: it derives approximate upper bounds on the worst-case loss of defenses that perform outlier removal followed by empirical risk minimization, and pairs each bound with a candidate attack that often nearly matches it.

ABSTRACT

Machine learning systems trained on user-provided data are susceptible to data poisoning attacks, whereby malicious users inject false training data with the aim of corrupting the learned model. While recent work has proposed a number of attacks and defenses, little is understood about the worst-case loss of a defense in the face of a determined attacker. We address this by constructing approximate upper bounds on the loss across a broad family of attacks, for defenders that first perform outlier removal followed by empirical risk minimization. Our approximation relies on two assumptions: (1) that the dataset is large enough for statistical concentration between train and test error to hold, and (2) that outliers within the clean (non-poisoned) data do not have a strong effect on the model. Our bound comes paired with a candidate attack that often nearly matches the upper bound, giving us a powerful tool for quickly assessing defenses on a given dataset. Empirically, we find that even under a simple defense, the MNIST-1-7 and Dogfish datasets are resilient to attack, while in contrast the IMDB sentiment dataset can be driven from 12% to 23% test error by adding only 3% poisoned data.
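The bound described in the abstract can be written compactly. The following is a sketch in notation assumed for this review rather than quoted from the paper: D_c is the clean training set of size n, D_p is a poisoned set of size εn restricted to the defender's feasible set F, ℓ is the per-example loss, and L(θ; D) is the sum of ℓ over D. The last line follows because each of the εn poisoned points contributes at most the worst feasible loss, and because a max-min is never larger than the corresponding min-max.

```latex
\begin{align*}
M &= \max_{\mathcal{D}_p \subseteq \mathcal{F},\ |\mathcal{D}_p| = \epsilon n}\ \min_{\theta}\ \frac{1}{n}\, L(\theta;\ \mathcal{D}_c \cup \mathcal{D}_p), \\
U(\theta) &= \frac{1}{n}\, L(\theta;\ \mathcal{D}_c) + \epsilon \max_{(x,y) \in \mathcal{F}} \ell(\theta;\ x, y), \\
M &\le \min_{\theta}\ U(\theta).
\end{align*}
```

Under the two assumptions listed in the abstract, M approximately upper-bounds the attacker's achievable test loss, so any θ with small U(θ) serves as a certificate for the defense on that dataset.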

Motivation & Objective

Motivate the need to understand defense robustness under worst-case data poisoning.
Propose a framework to bound the worst-case loss for a class of sanitization defenses.
Develop an efficient online-learning method to compute minimax bounds and generate candidate attacks.
Differentiate fixed (data-independent) and data-dependent defenses to analyze vulnerability.
Demonstrate the framework empirically on image and text datasets to reveal dataset-dependent resilience.

Proposed method

Consider a prediction task with a margin-based loss and a causative data poisoning attack model.
Use data sanitization defenses that remove outliers via a feasible set F and train on the remaining data.
Derive approximate upper bounds on max attack loss using three approximations relating train/test loss and inliers.
Apply online learning to compute the minimax loss M and produce a candidate attack set Dp.
Extend to data-dependent defenses by relaxing to a distribution over Dp and solving a relaxed max problem.
Specify two instantiations: oracle (true class centroids) vs. empirical centroids, and illustrate via Sphere and Slab defenses.
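The steps above can be sketched in code. The following is a minimal illustration under stated assumptions, not the authors' implementation: labels are in {-1, +1}, the loss is the hinge loss without bias or regularization, a single radius and slab width are shared by both classes, the bound is computed for the sphere constraint only, and plain subgradient descent stands in for the paper's online-learning procedure. All function and parameter names are hypothetical.

```python
import numpy as np

def class_centroids(X, y):
    """Per-class feature means; labels are assumed to be in {-1, +1}."""
    return {c: X[y == c].mean(axis=0) for c in (-1, 1)}

def sphere_slab_mask(X, y, centroids, radius, slab_width):
    """Boolean mask of points that lie inside both the sphere and the slab."""
    mu_own = np.stack([centroids[int(c)] for c in y])
    mu_other = np.stack([centroids[-int(c)] for c in y])
    # Sphere: stay within `radius` of the own-class centroid.
    in_sphere = np.linalg.norm(X - mu_own, axis=1) <= radius
    # Slab: bounded offset along the direction between the two centroids.
    direction = mu_own - mu_other
    in_slab = np.abs(np.sum((X - mu_own) * direction, axis=1)) <= slab_width
    return in_sphere & in_slab

def worst_hinge_point_in_sphere(theta, centroids, radius):
    """Feasible (x, y) with the largest hinge loss max(0, 1 - y * theta.x)."""
    norm = max(np.linalg.norm(theta), 1e-12)
    best = None
    for c in (-1, 1):
        # Moving from the centroid against c * theta shrinks the margin fastest.
        x = centroids[c] - c * radius * theta / norm
        loss = max(0.0, 1.0 - c * float(theta @ x))
        if best is None or loss > best[2]:
            best = (x, c, loss)
    return best

def upper_bound_sphere(X, y, centroids, radius, eps, steps=200, lr=0.1):
    """Minimize U(theta) = mean clean hinge loss + eps * worst feasible hinge loss.

    Returns the smallest U seen (an upper bound on the minimax loss) and the
    worst-case points visited, a pool from which a candidate attack can be drawn.
    """
    theta = np.zeros(X.shape[1])
    best_u, pool = np.inf, []
    for t in range(1, steps + 1):
        x_adv, y_adv, adv_loss = worst_hinge_point_in_sphere(theta, centroids, radius)
        margins = y * (X @ theta)
        u = float(np.mean(np.maximum(0.0, 1.0 - margins))) + eps * adv_loss
        best_u = min(best_u, u)
        pool.append((x_adv, y_adv))
        # Subgradient of U: clean hinge terms plus the current worst-case point.
        active = margins < 1.0
        grad = -(X[active] * y[active][:, None]).sum(axis=0) / len(X)
        if adv_loss > 0:
            grad = grad - eps * y_adv * x_adv
        theta = theta - lr / np.sqrt(t) * grad
    return best_u, pool
```

Passing centroids computed from clean data corresponds to the oracle setting. In the data-dependent setting the centroids would be recomputed from the poisoned training set, which is what lets an attacker shift the feasible set itself.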

Experimental results

Research questions

RQ1: What is the worst-case test loss defenders can incur under data poisoning when using outlier removal followed by empirical risk minimization?
RQ2: How can we compute tight upper bounds and construct attacker strategies for fixed vs. data-dependent outlier defenses?
RQ3: How does dataset structure (e.g., dimensionality and feature relevance) affect defensibility against poisoning attacks?
RQ4: What is the gap between oracle-based resilience and data-dependent defenses in practice?
RQ5: Can online-learning-based methods certify resilience and generate near-optimal poisoning strategies?

Key findings

Oracle sphere/slab defenses yield small certified bounds (e.g., upper bound under 0.1) on MNIST-1-7 and Dogfish even with up to 30% poisoned data.
On the IMDB sentiment dataset, an attacker can drive test error from 12% to 23% by adding only 3% poisoned data under the same defense, showing that resilience is dataset-dependent.
Data-dependent defenses can be substantially weaker: on MNIST-1-7 and Dogfish, attacks are far more effective under empirical-centroid defenses, with test loss rising significantly at 30% poisoning.
At small poisoning fractions (≤5%), MNIST-1-7 and Dogfish remain resilient, but larger fractions allow the attacker to subvert outlier removal.
On text data, IMDB is notably vulnerable even under the oracle defenses, and Enron is also attackable when attacks must satisfy integrity constraints.
Attack strategies derived from the minimax framework closely track the upper bounds in several experiments, validating the approach.

This review was created by AI and reviewed by human editors.