QUICK REVIEW

[Paper Review] High Accuracy and High Fidelity Extraction of Neural Networks

Matthew Jagielski, Nicholas Carlini | arXiv (Cornell University) | Sep 3, 2019
Adversarial Robustness in Machine Learning | 57 references | 55 citations
TL;DR

The paper taxonomizes model extraction around two objectives, accuracy and fidelity. It shows that learning-based attacks make accuracy extraction more query-efficient, introduces the first practical functionally-equivalent attack that extracts a model's weights directly, and demonstrates that extraction is feasible against large production-grade models.

ABSTRACT

In a model extraction attack, an adversary steals a copy of a remotely deployed machine learning model, given oracle prediction access. We taxonomize model extraction attacks around two objectives: *accuracy*, i.e., performing well on the underlying learning task, and *fidelity*, i.e., matching the predictions of the remote victim classifier on any input. To extract a high-accuracy model, we develop a learning-based attack exploiting the victim to supervise the training of an extracted model. Through analytical and empirical arguments, we then explain the inherent limitations that prevent any learning-based strategy from extracting a truly high-fidelity model---i.e., extracting a functionally-equivalent model whose predictions are identical to those of the victim model on all possible inputs. Addressing these limitations, we expand on prior work to develop the first practical functionally-equivalent extraction attack for direct extraction (i.e., without training) of a model's weights. We perform experiments both on academic datasets and a state-of-the-art image classifier trained with 1 billion proprietary images. In addition to broadening the scope of model extraction research, our work demonstrates the practicality of model extraction attacks against production-grade systems.

Motivation & Objective

  • Motivate and define two adversarial objectives in model extraction: accuracy and fidelity.
  • Systematize existing extraction attacks within a two-dimensional objective space.
  • Demonstrate limitations of learning-based extraction for achieving high fidelity.
  • Develop practical functionally-equivalent extraction for direct weight recovery.
  • Showcase attacks on both academic datasets and a state-of-the-art production classifier.
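The two objectives can be told apart with a few lines of code. The following is a minimal sketch, assuming both metrics are measured as label agreement over a fixed test set; the function names and the toy labels are hypothetical and not taken from the paper.

```python
import numpy as np

def accuracy(pred_labels, true_labels):
    """Fraction of inputs on which the extracted model matches the ground truth."""
    return float(np.mean(np.asarray(pred_labels) == np.asarray(true_labels)))

def fidelity(pred_labels, victim_labels):
    """Fraction of inputs on which the extracted model agrees with the victim."""
    return float(np.mean(np.asarray(pred_labels) == np.asarray(victim_labels)))

# Toy labels for six test inputs (made up for illustration).
true_labels   = np.array([0, 1, 1, 0, 2, 2])
victim_labels = np.array([0, 1, 0, 0, 2, 1])  # the victim errs on inputs 2 and 5
stolen_labels = np.array([0, 1, 1, 0, 2, 2])  # the copy is right everywhere

acc = accuracy(stolen_labels, true_labels)
fid = fidelity(stolen_labels, victim_labels)
print(f"accuracy={acc:.3f} fidelity={fid:.3f}")
```

In this toy case the extracted model is perfectly accurate yet disagrees with the victim wherever the victim is wrong, which is why the two objectives have to be tracked separately.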

Proposed method

  • Build a taxonomy of attack objectives and capabilities (accuracy, fidelity, functionally-equivalent extraction).
  • Develop a learning-based extraction attack that uses the victim model as a labeling oracle to maximize task accuracy.
  • Argue, analytically and empirically, that learning-based strategies are inherently limited for high-fidelity extraction.
  • Propose a practical functionally-equivalent extraction attack that recovers a two-layer network’s weights from input-output access.
  • Evaluate the attacks on an ImageNet-scale model (WSL) and on standard datasets (SVHN, CIFAR-10).
  • Explore semi-supervised techniques (rotation loss, MixMatch) to improve query efficiency.
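The learning-based attack in the list above can be sketched as a three-step loop: query the victim on unlabeled data, train a surrogate on the returned labels, then measure agreement. This is only an illustration under strong assumptions: the victim is a hypothetical hidden linear classifier (`victim_oracle`, `_w_secret`), and the surrogate is plain logistic regression rather than the neural networks used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical victim: a hidden linear classifier behind a query-only API.
_w_secret = rng.normal(size=5)

def victim_oracle(x):
    """Return hard labels only, as a prediction API might."""
    return (x @ _w_secret > 0).astype(float)

# Step 1: the adversary labels its own unlabeled data by querying the victim.
x_query = rng.normal(size=(2000, 5))
y_query = victim_oracle(x_query)

# Step 2: train a surrogate (logistic regression, gradient descent) on those labels.
w = np.zeros(5)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(x_query @ w)))
    w -= 0.5 * x_query.T @ (p - y_query) / len(x_query)

# Step 3: measure fidelity, i.e. label agreement with the victim on fresh inputs.
x_test = rng.normal(size=(5000, 5))
stolen_labels = (x_test @ w > 0).astype(float)
fid = float(np.mean(stolen_labels == victim_oracle(x_test)))
print(f"fidelity on fresh inputs: {fid:.3f}")
```

The surrogate only ever sees the victim's outputs, never its parameters, so agreement on every possible input is not guaranteed. That gap is what motivates the direct weight-extraction attack.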

Experimental results

Research questions

  • RQ1: Can model extraction reach functionally equivalent fidelity under realistic query-access constraints?
  • RQ2: How do learning-based extractions compare to fidelity-focused extractions in terms of query efficiency and scalability?
  • RQ3: What are the fundamental limits of learning-based extraction for high fidelity, and can direct weight recovery be achieved without data-side channels?
  • RQ4: How do unlabeled data and semi-supervised techniques affect the practicality of extraction attacks on large models?
  • RQ5: Do production-grade models trained on massive proprietary data remain vulnerable to practical extraction under black-box access?

Key findings

  • Learning-based extraction reaches high accuracy with fewer queries than prior methods and scales to models with millions of parameters.
  • Unlabeled data and semi-supervised techniques (rotation loss, MixMatch) significantly improve extraction performance with fewer queries.
  • Functionally-equivalent extraction attacks are practical for direct recovery of a two-layer network’s weights using only input-output access.
  • Learning-based approaches face inherent fidelity limitations: in experiments, fidelity caps at roughly 93% under controlled non-determinism.
  • With as few as 250 queries, MixMatch-based extraction nearly matches the oracle's accuracy on SVHN and CIFAR-10.
  • The work demonstrates the practicality of model extraction against production-grade systems and provides theoretical bounds on extraction hardness.


This review was created by AI and reviewed by human editors.