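[Paper Review] Privacy for Free: How does Dataset Condensation Help Privacy?
This paper shows that dataset condensation (DC) not only accelerates training but also offers privacy benefits. It connects DC to differential privacy and proves that privacy leakage is limited when the synthetic data are condensed from a much larger raw dataset.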
To prevent unintentional data leakage, research community has resorted to data generators that can produce differentially private data for model training. However, for the sake of the data privacy, existing solutions suffer from either expensive training cost or poor generalization performance. Therefore, we raise the question whether training efficiency and privacy can be achieved simultaneously. In this work, we for the first time identify that dataset condensation (DC) which is originally designed for improving training efficiency is also a better solution to replace the traditional data generators for private data generation, thus providing privacy for free. To demonstrate the privacy benefit of DC, we build a connection between DC and differential privacy, and theoretically prove on linear feature extractors (and then extended to non-linear feature extractors) that the existence of one sample has limited impact ($O(m/n)$) on the parameter distribution of networks trained on $m$ samples synthesized from $n (n \gg m)$ raw samples by DC. We also empirically validate the visual privacy and membership privacy of DC-synthesized data by launching both the loss-based and the state-of-the-art likelihood-based membership inference attacks. We envision this work as a milestone for data-efficient and privacy-preserving machine learning.
Research Motivation and Objectives
- Motivate the use of dataset condensation (DC) as a data-efficient alternative to DP-based data generators for private data generation.
- Theoretically connect DC to differential privacy and characterize privacy loss under DC-based data synthesis.
- Empirically evaluate membership privacy and visual privacy of DC-synthesized data against MIA attacks on image datasets.
Proposed Method
- Analyze the relationship between DC-synthesized data and original data using propositions on linear and non-linear extractors.
- Prove that removing a single original sample changes the parameter distribution of models trained on DC-synthesized data by O(m/n).
- Relate the DC privacy bound to a DP-like framework via model parameter distribution and empirical DP budget estimation.
- Empirically assess loss-based MIA and LiRA against models trained on DC-synthesized data, and evaluate visual privacy via similarity metrics.
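The 1/n dependence in the bound above can be illustrated with a small numerical sketch. It assumes, purely for illustration, that condensation targets the mean feature embedding of the raw data and that every feature vector has norm at most B; neither the helper names nor the constant B come from the paper. Under those assumptions, dropping one of n samples moves the mean by at most 2B/(n-1). This is an intuition only and does not reproduce the paper's O(m/n) result on the parameter distribution.

```python
import numpy as np

def leave_one_out_mean_shifts(features: np.ndarray) -> np.ndarray:
    """Return ||mean(all rows) - mean(all rows except k)|| for every row k."""
    n = features.shape[0]
    full_mean = features.mean(axis=0)
    # Mean of the remaining n-1 rows after dropping row k, for all k at once.
    loo_means = (n * full_mean - features) / (n - 1)
    return np.linalg.norm(loo_means - full_mean, axis=1)

def clip_to_ball(features: np.ndarray, radius: float) -> np.ndarray:
    """Scale rows so that each has L2 norm at most `radius`."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    return features / np.maximum(1.0, norms / radius)

B = 1.0  # assumed bound on the feature norm (hypothetical)
rng = np.random.default_rng(0)
results = {}
for n in (100, 1_000, 10_000):
    feats = clip_to_ball(rng.normal(size=(n, 16)), B)
    worst_shift = leave_one_out_mean_shifts(feats).max()
    bound = 2 * B / (n - 1)
    results[n] = (worst_shift, bound)
    print(f"n={n:>6}  worst shift={worst_shift:.2e}  bound 2B/(n-1)={bound:.2e}")
```

The worst-case shift shrinks roughly tenfold each time n grows tenfold, which is the behavior that makes a large raw dataset favorable for privacy.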
Experimental Results
Research Questions
- RQ1: How does dataset condensation affect membership privacy when training on DC-synthesized data?
- RQ2: Can we theoretically bound the privacy loss introduced by DC (linear and non-linear extractors) in terms of the original and synthetic dataset sizes?
- RQ3: Do DC-synthesized data reduce adversary success in loss-based MIA and LiRA compared to DP-generators and GANs?
- RQ4: Is the visual privacy of DC-synthesized data preserved against direct matching attacks?
- RQ5: How does DC compare to DP-based data generators in terms of privacy-utility trade-offs for image-based tasks?
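For readers unfamiliar with the loss-based membership inference attack mentioned in RQ1 and RQ3, the following is a minimal sketch of the general idea, not the paper's evaluation code. The attacker guesses "member" when a sample's loss falls below a threshold, and the threshold is calibrated on known non-members to hit a target false positive rate. The loss values here are synthetic draws from exponential distributions, chosen only to contrast a leaky model with a well-protected one; the function name is hypothetical.

```python
import numpy as np

def loss_threshold_attack(member_losses, nonmember_losses, target_fpr):
    """Guess 'member' when loss < threshold; return (tpr, fpr, threshold)."""
    member_losses = np.asarray(member_losses)
    nonmember_losses = np.asarray(nonmember_losses)
    # Calibrate the threshold so that about target_fpr of non-members fall below it.
    threshold = np.quantile(nonmember_losses, target_fpr)
    tpr = float(np.mean(member_losses < threshold))
    fpr = float(np.mean(nonmember_losses < threshold))
    return tpr, fpr, threshold

rng = np.random.default_rng(0)
nonmember = rng.exponential(scale=1.0, size=5000)

# Leaky model: training members have clearly lower loss than non-members.
leaky_members = rng.exponential(scale=0.2, size=5000)
# Well-protected model: member losses look like non-member losses.
private_members = rng.exponential(scale=1.0, size=5000)

tpr_leaky, fpr_leaky, _ = loss_threshold_attack(leaky_members, nonmember, 0.01)
tpr_private, fpr_private, _ = loss_threshold_attack(private_members, nonmember, 0.01)
print(f"leaky:   TPR={tpr_leaky:.3f} at FPR={fpr_leaky:.3f}")
print(f"private: TPR={tpr_private:.3f} at FPR={fpr_private:.3f}")
```

An attack is considered successful when its true positive rate is well above its false positive rate; a model that protects membership keeps the two close. LiRA refines this idea with per-sample likelihood ratios and is not reproduced here.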
Key Findings
- Models trained on DC-synthesized data achieve strong privacy protection, with empirical DP budget ε̂ around 2 against LiRA-based MIA.
- DC-synthesized data can preserve data efficiency and membership privacy while enabling higher accuracy than DP-generators under similar privacy budgets.
- DC methods reduce training data requirements by up to 50% compared to GAN-based methods while speeding up training by at least 2×.
- Theoretical results show that removing one sample from the original data changes the parameter distribution of a model trained on m DC-synthesized samples by only O(m/n), where the m samples are condensed from n raw samples (n ≫ m).
- DC-synthesized data are not perceptually similar to the original data and cannot be matched back to the originals using LPIPS or simple similarity metrics.
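An empirical budget such as ε̂ is typically derived from an attack's true and false positive rates. The sketch below uses the standard hypothesis-testing view of (ε, δ)-differential privacy, under which any membership attack must satisfy TPR ≤ e^ε · FPR + δ and the symmetric condition on the error rates. Whether the paper uses exactly this estimator is not stated in this review, and the rates in the example are hypothetical, not results from the paper.

```python
import math

def empirical_epsilon(tpr: float, fpr: float, delta: float = 0.0) -> float:
    """Lower bound on epsilon implied by an attack's (TPR, FPR) at a given delta."""
    if not (0.0 < fpr < 1.0 and 0.0 < tpr < 1.0):
        raise ValueError("tpr and fpr must lie strictly between 0 and 1")
    fnr = 1.0 - tpr
    candidates = [0.0]  # epsilon is never negative
    # From FNR + e^eps * FPR >= 1 - delta.
    if tpr - delta > 0.0:
        candidates.append(math.log((tpr - delta) / fpr))
    # From FPR + e^eps * FNR >= 1 - delta.
    if 1.0 - delta - fpr > 0.0:
        candidates.append(math.log((1.0 - delta - fpr) / fnr))
    return max(candidates)

# Hypothetical attack result: 0.74% TPR at 0.1% FPR.
eps_hat = empirical_epsilon(tpr=0.0074, fpr=0.001)
print(f"empirical epsilon lower bound: {eps_hat:.2f}")
```

Because the estimate comes from one specific attack, it is a lower bound on the true privacy loss: a stronger attack could reveal a larger ε.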
This review was generated by AI and checked by a human editor.