QUICK REVIEW

[論文レビュー] Membership Inference Attacks against Synthetic Data through Overfitting Detection

Boris van Breugel, Hao Sun|arXiv (Cornell University)|Feb 24, 2023

Privacy-Preserving Technologies in Data被引用数 25

ひとこと要約

本論文は DOMIAS を紹介する。これは参照データセットを利用して局所的過剰適合を検出し、プライバシーリスクを評価する合成データに対する密度ベースのメンバーシップ推定攻撃である。

ABSTRACT

Data is the foundation of most science. Unfortunately, sharing data can be obstructed by the risk of violating data privacy, impeding research in fields like healthcare. Synthetic data is a potential solution. It aims to generate data that has the same distribution as the original data, but that does not disclose information about individuals. Membership Inference Attacks (MIAs) are a common privacy attack, in which the attacker attempts to determine whether a particular real sample was used for training of the model. Previous works that propose MIAs against generative models either display low performance -- giving the false impression that data is highly private -- or need to assume access to internal generative model parameters -- a relatively low-risk scenario, as the data publisher often only releases synthetic data, not the model. In this work we argue for a realistic MIA setting that assumes the attacker has some knowledge of the underlying data distribution. We propose DOMIAS, a density-based MIA model that aims to infer membership by targeting local overfitting of the generative model. Experimentally we show that DOMIAS is significantly more successful at MIA than previous work, especially at attacking uncommon samples. The latter is disconcerting since these samples may correspond to underrepresented groups. We also demonstrate how DOMIAS' MIA performance score provides an interpretable metric for privacy, giving data publishers a new tool for achieving the desired privacy-utility trade-off in their synthetic data.

研究の動機と目的

攻撃者が基礎データ分布のある程度の知識を持つ合成データにおける現実的な MIA 設定を動機づける。
参照分布と比較した生成データ密度と実データ密度を用いて局所的過剰適合を検出する DOMIAS を提案する。
DOMIAS が従来の MIAs を上回ることを示す。特に代表性の低い（密度が低い）サンプルで顕著。
DOMIAS がプライバシー配慮型の合成データ生成と評価を導く方法を示す。

提案手法

DOMIAS を A_DOMIAS(x*) = f(p_G(x*) / p_R(x*)) と定義し、真のデータ密度で生成データ密度を重み付ける。
参照データセット D_ref を用いて p_R(X) を近似し、p_G(X) の密度推定器を用いて MIA スコアを計算する。
密度比ベースのスコアが可逆なデータ表現に対して不変であることを主張・証明する（定理1）。
p_G のみに依存する従来の MIAs と DOMIAS を比較し、D_ref へのアクセスがより強力な攻撃を生むことを示す。
さまざまな生成モデル（TVAE、GAN系）とベースラインで、データセット（例: California Housing、Heart Failure）に対して実証的に DOMIAS を評価する。

(a) Generative distribution in original space

実験結果

リサーチクエスチョン

RQ1密度比ベースの MIA は、参照分布を取り入れることで従来のブラックボックス MIA を上回ることができるか。
RQ2p_G/p_R を用いて局所的過剰適合を狙うことは、特に代表性の低いグループを表す低密度領域でメンバーシップの検出を改善するか。
RQ3異なる生成モデルとデータセットサイズで DOMIAS はどのように性能を示すか、またプライバシーと有用性のトレードオフはどうなるか。
RQ4DOMIAS はプライバシー保護型の合成データ生成を導く実用的な指標として機能するか。

主な発見

DOMIAS はデータセットとモデルを問わず、MIA の精度で一貫してベースラインを上回る。
参照分布（Eq. 2）を用いると、従来の Eq. 1 ベースの攻撃よりも大きな利得をもたらす。
DOMIAS は合成データにおける少数派/過小表現グループへの攻撃により効果的である。
攻撃スコアはプライバシーリスクを解釈可能な指標として機能し、プライバシー-有用性のトレードオフの意思決定を支援する。
明確なプライバシー-有用性のパレート前線がある。データ品質が高いほどモデルによっては MIA の脆弱性が高まることがある。

(b) Distribution in log-transformed space

より良い研究を、今すぐ始めましょう

論文設計から論文執筆まで、研究時間を劇的に削減しましょう。

クレジットカード登録不要

このレビューはAIが作成し、人間の編集者が確認しました。