QUICK REVIEW

[Paper Review] Membership Inference Attacks against Language Models via Neighbourhood Comparison

Justus Mattern, Fatemehsadat Mireshghallah|arXiv (Cornell University)|May 29, 2023

Adversarial Robustness in Machine Learning | Cited by 4

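One-line summary

This paper proposes neighbourhood membership inference attacks for language models. By comparing the loss of the original sample with the losses of synthetically generated, semantically similar neighbour texts, the attack removes the need for a reference model trained on in-domain data. It is competitive with reference-based attacks that have perfect knowledge of the training data distribution, and it outperforms existing reference-free attacks and reference-based attacks with imperfect reference data, which shows that neighbourhood comparison is a robust alternative to reference-based methods under realistic threat models.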

ABSTRACT

Membership Inference attacks (MIAs) aim to predict whether a data sample was present in the training data of a machine learning model or not, and are widely used for assessing the privacy risks of language models. Most existing attacks rely on the observation that models tend to assign higher probabilities to their training samples than non-training points. However, simple thresholding of the model score in isolation tends to lead to high false-positive rates as it does not account for the intrinsic complexity of a sample. Recent work has demonstrated that reference-based attacks which compare model scores to those obtained from a reference model trained on similar data can substantially improve the performance of MIAs. However, in order to train reference models, attacks of this kind make the strong and arguably unrealistic assumption that an adversary has access to samples closely resembling the original training data. Therefore, we investigate their performance in more realistic scenarios and find that they are highly fragile in relation to the data distribution used to train reference models. To investigate whether this fragility provides a layer of safety, we propose and evaluate neighbourhood attacks, which compare model scores for a given sample to scores of synthetically generated neighbour texts and therefore eliminate the need for access to the training data distribution. We show that, in addition to being competitive with reference-based attacks that have perfect knowledge about the training data distribution, our attack clearly outperforms existing reference-free attacks as well as reference-based attacks with imperfect knowledge, which demonstrates the need for a reevaluation of the threat model of adversarial attacks.

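Motivation and objectives

Address the unrealistic assumption, made by reference-based membership inference attacks, that the adversary has access to a high-quality reference model trained on in-domain data.
Investigate how fragile reference-based attacks are when the reference data distribution differs from the target model's training data.
Design a reference-free membership inference attack that maintains high performance even without access to the training data distribution.
Show that neighbourhood comparison, using neighbour samples produced by data augmentation, effectively calibrates model scores for membership inference.
Show that neighbourhood-based approaches are more robust and practical than reference-based methods in privacy-sensitive settings, and re-evaluate the threat model of membership inference attacks.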
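The three attack families contrasted in the abstract can be written side by side as decision rules. The notation here is an assumption made for illustration and is not taken from this review: $f_\theta$ is the target model, $f_\phi$ a reference model, $\mathcal{L}$ the per-sample loss, $\tilde{x}_1, \dots, \tilde{x}_n$ the generated neighbours of $x$, and $\gamma$ a threshold. In each case the sample is predicted to be a training member when the inequality holds.

```latex
\begin{align*}
\text{threshold only:}  \quad & \mathcal{L}(f_\theta, x) < \gamma \\
\text{reference-based:} \quad & \mathcal{L}(f_\theta, x) - \mathcal{L}(f_\phi, x) < \gamma \\
\text{neighbourhood:}   \quad & \mathcal{L}(f_\theta, x) - \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}(f_\theta, \tilde{x}_i) < \gamma
\end{align*}
```

Read this way, the second and third rules differ only in what calibrates the raw loss: a separate model evaluated on the same text, or the same model evaluated on nearby texts.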

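Proposed method

Generate semantically similar neighbour texts of the target input by applying word replacements with a masked language model.
Compute the loss of the original sample and of each neighbour text under the target language model.
Compare the loss of the original sample with the average loss of its neighbours to determine membership status.
Using a learned threshold γ, classify the sample as belonging to the training data when its loss is markedly lower than the average neighbour loss.
Use a neighbourhood-based calibration mechanism that accounts for intrinsic sample complexity without relying on an external reference model.
Train and evaluate the attack on multiple language model architectures and datasets, and compare its performance with baseline reference-based and reference-free attacks.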
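The comparison step in the list above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `neighbourhood_score`, `is_member`, and `toy_losses` are hypothetical names, the target model is replaced by a lookup table of made-up losses, the neighbours are supplied by hand rather than generated with a masked language model, and the value of `gamma` is arbitrary.

```python
from statistics import mean
from typing import Callable, Sequence


def neighbourhood_score(
    loss_fn: Callable[[str], float],
    sample: str,
    neighbours: Sequence[str],
) -> float:
    """Return the sample's loss minus the mean loss of its neighbours."""
    if not neighbours:
        raise ValueError("at least one neighbour is required")
    return loss_fn(sample) - mean(loss_fn(n) for n in neighbours)


def is_member(score: float, gamma: float) -> bool:
    """Predict membership when the calibrated score falls below gamma."""
    return score < gamma


# Assumption: a toy stand-in for the target model's per-text loss.
toy_losses = {
    "the cat sat on the mat": 1.0,  # the sample under test
    "the cat sat on the rug": 3.0,  # hand-written neighbours
    "the dog sat on the mat": 3.5,
    "a cat sat on the mat": 2.5,
}
toy_loss_fn = toy_losses.__getitem__

sample = "the cat sat on the mat"
neighbours = [text for text in toy_losses if text != sample]

score = neighbourhood_score(toy_loss_fn, sample, neighbours)
prediction = is_member(score, gamma=-1.0)
print(score, prediction)  # -2.0 True
```

Passing the loss as a callable keeps the sketch independent of any particular model library; in a real setting `loss_fn` would wrap the target model's per-sample loss and `neighbours` would come from the word-replacement step.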

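Experimental results

Research questions

RQ1: How much does the performance of reference-based membership inference attacks degrade when the reference model is trained on a distribution that differs from the target model's training data?
RQ2: Is neighbourhood comparison with synthetically generated neighbours a practical alternative to reference models for membership inference attacks?
RQ3: How does the proposed neighbourhood attack compare with reference-based attacks that have full knowledge of the training data distribution and with reference-free attacks that have no access to the training data?
RQ4: To what extent do neighbourhood-based methods mitigate the false-positive bias observed in simple loss-based attacks?
RQ5: Does the neighbourhood attack remain effective in privacy-sensitive domains where in-domain training data is unavailable?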

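Key findings

Reference-based attacks, including LiRA (the likelihood ratio attack), are markedly fragile when the reference model's training distribution differs from that of the target model, and their performance drops substantially.
The proposed neighbourhood attack achieves performance comparable to reference-based attacks without full knowledge of the training data distribution.
The neighbourhood attack clearly outperforms existing reference-free attacks and reference-based attacks that use imperfect reference data, demonstrating robustness under realistic threat models.
Neighbourhood-based loss calibration accounts for intrinsic sample complexity and thereby effectively reduces the false-positive rate.
The attack works across multiple language model architectures and datasets, indicating broad applicability.
The results suggest that current threat models for membership inference attacks may be too optimistic and that neighbourhood-based methods should be considered in future threat and defence analyses.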
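Claims about false-positive rates depend on how the threshold γ is set, which this review does not describe. One common approach, shown here purely as an assumption, is to fix a target false-positive rate on scores from known non-members and then read off the true-positive rate on members. The function names and all score values below are hypothetical.

```python
from typing import Sequence, Tuple


def threshold_for_fpr(non_member_scores: Sequence[float], target_fpr: float) -> float:
    """Pick gamma so that at most target_fpr of non-members score below it."""
    if not non_member_scores:
        raise ValueError("need at least one non-member score")
    if not 0.0 <= target_fpr <= 1.0:
        raise ValueError("target_fpr must lie in [0, 1]")
    ranked = sorted(non_member_scores)
    allowed = int(target_fpr * len(ranked))  # non-members allowed below gamma
    # Under the rule "score < gamma", gamma = ranked[allowed] flags at most
    # `allowed` non-members; infinity flags everything when all are allowed.
    return ranked[allowed] if allowed < len(ranked) else float("inf")


def rates(
    member_scores: Sequence[float],
    non_member_scores: Sequence[float],
    gamma: float,
) -> Tuple[float, float]:
    """Return (true-positive rate, false-positive rate) for the rule score < gamma."""
    tpr = sum(s < gamma for s in member_scores) / len(member_scores)
    fpr = sum(s < gamma for s in non_member_scores) / len(non_member_scores)
    return tpr, fpr


# Assumption: made-up calibrated scores (sample loss minus mean neighbour loss).
non_member_scores = [-0.5, -0.1, 0.0, 0.2, 0.3, 0.4, 0.6, 0.8, 0.9, 1.0]
member_scores = [-2.0, -1.2, -0.3, 0.1]

gamma = threshold_for_fpr(non_member_scores, target_fpr=0.1)
tpr, fpr = rates(member_scores, non_member_scores, gamma)
print(gamma, tpr, fpr)  # -0.1 0.75 0.1
```

Reporting the true-positive rate at a fixed, low false-positive rate makes attacks with different score scales directly comparable.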

This review was generated by AI and checked by a human editor.