QUICK REVIEW

[Paper Review] Membership Inference Attacks against Language Models via Neighbourhood Comparison

Justus Mattern, Fatemehsadat Mireshghallah|arXiv (Cornell University)|2023. 05. 29.

Adversarial Robustness in Machine Learning|Citations: 4

One-Line Summary

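This paper proposes neighbourhood attacks, a membership inference attack against language models that compares the target model's loss on a sample with its loss on synthetically generated, semantically similar neighbour texts, which removes the need for a reference model trained on in-domain data. The attack is competitive with reference-based attacks that have perfect knowledge of the training data distribution, and it outperforms both existing reference-free attacks and reference-based attacks that rely on imperfect reference data, making it a more robust alternative to reference-based methods under realistic threat models.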

ABSTRACT

Membership Inference attacks (MIAs) aim to predict whether a data sample was present in the training data of a machine learning model or not, and are widely used for assessing the privacy risks of language models. Most existing attacks rely on the observation that models tend to assign higher probabilities to their training samples than non-training points. However, simple thresholding of the model score in isolation tends to lead to high false-positive rates as it does not account for the intrinsic complexity of a sample. Recent work has demonstrated that reference-based attacks which compare model scores to those obtained from a reference model trained on similar data can substantially improve the performance of MIAs. However, in order to train reference models, attacks of this kind make the strong and arguably unrealistic assumption that an adversary has access to samples closely resembling the original training data. Therefore, we investigate their performance in more realistic scenarios and find that they are highly fragile in relation to the data distribution used to train reference models. To investigate whether this fragility provides a layer of safety, we propose and evaluate neighbourhood attacks, which compare model scores for a given sample to scores of synthetically generated neighbour texts and therefore eliminate the need for access to the training data distribution. We show that, in addition to being competitive with reference-based attacks that have perfect knowledge about the training data distribution, our attack clearly outperforms existing reference-free attacks as well as reference-based attacks with imperfect knowledge, which demonstrates the need for a reevaluation of the threat model of adversarial attacks.

Research Motivation and Objectives

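Addresses the unrealistic assumption in reference-based membership inference attacks that the adversary has access to a reference model trained on high-quality, in-domain data.
Investigates how fragile reference-based attacks become when the reference model's training distribution differs from the target model's training data.
Designs a reference-free membership inference attack that maintains strong performance without access to the training data distribution.
Shows that neighbours generated through data augmentation can effectively calibrate model scores for membership inference.
Re-evaluates the threat model for membership inference attacks, showing that neighbourhood-based methods are more robust and practical than reference-based approaches in privacy-sensitive settings.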

Proposed Method

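Generate semantically similar neighbour texts by applying word replacements to the target input with a masked language model.
Compute the loss of the original sample and of each neighbour under the target language model.
Compare the loss of the original sample with the average loss of its neighbours to decide membership.
Using a learned threshold γ, classify the sample as a training member when its loss is significantly lower than the average loss of its neighbours.
Account for intrinsic sample complexity through this neighbourhood-based calibration, which does not rely on an external reference model.
Apply and evaluate the attack across a range of language model architectures and datasets, comparing its performance against reference-based and reference-free baseline attacks.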
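
The comparison step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `neighbourhood_score`, `is_member`, `toy_loss`, and the toy loss table are hypothetical names introduced here. In a real attack, `loss_fn` would wrap the target language model's loss and the neighbours would come from a masked language model; both are stubbed out so the example runs on its own.

```python
from statistics import mean
from typing import Callable, Sequence


def neighbourhood_score(
    sample: str,
    neighbours: Sequence[str],
    loss_fn: Callable[[str], float],
) -> float:
    """Loss of the sample minus the mean loss of its neighbours.

    A strongly negative value means the target model finds the sample
    much easier than comparable texts, which hints at membership.
    """
    if not neighbours:
        raise ValueError("at least one neighbour is required")
    return loss_fn(sample) - mean(loss_fn(n) for n in neighbours)


def is_member(
    sample: str,
    neighbours: Sequence[str],
    loss_fn: Callable[[str], float],
    gamma: float,
) -> bool:
    """Flag the sample as a training member if its score falls below gamma."""
    return neighbourhood_score(sample, neighbours, loss_fn) < gamma


# Toy stand-in for the target model's loss (assumption: made-up values).
TOY_LOSSES = {
    "the cat sat on the mat": 1.0,
    "the cat sat on the rug": 2.5,
    "the dog sat on the mat": 3.5,
}


def toy_loss(text: str) -> float:
    return TOY_LOSSES[text]


sample = "the cat sat on the mat"
neighbours = ["the cat sat on the rug", "the dog sat on the mat"]
print(neighbourhood_score(sample, neighbours, toy_loss))
print(is_member(sample, neighbours, toy_loss, gamma=-0.5))
```

Subtracting the neighbours' mean loss is what calibrates for sample complexity: an intrinsically simple sentence has low loss everywhere, so its neighbours are also low and the difference stays near zero. The sign convention and the value of `gamma` here are illustrative choices.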

Experimental Results

Research Questions

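RQ1: How does the performance of reference-based membership inference attacks degrade when the reference model is trained on a distribution different from the target model's training data?
RQ2: Can neighbourhood comparison with synthetically generated neighbours serve as a practical alternative to reference models in membership inference attacks?
RQ3: How does the proposed neighbourhood attack perform compared with reference-based attacks that have perfect knowledge of the training data distribution, and with reference-free attacks that have no access to training data?
RQ4: To what extent can it reduce the false-positive rates observed with simple loss-based attacks?
RQ5: Does the neighbourhood attack remain effective in privacy-sensitive domains where in-domain training data is unavailable?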

Key Findings

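Reference-based attacks such as Likelihood Ratio Attacks (LiRA) are highly fragile: their performance degrades severely when the reference model is trained on a distribution different from the target model's training data.
The proposed neighbourhood attack is competitive with reference-based attacks that have perfect knowledge, despite having no knowledge of the training data distribution.
The neighbourhood attack clearly outperforms existing reference-free attacks and reference-based attacks that use imperfect reference data, demonstrating its robustness under realistic threat models.
By accounting for intrinsic sample complexity through neighbourhood-based loss calibration, it effectively reduces false-positive rates.
The attack is effective across a range of language model architectures and datasets, indicating broad applicability.
Taken together, the results suggest that current threat models for membership inference attacks are overly optimistic, and that neighbourhood-based methods should be considered in future threat analysis and defence design.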
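
To make the false-positive discussion concrete, the sketch below shows one possible way to pick a threshold and read off error rates from attack scores. It is an assumption for illustration rather than the paper's evaluation protocol: the helper names `threshold_at_fpr` and `rate_below`, the score lists, and the 10% target are all hypothetical, and the scores are taken to follow a "lower means more likely a member" convention.

```python
from typing import Sequence


def threshold_at_fpr(non_member_scores: Sequence[float], target_fpr: float) -> float:
    """Pick gamma so that at most target_fpr of known non-members score below it."""
    ordered = sorted(non_member_scores)
    # Number of non-members we are willing to flag by mistake.
    k = int(target_fpr * len(ordered))
    # Only scores strictly below ordered[k] are flagged, i.e. at most k samples.
    return ordered[k] if k < len(ordered) else float("inf")


def rate_below(scores: Sequence[float], gamma: float) -> float:
    """Fraction of scores that would be classified as members (score < gamma)."""
    return sum(s < gamma for s in scores) / len(scores)


# Made-up attack scores (assumption: lower score = more likely a member).
non_member_scores = [-0.1, 0.0, 0.2, 0.3, -0.4, 0.1, 0.5, -0.2, 0.4, 0.05]
member_scores = [-2.0, -1.5, -0.3, 0.1]

gamma = threshold_at_fpr(non_member_scores, target_fpr=0.1)
print("gamma:", gamma)
print("false-positive rate:", rate_below(non_member_scores, gamma))
print("true-positive rate:", rate_below(member_scores, gamma))
```

Fixing the false-positive rate first and then reporting the true-positive rate keeps attacks comparable, since an attack that flags everything would otherwise look strong on recall alone.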

This review was generated by AI and reviewed by a human editor.