QUICK REVIEW

[Paper Review] Practical Membership Inference Attacks against Fine-tuned Large Language Models via Self-prompt Calibration

Wenjie Fu, Huandong Wang|arXiv (Cornell University)|Nov 10, 2023
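Topic Modeling | Cited by 10
One-Sentence Summary

The paper proposes SPV-MIA, a memorization-based membership inference attack against fine-tuned large language models. It calibrates probabilistic variation with a self-prompt reference model and outperforms baselines in AUC (by roughly 23.6%–30% in the reported comparisons).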

ABSTRACT

Membership Inference Attacks (MIA) aim to infer whether a target data record has been utilized for model training or not. Existing MIAs designed for large language models (LLMs) can be bifurcated into two types: reference-free and reference-based attacks. Although reference-based attacks appear promising performance by calibrating the probability measured on the target model with reference models, this illusion of privacy risk heavily depends on a reference dataset that closely resembles the training set. Both two types of attacks are predicated on the hypothesis that training records consistently maintain a higher probability of being sampled. However, this hypothesis heavily relies on the overfitting of target models, which will be mitigated by multiple regularization methods and the generalization of LLMs. Thus, these reasons lead to high false-positive rates of MIAs in practical scenarios. We propose a Membership Inference Attack based on Self-calibrated Probabilistic Variation (SPV-MIA). Specifically, we introduce a self-prompt approach, which constructs the dataset to fine-tune the reference model by prompting the target LLM itself. In this manner, the adversary can collect a dataset with a similar distribution from public APIs. Furthermore, we introduce probabilistic variation, a more reliable membership signal based on LLM memorization rather than overfitting, from which we rediscover the neighbour attack with theoretical grounding. Comprehensive evaluation conducted on three datasets and four exemplary LLMs shows that SPV-MIA raises the AUC of MIAs from 0.7 to a significantly high level of 0.9. Our code and dataset are available at: https://github.com/tsinghua-fib-lab/NeurIPS2024_SPV-MIA

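Motivation and Objectives

  • Expose privacy risks in LLM fine-tuning pipelines and quantify membership risk beyond the overfitting assumption.
  • Develop a robust MIA that uses memorization, rather than overfitting, as the membership signal.
  • Introduce a self-prompt technique that builds a reference model for calibration from the target LLM itself.
  • Evaluate SPV-MIA across multiple LLMs and datasets to demonstrate practical privacy leakage.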

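Proposed Method

  • Define probabilistic variation as a memorization-based signal measured around a local probability maximum.
  • Estimate probabilistic variation from paraphrased variants of the target text, generated by a mask-filling model such as T5.
  • Calibrate the memorization signal with a self-prompt reference model, i.e. a reference model trained on data obtained by prompting the target LLM itself.
  • Formulate the attack as A_our(x, θ, φ) = 1[ p̃_θ(x) - p̃_φ(x) ≤ τ ], where 1[·] is the indicator function, τ is a decision threshold, and p̃_θ and p̃_φ are the probabilistic variation estimates of the target model and the reference model.
  • Use a two-stage workflow: estimate p̃_θ by sampling paraphrased neighbours, then calibrate it with φ fine-tuned on the self-prompt data.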
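The two-stage workflow above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The names `toy_target`, `toy_reference`, `toy_neighbours`, and the threshold value are hypothetical stand-ins: a real attack would score text with the target and reference LLMs and generate neighbours with a mask-filling model such as T5. The estimator form (mean neighbour probability minus the probability of the original text) is also an assumption about how variation around a local maximum might be measured.

```python
from statistics import mean
from typing import Callable, List, Sequence

# Hypothetical interface: maps a text to the probability a model assigns to it.
ProbFn = Callable[[str], float]


def probabilistic_variation(prob: ProbFn, x: str, neighbours: Sequence[str]) -> float:
    """Assumed estimator: mean probability of paraphrased neighbours minus p(x).

    A strongly negative value suggests x sits on a sharp local maximum.
    """
    if not neighbours:
        raise ValueError("need at least one neighbour")
    return mean(prob(n) for n in neighbours) - prob(x)


def spv_mia_decision(target: ProbFn, reference: ProbFn, x: str,
                     neighbours: Sequence[str], tau: float) -> bool:
    """Calibrated rule as stated above: member iff p~_theta(x) - p~_phi(x) <= tau."""
    score = (probabilistic_variation(target, x, neighbours)
             - probabilistic_variation(reference, x, neighbours))
    return score <= tau


# --- Toy stand-ins (assumptions for illustration only) ---
MEMORISED = "the secret record"


def toy_target(text: str) -> float:
    # The toy target assigns a sharp probability peak to the memorised record.
    return 0.9 if text == MEMORISED else 0.2


def toy_reference(text: str) -> float:
    # The toy reference model has no peak anywhere.
    return 0.25


def toy_neighbours(text: str) -> List[str]:
    # Placeholder for paraphrases produced by a mask-filling model.
    return [text + " indeed", "well, " + text]


if __name__ == "__main__":
    for record in (MEMORISED, "an unseen record"):
        is_member = spv_mia_decision(toy_target, toy_reference, record,
                                     toy_neighbours(record), tau=-0.3)
        print(record, "->", "member" if is_member else "non-member")
```

Subtracting the reference model's variation is what makes the score calibrated: text that is simply easy or hard for any model shifts both terms similarly, so only the peak specific to the target model remains.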

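Experimental Results

Research Questions

  • RQ1: Does SPV-MIA outperform state-of-the-art MIAs on LLMs in practical, memorization-driven settings?
  • RQ2: How does the quality of the self-prompt reference model affect attack performance?
  • RQ3: How do different fine-tuning techniques affect SPV-MIA?
  • RQ4: Can privacy defenses withstand SPV-MIA?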

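Key Findings

  • SPV-MIA consistently outperforms baselines across four LLMs and three datasets, with an average AUC of 92.4%.
  • Compared with the strongest baseline (LiRA-Candidate), SPV-MIA improves AUC by about 30% in the reported comparison.
  • An overall AUC gain of about 23.6% over baselines is reported; the abstract above phrases the improvement as a rise from 0.7 to 0.9.
  • The self-prompt reference model provides effective calibration even without access to a reference dataset that matches the training distribution.
  • Ablation studies show that each SPV-MIA module (probabilistic variation assessment and self-prompt calibration) contributes to attack effectiveness.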
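Because the findings are stated in terms of AUC, a short sketch of how AUC could be computed from attack scores may help. It uses the pairwise (rank-based) definition: the fraction of member/non-member pairs in which the member receives the higher score, with ties counted as half. The scores below are made-up illustrative numbers, not results from the paper, and the sketch assumes that a higher score means "more member-like".

```python
from typing import Sequence


def auc(member_scores: Sequence[float], nonmember_scores: Sequence[float]) -> float:
    """Pairwise AUC: probability that a random member outscores a random non-member."""
    if not member_scores or not nonmember_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0      # member correctly ranked above non-member
            elif m == n:
                wins += 0.5      # ties count as half
    return wins / (len(member_scores) * len(nonmember_scores))


if __name__ == "__main__":
    # Illustrative scores only (hypothetical).
    members = [0.9, 0.7, 0.4]
    nonmembers = [0.5, 0.3, 0.1]
    # 8 of the 9 pairs are ranked correctly (0.4 vs 0.5 is the miss).
    print(round(auc(members, nonmembers), 3))
```

An AUC of 0.5 corresponds to random guessing, which is why a move from roughly 0.7 to 0.9 represents a substantial gain in attack strength.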

This review was generated by AI and reviewed by human editors.