QUICK REVIEW

[Paper Review] Unintended Memorization of Sensitive Information in Fine-Tuned Language Models

Marton Szep, Jorge Marin Ruiz|arXiv (Cornell University)|Jan 24, 2026

Adversarial Robustness in Machine Learning | Cited by 0

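One-Sentence Summary

This paper defines and quantifies input-only PII memorization in fine-tuned large language models, analyzes the factors that influence leakage, and benchmarks privacy-preserving methods (DP, UnDial, regularization, DPO) across multiple models and datasets.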

ABSTRACT

Fine-tuning Large Language Models (LLMs) on sensitive datasets carries a substantial risk of unintended memorization and leakage of Personally Identifiable Information (PII), which can violate privacy regulations and compromise individual safety. In this work, we systematically investigate a critical and underexplored vulnerability: the exposure of PII that appears only in model inputs, not in training targets. Using both synthetic and real-world datasets, we design controlled extraction probes to quantify unintended PII memorization and study how factors such as language, PII frequency, task type, and model size influence memorization behavior. We further benchmark four privacy-preserving approaches including differential privacy, machine unlearning, regularization, and preference alignment, evaluating their trade-offs between privacy and task performance. Our results show that post-training methods generally provide more consistent privacy-utility trade-offs, while differential privacy achieves strong reduction in leakage in specific settings, although it can introduce training instability. These findings highlight the persistent challenge of memorization in fine-tuned LLMs and emphasize the need for robust, scalable privacy-preserving techniques.

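Research Motivation and Goals

Define and formalize input-only PII memorization in fine-tuned LLMs.
Quantify memorization under realistic attack scenarios, using both synthetic and real-world datasets.
Identify the factors that influence memorization (language, PII frequency, task type, model size).
Benchmark privacy-preserving strategies and evaluate their trade-offs between privacy and task performance.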

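Proposed Method

Define the True-Prefix Attack (TPA) to probe how much PII can be extracted from fine-tuned autoregressive models.
Measure leakage on synthetic data and a real-world German medical dataset under greedy decoding, sampling, and cross-memorization settings.
Evaluate four mitigation strategies: Differential Privacy (DP), UnDial, Regularization, and Direct Preference Optimization (DPO).
Integrate DP, UnDial, Regularization, and DPO into QLoRA-based fine-tuning across multiple model sizes and architectures.
Analyze how prefix length, model size, and language affect memorization behavior.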
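The summary names the True-Prefix Attack but does not spell out its mechanics. The sketch below is one plausible reading, assuming the attacker prompts the model with the true text that preceded a PII value in a training input and checks whether the continuation reproduces that value. The `Probe` type, the `toy_generate` stand-in model, the example records, and the substring-match leak criterion are all hypothetical and chosen only for illustration.

```python
from typing import Callable, Iterable, NamedTuple


class Probe(NamedTuple):
    true_prefix: str  # text that preceded the PII in a training input
    pii: str          # ground-truth PII value the attacker hopes to recover


def true_prefix_attack(
    generate: Callable[[str], str],
    probes: Iterable[Probe],
) -> float:
    """Return the fraction of probes whose completion contains the PII."""
    probe_list = list(probes)
    if not probe_list:
        return 0.0
    hits = 0
    for probe in probe_list:
        completion = generate(probe.true_prefix)
        # Assumption: a case-insensitive substring match counts as a leak.
        if probe.pii.lower() in completion.lower():
            hits += 1
    return hits / len(probe_list)


# Toy stand-in for a fine-tuned model that "memorized" exactly one record.
MEMORIZED = {"Patient name: ": "Erika Mustermann, admitted on 3 May"}


def toy_generate(prefix: str) -> str:
    return MEMORIZED.get(prefix, "[no continuation]")


probes = [
    Probe("Patient name: ", "Erika Mustermann"),
    Probe("Contact phone: ", "+49 30 1234567"),
]
leak_rate = true_prefix_attack(toy_generate, probes)
print(f"extraction rate: {leak_rate:.2f}")  # prints "extraction rate: 0.50"
```

In a real evaluation, `generate` would wrap the fine-tuned model's decoding routine (greedy or sampled), and varying the length of `true_prefix` would give a rough way to study the prefix-length effect mentioned above.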

Figure 1: Overview of our experiment setup depicting the unintended PII memorization scenario, our attack, and fine-tuning approaches.

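Experimental Results

Research Questions

RQ1: What is unintended input-only PII memorization in fine-tuned LLMs, and how can it be formally defined?
RQ2: Under realistic attack settings, how much PII can be memorized and extracted across languages, tasks, and model sizes?
RQ3: Which factors (language, PII frequency, task type, model size) influence the severity of memorization?
RQ4: How do privacy-preserving methods trade off privacy against task performance during fine-tuning?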

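Key Findings

Post-training mitigation methods (DPO, UnDial) generally offer more consistent privacy-utility trade-offs than preventive methods.
Differential privacy can substantially reduce leakage in certain settings, but it may cause training instability and variation in results across runs.
On some datasets DP provides the strongest reduction in leakage, but memorization still persists under enhanced attacks.
Model size and architecture affect memorization; larger models have a higher baseline ability to reveal PII even before fine-tuning.
Memorization risk is not predicted by PII frequency alone; context and task utility also play an important role.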
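The summary does not say which differential privacy mechanism was used. A common choice for fine-tuning is DP-SGD, so the sketch below shows one DP-SGD gradient step under that assumption: each per-example gradient is clipped to a fixed L2 norm, Gaussian noise scaled to that norm is added to the sum, and the result is averaged. The function names and parameter values are hypothetical, and gradients are plain Python lists to keep the example self-contained.

```python
import math
import random
from typing import List, Sequence


def clip(grad: Sequence[float], max_norm: float) -> List[float]:
    """Scale a gradient so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_gradient(
    per_example_grads: Sequence[Sequence[float]],
    max_norm: float,
    noise_multiplier: float,
    rng: random.Random,
) -> List[float]:
    """Clip each example's gradient, add Gaussian noise to the sum, average."""
    clipped = [clip(g, max_norm) for g in per_example_grads]
    dim = len(clipped[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    # The noise standard deviation is tied to the clipping bound.
    sigma = noise_multiplier * max_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [v / len(clipped) for v in noisy]


grads = [[3.0, 4.0], [0.3, 0.4]]  # toy per-example gradients
no_noise = dp_sgd_gradient(grads, max_norm=1.0, noise_multiplier=0.0,
                           rng=random.Random(0))
with_noise = dp_sgd_gradient(grads, max_norm=1.0, noise_multiplier=1.0,
                             rng=random.Random(0))
print(no_noise, with_noise)
```

Both the clipping and the added noise perturb every update, which may be one reason a DP run can be less stable than a non-private one; this is a general property of the mechanism, not a claim about the paper's specific setup.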

Figure 2: Distribution of per‐token log‑likelihoods for ground‑truth PII completions.
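As a rough illustration of the quantity plotted in Figure 2, the sketch below computes per-token log-likelihoods from the probabilities a model assigns to each token of a ground-truth PII completion. It assumes those probabilities are already available; the function name and the example probability values are hypothetical and not taken from the paper.

```python
import math
from typing import List, Sequence


def per_token_log_likelihoods(token_probs: Sequence[float]) -> List[float]:
    """Natural-log probability of each ground-truth token."""
    if not token_probs:
        raise ValueError("token_probs must not be empty")
    return [math.log(p) for p in token_probs]


def mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


# Hypothetical probabilities assigned to the tokens of one PII value.
likely_memorized = per_token_log_likelihoods([0.90, 0.95, 0.90])
likely_unseen = per_token_log_likelihoods([0.10, 0.20, 0.05])

print(mean(likely_memorized), mean(likely_unseen))
```

Values closer to zero mean the model finds the true PII tokens more likely, so a distribution shifted toward zero after fine-tuning would be consistent with stronger memorization.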


This review was generated by AI and checked by a human editor.