QUICK REVIEW

[Paper Review] Unintended Memorization of Sensitive Information in Fine-Tuned Language Models

Marton Szep, Jorge Marin Ruiz | arXiv (Cornell University) | Jan 24, 2026

Adversarial Robustness in Machine Learning | Citations: 0

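One-Line Summary

This paper defines and quantifies input-only PII memorization in fine-tuned LLMs, analyzes the factors that influence leakage, and benchmarks privacy-preserving approaches (DP, UnDial, regularization, DPO) across multiple models and datasets.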

ABSTRACT

Fine-tuning Large Language Models (LLMs) on sensitive datasets carries a substantial risk of unintended memorization and leakage of Personally Identifiable Information (PII), which can violate privacy regulations and compromise individual safety. In this work, we systematically investigate a critical and underexplored vulnerability: the exposure of PII that appears only in model inputs, not in training targets. Using both synthetic and real-world datasets, we design controlled extraction probes to quantify unintended PII memorization and study how factors such as language, PII frequency, task type, and model size influence memorization behavior. We further benchmark four privacy-preserving approaches including differential privacy, machine unlearning, regularization, and preference alignment, evaluating their trade-offs between privacy and task performance. Our results show that post-training methods generally provide more consistent privacy-utility trade-offs, while differential privacy achieves strong reduction in leakage in specific settings, although it can introduce training instability. These findings highlight the persistent challenge of memorization in fine-tuned LLMs and emphasize the need for robust, scalable privacy-preserving techniques.

Motivation and Objectives

Define and formalize input-only PII memorization in fine-tuned LLMs.
Quantify memorization using synthetic and real-world datasets under realistic attack scenarios.
Identify factors influencing memorization (language, PII frequency, task type, model size).
Benchmark privacy-preserving strategies and evaluate trade-offs between privacy and task performance.

Proposed Method

Define True-Prefix Attack (TPA) to probe extraction of PII from fine-tuned autoregressive LLMs.
Use both synthetic and real-world German medical datasets to measure leakage under greedy, sampling, and cross-memorization settings.
Evaluate four mitigation strategies: Differential Privacy (DP), UnDial, Regularization, and Direct Preference Optimization (DPO).
Adapt all four strategies to QLoRA-based fine-tuning across multiple model sizes and architectures.
Analyze effects of prefix length, model size, and language on memorization behavior.
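
The extraction probe listed above can be illustrated with a small sketch. It assumes that a true-prefix probe prompts the model with the text that actually preceded a PII value in the training input and then checks whether the completion reproduces that value. This review does not give the paper's formal definition, so the function name, the character-based prefix window, and the substring match are all hypothetical choices, and the toy generator merely stands in for a fine-tuned model.

```python
from typing import Callable, Iterable, Tuple


def true_prefix_extraction_rate(
    samples: Iterable[Tuple[str, str]],
    generate: Callable[[str], str],
    prefix_chars: int = 50,
) -> float:
    """Fraction of PII values reproduced when the model is prompted with
    the text that truly preceded them in the training input.

    `samples` holds (training_input_text, pii_value) pairs.
    `generate` is any prompt -> completion function (hypothetical interface).
    """
    hits = 0
    total = 0
    for text, pii in samples:
        start = text.find(pii)
        if start < 0:
            continue  # PII value not present in this record; skip it
        # Keep at most the last `prefix_chars` characters before the PII.
        prefix = text[max(0, start - prefix_chars):start]
        completion = generate(prefix)
        total += 1
        if pii in completion:
            hits += 1
    return hits / total if total else 0.0


# Toy stand-in for a fine-tuned model that has "memorized" one record.
# All names and records below are made up for illustration.
MEMORIZED = {"Patient name: ": "Anna Beispiel, admitted 2021"}


def toy_generate(prefix: str) -> str:
    for key, continuation in MEMORIZED.items():
        if prefix.endswith(key):
            return continuation
    return ""


TOY_SAMPLES = [
    ("Patient name: Anna Beispiel. Complaint: cough.", "Anna Beispiel"),
    ("Contact: Max Muster, phone on file.", "Max Muster"),
]

rate = true_prefix_extraction_rate(TOY_SAMPLES, toy_generate)
print(f"extraction rate: {rate:.2f}")
```

Varying `prefix_chars` in such a probe is one simple way to study the effect of prefix length. A real evaluation would replace `toy_generate` with greedy or sampled decoding from the fine-tuned model.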

Figure 1: Overview of our experiment setup depicting the unintended PII memorization scenario, our attack, and fine-tuning approaches.

Experimental Results

Research Questions

RQ1: What is unintended input-only PII memorization in fine-tuned LLMs, and how can it be formally defined?
RQ2: How much PII can be memorized and extracted under realistic attack settings across languages, tasks, and model sizes?
RQ3: Which factors (language, PII frequency, task type, model size) influence memorization severity?
RQ4: How do privacy-preserving methods trade off privacy against task performance in fine-tuning?

Key Findings

Post-training mitigation methods (DPO, UnDial) generally yield more consistent privacy–utility trade-offs than preventive approaches.
Differential privacy can significantly reduce leakage in some settings but may introduce training instability and variable results across runs.
On some datasets, DP provides the strongest leakage reduction, but memorization persists under enhanced attacks.
Model size and architecture influence memorization, with larger models showing higher baseline capacity to reveal PII even without fine-tuning.
Memorization risk is not solely predicted by PII frequency; context and task utility play significant roles.

Figure 2: Distribution of per‐token log‑likelihoods for ground‑truth PII completions.
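
Figure 2 refers to per-token log-likelihoods of ground-truth PII completions. The sketch below shows how such values are commonly obtained from next-token logits under teacher forcing. It is an illustration under stated assumptions, not the paper's scoring code: the function names are hypothetical, and the logits are plain Python lists rather than model outputs.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def per_token_log_likelihoods(
    step_logits: Sequence[Sequence[float]],
    target_ids: Sequence[int],
) -> List[float]:
    """Log-probability assigned to each ground-truth token.

    step_logits[i] are the next-token logits at position i (assumed to be
    produced with the true preceding tokens as context), and target_ids[i]
    is the ground-truth token id at that position.
    """
    if len(step_logits) != len(target_ids):
        raise ValueError("need exactly one logit vector per target token")
    return [
        log_softmax(logits)[token_id]
        for logits, token_id in zip(step_logits, target_ids)
    ]


# Toy example with a 3-token vocabulary and a 2-token PII completion.
toy_logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
toy_targets = [0, 2]
scores = per_token_log_likelihoods(toy_logits, toy_targets)
print(scores, sum(scores) / len(scores))
```

Values closer to zero mean the model assigns higher probability to the true PII token, so a distribution shifted toward zero is consistent with stronger memorization.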

This review was generated by AI and checked by a human editor.