QUICK REVIEW

[Paper Review] Unintended Memorization of Sensitive Information in Fine-Tuned Language Models

Marton Szep, Jorge Marin Ruiz|arXiv (Cornell University)|2026. 01. 24.

Adversarial Robustness in Machine Learning | Citations: 0

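One-Line Summary

This paper defines and quantifies input-only PII memorization in fine-tuned LLMs, analyzes the factors that influence leakage, and benchmarks privacy-preserving methods (DP, UnDial, regularization, DPO) across multiple models and datasets.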

ABSTRACT

Fine-tuning Large Language Models (LLMs) on sensitive datasets carries a substantial risk of unintended memorization and leakage of Personally Identifiable Information (PII), which can violate privacy regulations and compromise individual safety. In this work, we systematically investigate a critical and underexplored vulnerability: the exposure of PII that appears only in model inputs, not in training targets. Using both synthetic and real-world datasets, we design controlled extraction probes to quantify unintended PII memorization and study how factors such as language, PII frequency, task type, and model size influence memorization behavior. We further benchmark four privacy-preserving approaches including differential privacy, machine unlearning, regularization, and preference alignment, evaluating their trade-offs between privacy and task performance. Our results show that post-training methods generally provide more consistent privacy-utility trade-offs, while differential privacy achieves strong reduction in leakage in specific settings, although it can introduce training instability. These findings highlight the persistent challenge of memorization in fine-tuned LLMs and emphasize the need for robust, scalable privacy-preserving techniques.

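Motivation and Objectives

Define and formalize input-only PII memorization in fine-tuned LLMs.
Quantify memorization under realistic attack scenarios, using both synthetic and real-world datasets.
Identify the factors that influence memorization, including language, PII frequency, task type, and model size.
Benchmark privacy-preserving strategies and evaluate their trade-offs between privacy and task performance.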

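Proposed Method

Defines the True-Prefix Attack (TPA) to probe PII extraction from fine-tuned autoregressive (L)LMs.
Measures leakage on both synthetic data and a real-world German medical dataset, under greedy, sampling, and cross-memorization settings.
Evaluates four mitigation strategies: DP, UnDial, regularization, and Direct Preference Optimization (DPO).
Applies DP, UnDial, regularization, and DPO within QLoRA-based fine-tuning across a range of model sizes and architectures.
Analyzes how prefix length, model size, and language affect memorization behavior.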
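The review does not spell out how the True-Prefix Attack is scored, so the following is only a minimal sketch of the general idea: feed the model the text that truly preceded a PII string in a training input, and check whether the continuation reproduces that PII. The names `Probe`, `true_prefix_attack`, and `leakage_rate` are hypothetical, `generate` stands in for any prompt-to-text function, and both the character-based prefix truncation and the exact-match criterion are assumptions rather than the paper's definition.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class Probe:
    text: str  # a training input that contains the PII
    pii: str   # the PII string occurring inside `text`


def true_prefix_attack(
    probe: Probe,
    generate: Callable[[str], str],
    prefix_chars: Optional[int] = None,
) -> bool:
    """Return True if the model continues the true prefix with the PII.

    Assumption: a hit means the completion starts with the exact PII string
    (after stripping leading whitespace).
    """
    start = probe.text.find(probe.pii)
    if start < 0:
        raise ValueError("PII string does not occur in the probe text")
    prefix = probe.text[:start]
    if prefix_chars is not None:
        # Keep only the last `prefix_chars` characters to vary prefix length.
        prefix = prefix[-prefix_chars:] if prefix_chars > 0 else ""
    completion = generate(prefix)
    return completion.lstrip().startswith(probe.pii)


def leakage_rate(
    probes: Iterable[Probe],
    generate: Callable[[str], str],
    prefix_chars: Optional[int] = None,
) -> float:
    """Fraction of probes whose PII is reproduced by the model."""
    probe_list = list(probes)
    if not probe_list:
        return 0.0
    hits = sum(true_prefix_attack(p, generate, prefix_chars) for p in probe_list)
    return hits / len(probe_list)
```

With a real model, `generate` would wrap greedy or sampled decoding. Sweeping `prefix_chars` gives a rough way to study how prefix length changes the leakage rate.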

Figure 1: Overview of our experiment setup depicting the unintended PII memorization scenario, our attack, and fine-tuning approaches.

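Experimental Results

Research Questions

RQ1: What is the phenomenon of unintended input-only PII memorization in fine-tuned LLMs, and how is it formally defined?
RQ2: Under realistic attack settings, how much PII is memorized and can be extracted across languages, tasks, and model sizes?
RQ3: Which factors (language, PII frequency, task type, model size) influence the severity of memorization?
RQ4: How do privacy-preserving methods trade off privacy against task performance during fine-tuning?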

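Key Findings

Post-training mitigation methods (DPO, UnDial) generally provide more consistent privacy-utility trade-offs than preventive methods.
Differential privacy can substantially reduce leakage in some settings, but it can introduce training instability and results that vary between runs.
DP yields the largest leakage reduction on some datasets, yet memorization persists under strengthened attacks.
Model size and architecture affect memorization; larger models have a higher baseline capacity to reveal PII even without fine-tuning.
Memorization risk is not predicted by PII frequency alone; context and task utility play an important role.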
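The review does not say which DP mechanism the authors used. As background for the DP findings above, here is a sketch of the standard DP-SGD aggregation step: each per-example gradient is clipped to a maximum L2 norm, then Gaussian noise scaled to that norm is added before averaging. The function name `dp_sgd_aggregate` and its parameters are hypothetical, and the sketch does no privacy accounting.

```python
import math
import random
from typing import List, Sequence


def dp_sgd_aggregate(
    per_example_grads: Sequence[Sequence[float]],
    clip_norm: float,
    noise_multiplier: float,
    rng: random.Random,
) -> List[float]:
    """Clip each example's gradient, sum, add Gaussian noise, and average."""
    if not per_example_grads:
        raise ValueError("need at least one per-example gradient")
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    for grad in per_example_grads:
        norm = math.sqrt(sum(x * x for x in grad))
        # Scale down only gradients whose L2 norm exceeds clip_norm.
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for i, x in enumerate(grad):
            total[i] += x * scale
    # Noise standard deviation is proportional to the clipping bound.
    sigma = noise_multiplier * clip_norm
    n = len(per_example_grads)
    return [(t + rng.gauss(0.0, sigma)) / n for t in total]
```

Clipping bounds how much any single training example can move the update, and the noise masks what remains. Both also perturb the optimization signal, which is one plausible source of the run-to-run variability noted above.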

Figure 2: Distribution of per‐token log‑likelihoods for ground‑truth PII completions.
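The quantity plotted in Figure 2 can be illustrated with a short sketch. It assumes the model's raw next-token logits are available at every position of the ground-truth PII completion; the function names are hypothetical, and the authors' own computation may differ.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def per_token_log_likelihoods(
    step_logits: Sequence[Sequence[float]],
    target_ids: Sequence[int],
) -> List[float]:
    """Log-probability assigned to each ground-truth token, one per position."""
    if len(step_logits) != len(target_ids):
        raise ValueError("need exactly one logit vector per target token")
    return [log_softmax(logits)[t] for logits, t in zip(step_logits, target_ids)]
```

Values closer to zero mean the model assigns high probability to the true PII tokens, which is the signature of memorization that such a distribution is meant to expose.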


This review was generated by AI and reviewed by a human editor.