[Paper Explained] Learning to Detect Language Model Training Data via Active Reconstruction
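The paper proposes Active Data Reconstruction Attacks (ADRA and ADRA+), which fine-tune the target LLM with reinforcement learning to reconstruct candidate training data, improving membership inference across pre-training, post-training, and distillation settings.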
Detecting LLM training data is generally framed as a membership inference attack (MIA) problem. However, conventional MIAs operate passively on fixed model weights, using log-likelihoods or text generations. In this work, we introduce Active Data Reconstruction Attack (ADRA), a family of MIA that actively induces a model to reconstruct a given text through training. We hypothesize that training data are more reconstructible than non-members, and the difference in their reconstructibility can be exploited for membership inference. Motivated by findings that reinforcement learning (RL) sharpens behaviors already encoded in weights, we leverage on-policy RL to actively elicit data reconstruction by finetuning a policy initialized from the target model. To effectively use RL for MIA, we design reconstruction metrics and contrastive rewards. The resulting algorithms, ADRA and its adaptive variant ADRA+, improve both reconstruction and detection given a pool of candidate data. Experiments show that our methods consistently outperform existing MIAs in detecting pre-training, post-training, and distillation data, with an average improvement of 10.7% over the previous runner-up. In particular, ADRA+ improves over Min-K%++ by 18.8% on BookMIA for pre-training detection and by 7.6% on AIME for post-training detection.
Research Motivation and Objectives
- Motivate the need to detect training data leakage in LLMs due to memorization and potential privacy concerns.
- Propose an active MIA framework that elicits latent membership signals by updating model weights during training.
- Demonstrate that RL-based reconstruction rewards reveal stronger membership signals than passive MIAs across multiple models and data regimes.
Proposed Method
- Formulate Active Data Reconstruction Attack (ADRA) as on-policy RL that fine-tunes the target model to maximize reconstruction similarity to a true suffix.
- Use a contrastive reward design with a pool containing the true suffix and distractors to guide reconstruction (matching variant), or adaptively include the true suffix based on a prior (adaptive matching variant).
- Define several lexical reconstruction reward metrics (token set similarity, longest common subsequence, n-gram set coverage) to measure reconstruction quality.
- Apply Group Relative Policy Optimization (GRPO) to update the policy and evaluate membership by re-sampling completions from the finetuned model.
- Construct six new MIA datasets spanning pre-training, post-training, and distillation to match frontier-model knowledge cutoffs and post-training scales.
- Compare ADRA/ADRA+ against passive MIAs (loss-based, zlib, Min-K%, Min-K%++, N-Sampling) across open-weight LLMs.
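The lexical reward metrics and the GRPO step listed above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the paper's implementation: whitespace tokenization, Jaccard overlap as the token set similarity, normalization by reference length, and the function names are all hypothetical. The last function shows the standard group-relative normalization used in GRPO, where each sampled completion's reward is standardized against the other completions drawn for the same prompt.

```python
from statistics import mean, pstdev


def tokenize(text: str) -> list[str]:
    # Assumption: lowercase whitespace tokenization (the paper's tokenizer may differ).
    return text.lower().split()


def token_set_similarity(candidate: str, reference: str) -> float:
    """Jaccard overlap between the token sets of candidate and reference."""
    a, b = set(tokenize(candidate)), set(tokenize(reference))
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def lcs_ratio(candidate: str, reference: str) -> float:
    """Longest common subsequence length, divided by the reference length."""
    x, y = tokenize(candidate), tokenize(reference)
    if not y:
        return 0.0
    prev = [0] * (len(y) + 1)  # LCS lengths for the previous candidate token
    for xi in x:
        curr = [0]
        for j, yj in enumerate(y, start=1):
            if xi == yj:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1] / len(y)


def ngram_coverage(candidate: str, reference: str, n: int = 2) -> float:
    """Fraction of the reference's distinct n-grams that appear in the candidate."""
    def ngrams(tokens: list[str]) -> set[tuple[str, ...]]:
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    ref = ngrams(tokenize(reference))
    if not ref:
        return 0.0
    return len(ref & ngrams(tokenize(candidate))) / len(ref)


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """GRPO-style advantages: standardize rewards within one group of samples."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    true_suffix = "the quick brown fox jumps"
    completion = "the quick red fox jumps"
    print(token_set_similarity(completion, true_suffix))
    print(lcs_ratio(completion, true_suffix))
    print(ngram_coverage(completion, true_suffix, n=2))
    print(group_relative_advantages([1.0, 0.0]))
```

In this sketch, each sampled completion would be scored against the true suffix with one of the metrics, and the resulting rewards for a group of completions would be converted into advantages before the policy update.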

Experimental Results
Research Questions
- RQ1: Can latent membership signals be elicited by actively updating model weights during reconstruction tasks?
- RQ2: Do ADRA and ADRA+ improve MIA performance over passive attacks across pre-training, post-training, and distillation scenarios?
- RQ3: Which reconstruction metrics and reward designs yield the strongest membership inference signals?
- RQ4: How do ADRA/ADRA+ scale with model size and data contamination settings?
- RQ5: How do newly constructed MIA datasets reflect real-world training data leakage risks?
Key Findings
- ADRA and ADRA+ consistently outperform prior MIAs across pre-training, post-training, and distillation settings.
- ADRA+ achieves up to 60.6% AUROC on WikiMIA 2024 Hard (pre-training) and 85.9% AUROC on AIME (post-training) in original settings, surpassing baselines.
- In distillation, ADRA attains near-perfect membership inference (S1.1: 98.4% AUROC).
- RL-based optimization and contrastive reconstruction rewards are critical to gains, outperforming supervised fine-tuning.
- Lexical reconstruction rewards outperform embedding or LLM-based rewards in AUROC across tested settings.
- Post-training data show stronger memorization and are more extractable than pre-training data.
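The AUROC figures above measure how well a per-example score separates members from non-members. As a rough sketch of that evaluation, the following computes AUROC as the probability that a randomly chosen member receives a higher score than a randomly chosen non-member, with ties counted as half. The scores are made-up toy values, and using a reconstruction score as the membership score is an assumption based on the summary above rather than the paper's exact protocol.

```python
def auroc(member_scores: list[float], nonmember_scores: list[float]) -> float:
    """Pairwise (Mann-Whitney) AUROC: P(member score > non-member score), ties = 0.5."""
    if not member_scores or not nonmember_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))


if __name__ == "__main__":
    # Hypothetical reconstruction scores, for illustration only.
    members = [0.9, 0.6, 0.4]
    nonmembers = [0.5, 0.3, 0.2]
    print(f"AUROC = {auroc(members, nonmembers):.3f}")
```

An AUROC of 0.5 corresponds to chance-level separation and 1.0 to perfect separation, which is the scale on which the percentages in the findings are reported.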

This explainer was generated by AI and reviewed by a human editor.