QUICK REVIEW

[论文解读] Learning to Detect Language Model Training Data via Active Reconstruction

Junjie Oscar Yin, John X. Morris|arXiv (Cornell University)|Feb 22, 2026

Topic Modeling被引用 0

一句话总结

论文提出 Active Data Reconstruction Attacks (ADRA 和 ADRA+)，通过对目标 LLM 进行强化学习微调来重建候选训练数据，从而在预训练、后训练和蒸馏设置下提升成员推断Across.

ABSTRACT

Detecting LLM training data is generally framed as a membership inference attack (MIA) problem. However, conventional MIAs operate passively on fixed model weights, using log-likelihoods or text generations. In this work, we introduce extbf{Active Data Reconstruction Attack} (ADRA), a family of MIA that actively induces a model to reconstruct a given text through training. We hypothesize that training data are extit{more reconstructible} than non-members, and the difference in their reconstructibility can be exploited for membership inference. Motivated by findings that reinforcement learning (RL) sharpens behaviors already encoded in weights, we leverage on-policy RL to actively elicit data reconstruction by finetuning a policy initialized from the target model. To effectively use RL for MIA, we design reconstruction metrics and contrastive rewards. The resulting algorithms, extsc{ADRA} and its adaptive variant extsc{ADRA+}, improve both reconstruction and detection given a pool of candidate data. Experiments show that our methods consistently outperform existing MIAs in detecting pre-training, post-training, and distillation data, with an average improvement of 10.7\% over the previous runner-up. In particular, \MethodPlus~improves over Min-K\%++ by 18.8\% on BookMIA for pre-training detection and by 7.6\% on AIME for post-training detection.

研究动机与目标

Motivate the need to detect training data leakage in LLMs due to memorization and potential privacy concerns.
Propose an active MIA framework that elicits latent membership signals by updating model weights during training.
Demonstrate that RL-based reconstruction rewards reveal stronger membership signals than passive MIAs across multiple models and data regimes.

提出的方法

Formulate Active Data Reconstruction Attack (ADRA) as on-policy RL that fine-tunes the target model to maximize reconstruction similarity to a true suffix.
Use a contrastive reward design with a pool of true suffix and distractors to guide reconstruction (matching variant) or adaptively include true suffix based on a prior (adaptive matching).
Define several lexical reconstruction reward metrics (token set similarity, longest common subsequence, n-gram set coverage) to measure reconstruction quality.
Apply Group Relative Policy Optimization (GRPO) to update the policy and evaluate membership by re-sampling completions from the finetuned model.
Construct six new MIA datasets spanning pre-training, post-training, and distillation to match frontier-model knowledge cutoffs and post-training scales.
Compare ADRA/ADRA+ against passive MIAs (loss-based, zlib, Min-K%, Min-K%++, N-Sampling) across open-weight LLMs.

Figure 1 : Active Data Reconstruction Attack. Language model generates reconstructions from a candidate prefix and is rewarded via a contrastive objective. Members become easier to reconstruct than non-members over RL training, improving MIA performance.

实验结果

研究问题

RQ1Can latent membership signals be elicited by actively updating model weights during reconstruction tasks?
RQ2Do ADRA 和 ADRA+ improve MIA performance over passive attacks across pre-training, post-training, and distillation scenarios?
RQ3Which reconstruction metrics and reward designs yield the strongest membership inference signals?
RQ4How do ADRA/ADRA+ scale with model size and data contamination settings?
RQ5How do newly constructed MIA datasets reflect real-world training data leakage risks?

主要发现

ADRA and ADRA+ consistently outperform prior MIAs across pre-training, post-training, and distillation settings.
ADRA+ achieves up to 60.6% AUROC on WikiMIA 2024 Hard (pre-training) and 85.9% AUROC on AIME (post-training) in original settings, surpassing baselines.
In distillation, ADRA attains near-perfect membership inference (S1.1: 98.4% AUROC).
RL-based optimization and contrastive reconstruction rewards are critical to gains, outperforming supervised fine-tuning.
Lexical reconstruction rewards outperform embedding or LLM-based rewards in AUROC across tested settings.
Post-training data show stronger memorization and are more extractable than pre-training data.

Table 4 : Pre-training member data reconstruction for original verbatim setting. Bold denotes the best average performance for each dataset. ADRA+ and ADRA consistently outperform the N-Sampling across all metrics. See Appendix E for paraphrased setting results.

更好的研究，从现在开始

从论文设计到论文写作，大幅缩短您的研究时间。

无需绑定信用卡

本解读由 AI 生成，并经人工编辑审核。