QUICK REVIEW

[Paper Review] Learning to Detect Language Model Training Data via Active Reconstruction

Junjie Oscar Yin, John X. Morris|arXiv (Cornell University)|Feb 22, 2026

Topic Modeling|Citations: 0

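One-Line Summary

Summary: This paper proposes ADRA and ADRA+ (Active Data Reconstruction Attacks), which actively finetune the target LLM with reinforcement learning so that it reconstructs candidate training data, improving membership inference across pre-training, post-training, and distillation settings.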

ABSTRACT

Detecting LLM training data is generally framed as a membership inference attack (MIA) problem. However, conventional MIAs operate passively on fixed model weights, using log-likelihoods or text generations. In this work, we introduce Active Data Reconstruction Attack (ADRA), a family of MIA that actively induces a model to reconstruct a given text through training. We hypothesize that training data are more reconstructible than non-members, and the difference in their reconstructibility can be exploited for membership inference. Motivated by findings that reinforcement learning (RL) sharpens behaviors already encoded in weights, we leverage on-policy RL to actively elicit data reconstruction by finetuning a policy initialized from the target model. To effectively use RL for MIA, we design reconstruction metrics and contrastive rewards. The resulting algorithms, ADRA and its adaptive variant ADRA+, improve both reconstruction and detection given a pool of candidate data. Experiments show that our methods consistently outperform existing MIAs in detecting pre-training, post-training, and distillation data, with an average improvement of 10.7% over the previous runner-up. In particular, ADRA+ improves over Min-K%++ by 18.8% on BookMIA for pre-training detection and by 7.6% on AIME for post-training detection.

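Motivation and Objectives

Motivates the need to detect LLM training data leakage caused by memorization, along with the privacy concerns it raises.
Proposes an active MIA framework that updates the model's weights through training to draw out latent membership signals.
Shows that RL-based reconstruction rewards reveal stronger membership signals than passive MIAs across multiple models and data regimes.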

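Proposed Method

Formulates ADRA as on-policy RL that adapts the target model toward maximal reconstruction similarity.
Designs contrastive rewards to guide reconstruction: one variant uses the true suffix together with a pool of distractors (matching), and another adaptively includes the true suffix based on a prior probability (adaptive matching).
Defines several lexical reconstruction reward metrics to measure reconstruction quality (token set similarity, longest common subsequence, n-gram set coverage).
Applies Group Relative Policy Optimization (GRPO) to update the policy, then resamples completions from the finetuned model to assess membership.
Curates six new MIA datasets aligned with frontier-model knowledge cutoffs and post-training scale.
Compares ADRA/ADRA+ against passive MIAs (loss-based, zlib, Min-K%, Min-K%++, N-Sampling) on open-weight LLMs.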
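The three lexical reconstruction metrics named above can be illustrated with a minimal sketch. The function names, the whitespace tokenizer, the normalization by reference length, and the default n-gram size are all assumptions made for illustration; the paper's exact definitions may differ.

```python
def tokenize(text: str) -> list[str]:
    # Assumption: lowercase whitespace tokens; a real setup might use the model tokenizer.
    return text.lower().split()


def token_set_similarity(candidate: str, reference: str) -> float:
    """Jaccard overlap between the token sets of candidate and reference."""
    a, b = set(tokenize(candidate)), set(tokenize(reference))
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def lcs_ratio(candidate: str, reference: str) -> float:
    """Longest common subsequence length divided by the reference length."""
    x, y = tokenize(candidate), tokenize(reference)
    if not y:
        return 0.0
    prev = [0] * (len(y) + 1)  # LCS lengths for the previous candidate token
    for xi in x:
        curr = [0]
        for j, yj in enumerate(y, start=1):
            if xi == yj:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1] / len(y)


def ngram_coverage(candidate: str, reference: str, n: int = 2) -> float:
    """Fraction of distinct reference n-grams that also occur in the candidate."""
    x, y = tokenize(candidate), tokenize(reference)
    ref_ngrams = {tuple(y[i:i + n]) for i in range(len(y) - n + 1)}
    if not ref_ngrams:
        return 0.0
    cand_ngrams = {tuple(x[i:i + n]) for i in range(len(x) - n + 1)}
    return len(ref_ngrams & cand_ngrams) / len(ref_ngrams)


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    candidate = "the cat sat on a mat"
    print(token_set_similarity(candidate, reference))
    print(lcs_ratio(candidate, reference))
    print(ngram_coverage(candidate, reference))
```

All three scores lie in [0, 1] and reward surface-level overlap with the true suffix, which is the property a reconstruction reward needs; they differ in how much word order matters (not at all for token sets, partially for n-grams, fully for the subsequence).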

Figure 1 : Active Data Reconstruction Attack. Language model generates reconstructions from a candidate prefix and is rewarded via a contrastive objective. Members become easier to reconstruct than non-members over RL training, improving MIA performance.
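To make the contrastive objective and the GRPO update signal more concrete, here is a rough sketch under stated assumptions. The reward form (similarity to the true suffix minus the best similarity to any distractor) is a hypothetical stand-in, not the paper's formula, and `contrastive_reward`, `jaccard`, and `group_relative_advantages` are illustrative names. The advantage step follows the usual GRPO idea of normalizing each reward against the other completions sampled for the same prompt; using the population standard deviation is also an assumption.

```python
from statistics import mean, pstdev


def jaccard(a: str, b: str) -> float:
    """Token-set overlap, used here as a stand-in similarity measure."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 1.0


def contrastive_reward(completion: str, true_suffix: str, distractors: list[str]) -> float:
    """Hypothetical reward: closeness to the true suffix minus closeness to the nearest distractor."""
    positive = jaccard(completion, true_suffix)
    negative = max((jaccard(completion, d) for d in distractors), default=0.0)
    return positive - negative


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one group of completions sampled for the same prefix."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    true_suffix = "the quick brown fox"
    distractors = ["a slow green turtle", "one lazy dog sleeps"]
    completions = ["the quick brown fox", "a slow green turtle", "the quick red fox"]
    rewards = [contrastive_reward(c, true_suffix, distractors) for c in completions]
    print(rewards)
    print(group_relative_advantages(rewards))
```

Under this toy reward, a completion that drifts toward a distractor is penalized rather than merely unrewarded, which is the intuition behind a contrastive signal; the group normalization then pushes the policy toward completions that beat the group average.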

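Experimental Results

Research Questions

RQ1: Can actively updating weights during a reconstruction task draw out latent membership signals?
RQ2: Do ADRA and ADRA+ improve MIA performance over passive attacks across pre-training, post-training, and distillation scenarios?
RQ3: Which reconstruction metrics and reward designs produce the strongest membership inference signal?
RQ4: How well do ADRA/ADRA+ scale with model size and data contamination settings?
RQ5: How do the newly constructed MIA datasets reflect real-world training data leakage risks?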

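Key Findings

ADRA and ADRA+ consistently outperform prior MIAs in pre-training, post-training, and distillation settings alike.
ADRA+ reaches up to 60.6% AUROC on WikiMIA 2024 Hard (pre-training) and up to 85.9% AUROC on AIME (post-training), exceeding the baselines.
In distillation, ADRA achieves near-perfect membership inference (S1.1: 98.4% AUROC).
RL-based optimization and contrastive reconstruction rewards are the decisive factors in the gains, outperforming supervised finetuning.
Lexical reconstruction rewards yield higher AUROC than embedding-based or LLM-based rewards across the tested settings.
Post-training data is more strongly memorized, and more extractable, than pre-training data.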
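Since the findings are reported as AUROC, a short sketch of how that number can be computed from per-candidate membership scores may help. It uses the standard pairwise definition: the probability that a randomly chosen member scores higher than a randomly chosen non-member, with ties counted as half. The function name and the toy scores are hypothetical and are not taken from the paper.

```python
def auroc(member_scores: list[float], nonmember_scores: list[float]) -> float:
    """AUROC as the fraction of (member, non-member) pairs ranked correctly; ties count 0.5."""
    if not member_scores or not nonmember_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))


if __name__ == "__main__":
    # Toy reconstruction scores (higher = more reconstructible); purely illustrative.
    members = [0.9, 0.8, 0.4]
    nonmembers = [0.5, 0.3, 0.2]
    print(auroc(members, nonmembers))
```

A value of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, so 60.6% indicates a modest signal while 98.4% is close to perfect separation.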

Table 4 : Pre-training member data reconstruction for original verbatim setting. Bold denotes the best average performance for each dataset. ADRA+ and ADRA consistently outperform the N-Sampling across all metrics. See Appendix E for paraphrased setting results.


This review was generated by AI and checked by a human editor.