QUICK REVIEW

[Paper Review] The Impact of Post-training on Data Contamination

Muhammed Yusuf Kocyigit, Caglar Yildirim|arXiv (Cornell University)|Jan 3, 2026

Natural Language Processing Techniques | Citations: 0

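One-Sentence Summary

This study injects controlled data contamination into the extended pre-training of large language models and compares the downstream effects after supervised fine-tuning (SFT) and reinforcement learning with GRPO. Contamination can resurface and generalize after post-training, and its effects are amplified by model size.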

ABSTRACT

We present a controlled study of how dataset contamination interacts with the post-training stages now standard in large language model training pipelines. Starting from clean checkpoints of Qwen2.5 (0.5B/1.5B) and Gemma3 (1B/4B), we inject five copies of GSM8K and MBPP test items into the first 2B tokens of an otherwise 25B token extended pre-training dataset. We then compare the contaminated and clean models both immediately after pre-training and again after two popular post-training methods: supervised fine-tuning (SFT) and reinforcement learning (RL) with group relative policy optimization (GRPO). The applied post-training steps do not have any contamination. Across math and coding benchmarks, we find three consistent patterns: (i) Contamination causes performance spikes that are gradually diminished with continued pre-training. After even 25B tokens the apparent performance inflation of contamination can become close to zero. (ii) Both SFT and GRPO resurface the leaked information, but with different external validity: SFT inflates scores only on the contaminated tasks, whereas GRPO also inflates performance on uncontaminated counterparts (GSMPlus, HumanEval). (iii) Model scale amplifies these tendencies: larger supervised fine-tuned models memorize more, while larger GRPO models translate leakage into more generalizable capabilities. Our results underscore the need for contamination audits after post-training and suggest that RL-based post-training, although not immune, can help alleviate contamination-related over-estimation problems.

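Motivation and Objectives

Evaluate how pre-training data contamination interacts with the post-training stages of large language models.
Evaluate contamination effects on math and coding tasks after two post-training paradigms, SFT and GRPO.
Examine how model scale affects the memorization and generalization of contamination after post-training.
Provide contamination audits and guidance for assessing the lifecycle effects of data leakage.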

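Proposed Method

Inject five copies of GSM8K and MBPP test items into the first 2B tokens of a 25B-token extended pre-training dataset.
Starting from Qwen2.5 (0.5B/1.5B) and Gemma3 (1B/4B), produce contaminated and clean checkpoints through extended pre-training with and without the injected items.
Apply two post-training procedures (SFT and GRPO-based RL) on the corresponding training splits and compare the results.
Evaluate GSM8K and MBPP as contaminated benchmarks and GSMPlus and HumanEval as uncontaminated benchmarks to assess generalization.
Use LM Evaluation Harness and the math-verify tool to keep evaluation consistent across settings.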
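
The injection step listed above can be illustrated with a toy sketch. This is not the authors' pipeline: the function names `count_tokens` and `inject_contamination`, the whitespace tokenizer, and the choice to shuffle leaked items uniformly within the window are all assumptions made for illustration. The review only states that five copies of the test items are placed in the first 2B tokens of the 25B-token corpus.

```python
import random
from typing import List


def count_tokens(text: str) -> int:
    # Assumption: whitespace splitting stands in for a real tokenizer.
    return len(text.split())


def inject_contamination(
    docs: List[str],
    test_items: List[str],
    copies: int = 5,
    window_tokens: int = 2_000_000_000,
    seed: int = 0,
) -> List[str]:
    """Mix `copies` of each test item into the leading `window_tokens` of a corpus.

    Hypothetical helper: documents after the window are left untouched and in order.
    """
    rng = random.Random(seed)

    # Count how many leading documents are needed to cover the window.
    used, cut = 0, 0
    for doc in docs:
        if used >= window_tokens:
            break
        used += count_tokens(doc)
        cut += 1

    # Add the repeated test items to the head and shuffle only that part.
    head = list(docs[:cut]) + [item for item in test_items for _ in range(copies)]
    rng.shuffle(head)
    return head + list(docs[cut:])


# Toy usage: ten 4-token documents, a 12-token window, two leaked items.
corpus = [f"doc {i} filler text" for i in range(10)]
leaked = ["leaked test item A", "leaked test item B"]
mixed = inject_contamination(corpus, leaked, copies=5, window_tokens=12)
```

In this toy run the first three documents fall inside the window, so all ten leaked copies end up among the first 13 entries while the remaining seven documents keep their original order. A clean run for comparison would simply skip the injection and train on `corpus` as is.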

Figure 1 : An Overview of our Method: We take existing pre-trained models and run them through extended pre-training with and without contamination. Afterwards we post-train them using SFT or RL methods and compare their performance. The pre-trained checkpoints here are from Qwen2.5 and Gemma3 non-i

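Experimental Results

Research Questions

RQ1: Does post-training mitigate or amplify the performance over-estimation caused by data contamination?
RQ2: Do contamination effects differ between SFT and GRPO post-training?
RQ3: How does model scale affect the persistence or generalization of contamination after post-training?
RQ4: When contamination is present, do post-training procedures produce gains on uncontaminated benchmarks?
RQ5: What is the lifecycle impact of contamination on downstream tasks, from pre-training through post-training?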

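Key Findings

Contamination can cause a performance spike during exposure that fades with continued pre-training, yet the leaked information can be recovered during post-training.
SFT inflates scores mainly on the contaminated tasks, whereas GRPO also inflates performance on uncontaminated benchmarks, indicating broader generalization rather than mere memorization.
Model scale amplifies contamination effects under SFT, with larger models memorizing more, whereas under GRPO larger models carry the leakage over into improvements on uncontaminated benchmarks.
Post-training revives contamination effects that pre-training alone had masked, producing a gap of about 4 points in some settings.
GRPO tends to yield more generalizable improvements and to shrink the contamination gap with scale, whereas SFT tends to concentrate gains on the contaminated tasks.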
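
The "gap" in these findings is read here as the score of the contaminated model minus the score of its clean counterpart on the same benchmark. The sketch below assumes that definition; the helper name `contamination_gap`, the benchmark labels, and all numbers are hypothetical placeholders, not results from the paper.

```python
from typing import Dict


def contamination_gap(
    contaminated: Dict[str, float], clean: Dict[str, float]
) -> Dict[str, float]:
    """Per-benchmark difference: contaminated score minus clean score.

    Assumed definition; only benchmarks present in both dicts are compared.
    """
    return {
        name: contaminated[name] - clean[name]
        for name in contaminated
        if name in clean
    }


# Placeholder scores (NOT from the paper), in accuracy points.
contaminated_scores = {"leaked_benchmark": 50.0, "held_out_benchmark": 40.0}
clean_scores = {"leaked_benchmark": 46.0, "held_out_benchmark": 40.0}

gaps = contamination_gap(contaminated_scores, clean_scores)
```

Under this reading, a positive gap only on the leaked benchmark would match the SFT pattern described above, while a positive gap on the held-out benchmark as well would match the GRPO pattern.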

This review was generated by AI and checked by a human editor.