QUICK REVIEW

[Paper Review] The Impact of Post-training on Data Contamination

Muhammed Yusuf Kocyigit, Caglar Yildirim|arXiv (Cornell University)|Jan 3, 2026

Natural Language Processing Techniques | Cited by 0

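One-Sentence Summary

This study injects controlled data contamination into the extended pre-training of large language models and compares the downstream effects after supervised fine-tuning (SFT) and GRPO-based reinforcement learning. It shows that contamination can resurface and generalize during post-training, and that these effects grow with model scale.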

ABSTRACT

We present a controlled study of how dataset contamination interacts with the post-training stages now standard in large language model training pipelines. Starting from clean checkpoints of Qwen2.5 (0.5B/1.5B) and Gemma3 (1B/4B), we inject five copies of GSM8K and MBPP test items into the first 2B tokens of an otherwise 25B token extended pre-training dataset. We then compare the contaminated and clean models both immediately after pre-training and again after two popular post-training methods: supervised fine-tuning (SFT) and reinforcement learning (RL) with group relative policy optimization (GRPO). The applied post-training steps do not have any contamination. Across math and coding benchmarks, we find three consistent patterns: (i) Contamination causes performance spikes that are gradually diminished with continued pre-training. After even 25B tokens the apparent performance inflation of contamination can become close to zero. (ii) Both SFT and GRPO resurface the leaked information, but with different external validity: SFT inflates scores only on the contaminated tasks, whereas GRPO also inflates performance on uncontaminated counterparts (GSMPlus, HumanEval). (iii) Model scale amplifies these tendencies: larger Supervised Fine Tuned models memorize more, while larger GRPO models translate leakage into more generalizable capabilities. Our results underscore the need for contamination audits after post-training and suggest that RL-based post-training, although not immune, can help alleviate contamination-related over-estimation problems.

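Motivation and Objectives

Assess how post-training stages interact with pre-training data contamination in large language models.
Evaluate contamination effects on math and coding tasks under two post-training paradigms (SFT and RL with GRPO).
Examine how model scale affects the memorization of contaminated data and its generalization across post-training.
Provide guidance for contamination audits and for evaluating the effects of data leakage across the full training lifecycle.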

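Proposed Method

Inject five copies of GSM8K and MBPP test items into the first 2B tokens of a 25B-token extended pre-training dataset.
Run extended pre-training on Qwen2.5 (0.5B/1.5B) and Gemma3 (1B/4B) to produce both contaminated and clean checkpoints.
Apply two post-training procedures (SFT and GRPO-based RL) on the corresponding training splits and compare the results.
Evaluate on GSM8K and MBPP as the contaminated benchmarks, and on GSMPlus and HumanEval as uncontaminated benchmarks that measure generalization.
Use LM Evaluation Harness and a math verification tool to keep evaluation consistent across all settings.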
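
The injection step above can be illustrated with a minimal sketch. It is not the authors' pipeline: it works on whole documents rather than tokens, and the function name `inject_contamination`, the `window_docs` parameter, and the uniform shuffling inside the window are all assumptions made for illustration.

```python
import random


def inject_contamination(clean_docs, test_items, window_docs, copies=5, seed=0):
    """Return a new stream with `copies` of each test item mixed into the
    first `window_docs` clean documents (hypothetical helper, not from the paper)."""
    rng = random.Random(seed)
    # The leading window stands in for the "first 2B tokens" of the 25B-token run.
    window = list(clean_docs[:window_docs])
    leaked = [item for item in test_items for _ in range(copies)]
    window.extend(leaked)
    rng.shuffle(window)  # scatter leaked items among the clean window documents
    # Everything after the window stays clean and keeps its original order.
    return window + list(clean_docs[window_docs:])


# Toy stream: 250 clean documents, window = 250 * 2 / 25 = 20 documents.
clean = [f"clean-doc-{i}" for i in range(250)]
tests = [f"TEST-ITEM-{i}" for i in range(3)]
mixed = inject_contamination(clean, tests, window_docs=20)
```

Keeping the leaked items inside an early window means all later training steps see only clean data, which is what makes it possible to observe whether the initial performance spike fades with continued pre-training.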

Figure 1 : An Overview of our Method: We take existing pre-trained models and run them through extended pre-training with and without contamination. Afterwards we post-train them using SFT or RL methods and compare their performance. The pre-trained checkpoints here are from Qwen2.5 and Gemma3 non-i

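Experimental Results

Research Questions

RQ1: Does post-training mitigate or exacerbate the performance over-estimation caused by data contamination?
RQ2: Do contamination effects differ between SFT and GRPO post-training?
RQ3: How does model scale affect the persistence or generalization of contamination after post-training?
RQ4: When contamination is present, do post-training procedures yield performance gains on uncontaminated benchmarks?
RQ5: What is the lifecycle impact of contamination on downstream tasks, from pre-training through post-training?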

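Key Findings

Contamination can cause a performance spike at the point of exposure, which fades with further pre-training; the leaked information nevertheless remains retrievable during post-training.
SFT mainly inflates scores on the contaminated tasks, whereas GRPO also raises performance on uncontaminated benchmarks, which indicates broader generalization rather than pure memorization.
Under SFT, model scale amplifies the contamination effect, with larger models memorizing more; under GRPO, leakage translates into gains on both the contaminated and the external benchmarks.
Post-training reawakens contamination effects, and in some settings produces a gap of about 4 points compared with pre-training alone.
GRPO tends to deliver more generalizable gains and can reduce the contamination gap as scale increases; SFT, by contrast, tends to concentrate its gains on the contaminated tasks.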
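
The comparison behind these findings comes down to a per-benchmark gap between the contaminated and the clean model, read separately for leaked and held-out benchmarks. The sketch below shows one way to compute it. The helper names are hypothetical and the accuracies are placeholders invented for illustration; they are not results from the paper.

```python
def contamination_gaps(contaminated_scores, clean_scores):
    """Per-benchmark gap: contaminated-model score minus clean-model score."""
    return {
        bench: contaminated_scores[bench] - clean_scores[bench]
        for bench in clean_scores
    }


def split_gaps(gaps, leaked_benchmarks):
    """Average the gaps separately over leaked and held-out benchmarks.

    Assumes at least one benchmark falls in each group.
    """
    leaked = [g for b, g in gaps.items() if b in leaked_benchmarks]
    held_out = [g for b, g in gaps.items() if b not in leaked_benchmarks]
    return sum(leaked) / len(leaked), sum(held_out) / len(held_out)


# Placeholder accuracies (percent), invented for illustration only.
clean = {"GSM8K": 50.0, "MBPP": 40.0, "GSMPlus": 30.0, "HumanEval": 20.0}
contaminated = {"GSM8K": 53.0, "MBPP": 41.0, "GSMPlus": 31.0, "HumanEval": 20.0}

gaps = contamination_gaps(contaminated, clean)
leaked_gap, held_out_gap = split_gaps(gaps, leaked_benchmarks={"GSM8K", "MBPP"})
```

A gap that appears only on the leaked benchmarks matches the memorization pattern described for SFT, while a gap that also appears on the held-out benchmarks matches the generalization pattern described for GRPO.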

This review was generated by AI and checked by human editors.