QUICK REVIEW

[Paper Review] The Impact of Post-training on Data Contamination

Muhammed Yusuf Kocyigit, Caglar Yildirim | arXiv (Cornell University) | 2026-01-03

Natural Language Processing Techniques | Citations: 0

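One-Line Summary

This study injects controlled data contamination during the extended pre-training of large language models and compares downstream effects after supervised fine-tuning (SFT) and reinforcement learning with GRPO. It finds that contamination can resurface or generalize after post-training, and that model scale amplifies these effects.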

ABSTRACT

We present a controlled study of how dataset contamination interacts with the post-training stages now standard in large language model training pipelines. Starting from clean checkpoints of Qwen2.5 (0.5B/1.5B) and Gemma3 (1B/4B), we inject five copies of GSM8K and MBPP test items into the first 2B tokens of an otherwise 25B token extended pre-training dataset. We then compare the contaminated and clean models both immediately after pre-training and again after two popular post-training methods: supervised fine-tuning (SFT) and reinforcement learning (RL) with group relative policy optimization (GRPO). The applied post-training steps do not have any contamination. Across math and coding benchmarks, we find three consistent patterns: (i) Contamination causes performance spikes that are gradually diminished with continued pre-training. After even 25B tokens the apparent performance inflation of contamination can become close to zero. (ii) Both SFT and GRPO resurface the leaked information, but with different external validity: SFT inflates scores only on the contaminated tasks, whereas GRPO also inflates performance on uncontaminated counterparts (GSMPlus, HumanEval). (iii) Model scale amplifies these tendencies: larger supervised fine-tuned models memorize more, while larger GRPO models translate leakage into more generalizable capabilities. Our results underscore the need for contamination audits after post-training and suggest that RL-based post-training, although not immune, can help alleviate contamination-related over-estimation problems.

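Research Motivation and Objectives

Assess how pre-training data contamination interacts with the post-training stages of large language models.
Evaluate contamination effects on math and coding tasks after two post-training paradigms (SFT and RL with GRPO).
Investigate how model scale affects the memorization and generalization of contamination across post-training.
Provide contamination audits and guidance for assessing the life-cycle effects of data leakage in LLMs.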

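Proposed Method

Inject five copies of GSM8K and MBPP test items into the first 2B tokens of a 25B-token extended pre-training dataset.
Run extended pre-training from clean Qwen2.5 (0.5B/1.5B) and Gemma3 (1B/4B) checkpoints, producing a contaminated and a clean variant of each model.
Apply two post-training procedures (SFT and GRPO-based RL) on the corresponding training splits and compare the results.
To assess generalization, evaluate with GSM8K and MBPP as the contaminated benchmarks and GSMPlus and HumanEval as the uncontaminated benchmarks.
Use the LM Evaluation Harness and a math verification tool to keep evaluation consistent across configurations.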
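The injection step in the list above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `inject_contamination` and its parameters are hypothetical, and the leading window is measured in documents rather than tokens purely to keep the example short (the paper describes a window of 2B out of 25B tokens).

```python
import random


def inject_contamination(clean_docs, test_items, copies=5,
                         window_num=2, window_den=25, seed=0):
    """Return a new corpus with leaked test items scattered in its leading window.

    Assumption: the window is window_num/window_den of the *documents*,
    standing in for the 2B-of-25B *token* window described in the review.
    """
    rng = random.Random(seed)
    # Integer arithmetic avoids floating-point surprises when sizing the window.
    window = max(1, len(clean_docs) * window_num // window_den)
    head = list(clean_docs[:window])
    tail = list(clean_docs[window:])
    # Each benchmark test item is repeated `copies` times.
    leaked = [item for item in test_items for _ in range(copies)]
    head.extend(leaked)
    rng.shuffle(head)  # mix leaked items among the window's clean documents
    return head + tail  # the rest of the corpus stays clean and in order
```

Running the same function with an empty `test_items` list yields the clean control corpus, which mirrors the contaminated-versus-clean comparison described above.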

Figure 1: An Overview of our Method: We take existing pre-trained models and run them through extended pre-training with and without contamination. Afterwards we post-train them using SFT or RL methods and compare their performance. The pre-trained checkpoints here are from Qwen2.5 and Gemma3 non-i

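Experimental Results

Research Questions

RQ1: Does post-training mitigate or worsen the performance over-estimation caused by data contamination?
RQ2: Do contamination effects differ between SFT and GRPO post-training?
RQ3: How does model scale affect the persistence or generalization of contamination after post-training?
RQ4: When contamination is present, do post-training procedures produce gains on uncontaminated benchmarks?
RQ5: What is the life-cycle impact of contamination on downstream tasks, from pre-training through post-training?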

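Key Findings

Contamination can cause performance spikes during exposure; even when these fade with continued pre-training, the leaked information can be recovered during post-training.
SFT inflates scores mainly on the contaminated tasks, whereas GRPO also inflates performance on uncontaminated benchmarks, suggesting broader generalization rather than pure memorization.
Under SFT, model scale amplifies contamination effects, with larger models memorizing more; GRPO, by contrast, converts leakage into improvements on both the contaminated and the external benchmarks.
Post-training resurfaces contamination effects that pre-training alone can mask, producing gaps of roughly 4 points in some settings.
GRPO tends to yield more generalizable improvements and can narrow the scale-dependent contamination gap, while SFT tends to concentrate its gains on the contaminated tasks.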
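One way to read the findings above is as a per-benchmark gap between the contaminated and clean model, reported side by side for each leaked benchmark and its uncontaminated counterpart. The sketch below assumes that framing; the helper names `contamination_gaps` and `split_gaps` are hypothetical, and only the benchmark pairing (GSM8K with GSMPlus, MBPP with HumanEval) comes from the review.

```python
# Contaminated benchmark -> uncontaminated counterpart, as paired in the review.
COUNTERPARTS = {"GSM8K": "GSMPlus", "MBPP": "HumanEval"}


def contamination_gaps(contaminated, clean):
    """Score difference (contaminated model minus clean model) per benchmark."""
    return {name: contaminated[name] - clean[name]
            for name in clean if name in contaminated}


def split_gaps(gaps, counterparts=COUNTERPARTS):
    """Pair each leaked benchmark's gap with its held-out counterpart's gap.

    A large first value with a near-zero second value points to memorization;
    gaps on both sides point to leakage carrying over to the held-out benchmark.
    """
    return {leaked: (gaps[leaked], gaps[held_out])
            for leaked, held_out in counterparts.items()
            if leaked in gaps and held_out in gaps}
```

Computing these pairs separately after pre-training, after SFT, and after GRPO would show at which stage a gap reappears, which is the kind of post-training audit the paper argues for.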


This review was generated by AI and reviewed by a human editor.