QUICK REVIEW

[Paper Digest] Time Travel in LLMs: Tracing Data Contamination in Large Language Models

Shahriar Golchin, Mihai Surdeanu|arXiv (Cornell University)|Aug 16, 2023

Natural Language Processing Techniques | Cited by 22

One-Sentence Summary

The paper presents an inexpensive method for detecting data contamination in LLMs: it compares instance-level replication signals elicited by guided prompts against those elicited by general prompts, then extrapolates to partition-level contamination using either BLEURT/ROUGE-L overlap scores or a GPT-4 few-shot in-context learning (ICL) classifier, and evaluates the approach on GPT-3.5 and GPT-4.

ABSTRACT

Data contamination, i.e., the presence of test data from downstream tasks in the training data of large language models (LLMs), is a potential major issue in measuring LLMs' real effectiveness on other tasks. We propose a straightforward yet effective method for identifying data contamination within LLMs. At its core, our approach starts by identifying potential contamination at the instance level; using this information, our approach then assesses wider contamination at the partition level. To estimate contamination of individual instances, we employ "guided instruction:" a prompt consisting of the dataset name, partition type, and the random-length initial segment of a reference instance, asking the LLM to complete it. An instance is flagged as contaminated if the LLM's output either exactly or nearly matches the latter segment of the reference. To understand if an entire partition is contaminated, we propose two ideas. The first idea marks a dataset partition as contaminated if the average overlap score with the reference instances (as measured by ROUGE-L or BLEURT) is statistically significantly better with the completions from guided instruction compared to a "general instruction" that does not include the dataset and partition name. The second idea marks a dataset partition as contaminated if a classifier based on GPT-4 with few-shot in-context learning prompt marks multiple generated completions as exact/near-exact matches of the corresponding reference instances. Our best method achieves an accuracy between 92% and 100% in detecting if an LLM is contaminated with seven datasets, containing train and test/validation partitions, when contrasted with manual evaluation by human experts. Further, our findings indicate that GPT-4 is contaminated with AG News, WNLI, and XSum datasets.
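The guided and general instructions described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's actual prompts: the template wording, the function names (`split_instance`, `build_guided_prompt`, `build_general_prompt`), and the split fractions are all assumptions made for this example.

```python
import random


def split_instance(text, rng, min_frac=0.3, max_frac=0.7):
    """Split text into a random-length initial segment and the remainder.

    The word-level split and the fraction bounds are assumptions for
    illustration; the source only says the initial segment has random length.
    """
    words = text.split()
    if len(words) < 2:
        raise ValueError("need at least two words to split")
    last = len(words) - 1
    lo = min(last, max(1, int(len(words) * min_frac)))
    hi = min(last, max(lo, int(len(words) * max_frac)))
    cut = rng.randint(lo, hi)  # inclusive on both ends
    return " ".join(words[:cut]), " ".join(words[cut:])


def build_guided_prompt(dataset, split, first_piece):
    """Guided instruction: names the dataset and the partition (hypothetical wording)."""
    return (
        f"You are given the first piece of an instance from the {split} split "
        f"of the {dataset} dataset. Finish the second piece exactly as it "
        f"appears in the dataset.\n"
        f"First piece: {first_piece}\nSecond piece:"
    )


def build_general_prompt(first_piece):
    """General instruction: same task, no dataset or partition name (hypothetical wording)."""
    return (
        "Finish the second piece based on the first piece so that the two "
        "pieces form a single instance.\n"
        f"First piece: {first_piece}\nSecond piece:"
    )


if __name__ == "__main__":
    rng = random.Random(0)
    reference = "Stocks rallied on Tuesday as investors cheered strong earnings reports"
    first, second = split_instance(reference, rng)
    print(build_guided_prompt("AG News", "train", first))
    print(build_general_prompt(first))
    print("Reference continuation:", second)
```

The model's completion under each prompt would then be compared against the reference continuation (`second` in the usage above).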

Motivation and Objectives

Motivate the need to detect data contamination in LLMs, i.e., test data from downstream tasks appearing in the training data.
Propose an instance-level contamination signal via guided instruction that prompts LLMs to complete partial dataset instances.
Develop partition-level contamination heuristics to infer leakage for dataset splits (train/test/validation).
Evaluate the approach across multiple datasets and two LLMs with human expert validation.
Demonstrate the method’s effectiveness and reveal contamination signals in specific datasets (AG News, WNLI, XSum).

Proposed Method

Use guided instruction that includes dataset name, partition, and a random initial segment of a reference instance to elicit completions from the LLM.
Score each completion against the remainder of the reference instance with BLEURT and ROUGE-L, under both guided and general instructions; also use GPT-4 few-shot in-context learning (ICL) to detect exact/near-exact matches.
Algorithm 1: label a partition contaminated if the average overlap score under guided instruction is statistically significantly higher than under general instruction, as determined by a non-parametric bootstrap test.
Algorithm 2: label a partition contaminated if GPT-4 ICL flags at least one exact match or at least two near-exact matches among ten instance completions.
Conduct a controlled instance-replication study by intentionally contaminating GPT-3.5 and testing GPT-4 on GSM8k to validate exact/near-exact match criteria.
Compare against the "ChatGPT-Cheat?" method and perform human evaluation to align automated signals with expert judgments.
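The two partition-level decision rules above can be sketched in a few lines. This is a hedged illustration under stated assumptions: the source only says Algorithm 1 uses a non-parametric bootstrap test, so the paired resampling scheme, the significance level `alpha=0.05`, the label strings, and all function names here are assumptions rather than the paper's implementation. The overlap scores are taken as given (e.g., precomputed BLEURT or ROUGE-L values).

```python
import random
from statistics import mean


def algorithm2_label(match_labels):
    """Algorithm 2 rule: contaminated if at least one exact match or at
    least two near-exact matches among the judged completions.

    match_labels holds one string per completion: "exact", "near-exact",
    or anything else for no match (label strings are an assumption).
    """
    exact = sum(1 for label in match_labels if label == "exact")
    near = sum(1 for label in match_labels if label == "near-exact")
    return exact >= 1 or near >= 2


def bootstrap_p_value(guided, general, n_resamples=10_000, seed=0):
    """One-sided paired bootstrap: fraction of resamples in which the mean
    guided-minus-general score difference is not positive."""
    if len(guided) != len(general) or not guided:
        raise ValueError("score lists must be non-empty and of equal length")
    rng = random.Random(seed)
    diffs = [g - b for g, b in zip(guided, general)]
    n = len(diffs)
    not_better = 0
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        if mean(sample) <= 0:
            not_better += 1
    return not_better / n_resamples


def algorithm1_label(guided, general, alpha=0.05):
    """Algorithm 1 rule: contaminated if guided scores are significantly
    higher than general scores (alpha is an assumed threshold)."""
    return bootstrap_p_value(guided, general) < alpha


if __name__ == "__main__":
    # Made-up scores for ten instances, for illustration only.
    guided = [0.90, 0.80, 0.85, 0.95, 0.70, 0.90, 0.88, 0.92, 0.81, 0.86]
    general = [0.30, 0.35, 0.40, 0.25, 0.30, 0.45, 0.33, 0.38, 0.29, 0.41]
    print("Algorithm 1:", algorithm1_label(guided, general))
    print("Algorithm 2:", algorithm2_label(["near-exact", "near-exact"] + ["none"] * 8))
```

Resampling paired differences is one of several reasonable bootstrap designs; it is used here because each instance yields one guided score and one general score.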

Experimental Results

Research Questions

RQ1: Can instance-level guided instruction reveal contamination signals in LLMs when pre-training data are unknown?
RQ2: Do partition-level heuristics based on overlap scores or GPT-4 ICL reliably identify contaminated dataset partitions?
RQ3: Which datasets show evidence of contamination in contemporary LLMs (GPT-4, GPT-3.5) according to the proposed methods?
RQ4: How do automated signals compare with human judgments in detecting contamination across train/test splits?
RQ5: What are practical limitations and sources of data contamination highlighted by the study?

Key Findings

The best method (Algorithm 2 with GPT-4 ICL) reached 92%–100% accuracy against human labels, matching them in 14 of 14 settings for GPT-4 and 13 of 14 settings for GPT-3.5.
GPT-4 showed contamination signals in AG News, WNLI, and XSum datasets.
Algorithm 1’s performance varied by metric and model; BLEURT/ROUGE-L gave mixed results.
ChatGPT-Cheat? labeled many partitions as suspicious, but its labels often did not align with strict contamination judgments, underscoring the need for instance-level to partition-level reasoning.
Human evaluation confirmed that GPT-4 had been exposed to the train/test partitions of AG News and WNLI and to XSum; for GPT-3.5, exposure was limited to the XSum test partition.
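As a quick check on the reported accuracy range, the agreement counts quoted above (14 of 14 and 13 of 14) can be converted to percentages. The helper name `agreement_accuracy` is an assumption made for this example.

```python
def agreement_accuracy(matches, total):
    """Fraction of settings where the automated label matches the human label."""
    if total <= 0:
        raise ValueError("total must be positive")
    return matches / total


if __name__ == "__main__":
    for model, matches in (("GPT-4", 14), ("GPT-3.5", 13)):
        print(f"{model}: {agreement_accuracy(matches, 14):.2%}")
    # GPT-4: 100.00%
    # GPT-3.5: 92.86%
```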


This digest was generated by AI and reviewed by human editors.