QUICK REVIEW

[Paper Review] Time Travel in LLMs: Tracing Data Contamination in Large Language Models

Shahriar Golchin, Mihai Surdeanu|arXiv (Cornell University)|Aug 16, 2023

Natural Language Processing Techniques | Cited by 22

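One-line summary

This paper presents a low-cost method for detecting data contamination in LLMs: it compares instance-level replication signals elicited by guided prompts against those from general prompts, then extrapolates to partition-level contamination using either GPT-4 in-context learning (ICL) or BLEURT/ROUGE-L scores. The method is evaluated on GPT-3.5 and GPT-4.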

ABSTRACT

Data contamination, i.e., the presence of test data from downstream tasks in the training data of large language models (LLMs), is a potential major issue in measuring LLMs' real effectiveness on other tasks. We propose a straightforward yet effective method for identifying data contamination within LLMs. At its core, our approach starts by identifying potential contamination at the instance level; using this information, our approach then assesses wider contamination at the partition level. To estimate contamination of individual instances, we employ "guided instruction:" a prompt consisting of the dataset name, partition type, and the random-length initial segment of a reference instance, asking the LLM to complete it. An instance is flagged as contaminated if the LLM's output either exactly or nearly matches the latter segment of the reference. To understand if an entire partition is contaminated, we propose two ideas. The first idea marks a dataset partition as contaminated if the average overlap score with the reference instances (as measured by ROUGE-L or BLEURT) is statistically significantly better with the completions from guided instruction compared to a "general instruction" that does not include the dataset and partition name. The second idea marks a dataset partition as contaminated if a classifier based on GPT-4 with few-shot in-context learning prompt marks multiple generated completions as exact/near-exact matches of the corresponding reference instances. Our best method achieves an accuracy between 92% and 100% in detecting if an LLM is contaminated with seven datasets, containing train and test/validation partitions, when contrasted with manual evaluation by human experts. Further, our findings indicate that GPT-4 is contaminated with AG News, WNLI, and XSum datasets.

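Motivation and objectives

Motivates the need to assess data contamination in LLMs, which arises when test data from downstream tasks ends up in the training data.
Proposes an instance-level contamination signal based on guided instructions that prompt the LLM to complete a partial dataset instance.
Develops partition-level contamination heuristics for estimating leakage of dataset splits (train/test/validation).
Evaluates the approach on multiple datasets and two LLMs, with validation by human experts.
Demonstrates the method's effectiveness and reveals contamination signals for specific datasets (AG News, WNLI, XSum).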

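Proposed method

Elicit completions from the LLM with a guided instruction that contains the dataset name, the partition, and a random-length initial segment of a reference instance.
Compare completions under guided and general instructions, scoring their overlap with the reference using BLEURT and ROUGE-L to measure instance-level contamination; separately, use GPT-4 few-shot in-context learning to detect exact and near-exact matches.
Algorithm 1: label a partition as contaminated if the average overlap under guided instruction is statistically significantly higher than under general instruction (non-parametric bootstrap test).
Algorithm 2: label a partition as contaminated if GPT-4 ICL flags at least one exact match or at least two near-exact matches among 10 instance completions.
To validate the exact/near-exact match criteria, run a controlled instance-replication study in which GPT-3.5 is deliberately contaminated and GPT-4 is tested on GSM8k.
Compare against ChatGPT-Cheat? and conduct a human evaluation to align the automatic signals with expert judgment.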
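
The two partition-level rules can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the prompt wording in `build_prompts`, the label strings ("exact", "near", "none"), the paired form of the bootstrap, and the defaults `n_resamples=10_000` and `alpha=0.05` are all hypothetical choices made for this example. Overlap scores are assumed to be precomputed per instance (for example with ROUGE-L or BLEURT), and the per-instance match labels are assumed to come from a separate judge such as GPT-4 ICL.

```python
import random


def build_prompts(dataset, split, first_piece):
    """Return (guided, general) prompts. The wording is illustrative only."""
    guided = (
        f"You are given the first piece of an instance from the {split} split "
        f"of the {dataset} dataset. Complete the second piece exactly as it "
        f"appears in the dataset.\n\nFirst piece: {first_piece}\nSecond piece:"
    )
    # The general prompt omits the dataset and partition names.
    general = (
        "Complete the second piece so that it continues the first piece."
        f"\n\nFirst piece: {first_piece}\nSecond piece:"
    )
    return guided, general


def algorithm1_contaminated(guided_scores, general_scores,
                            n_resamples=10_000, alpha=0.05, seed=0):
    """Paired bootstrap check (assumed variant): is the mean overlap under
    guided instruction significantly higher than under general instruction?

    Returns (is_contaminated, p_value).
    """
    if not guided_scores or len(guided_scores) != len(general_scores):
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)
    diffs = [g - b for g, b in zip(guided_scores, general_scores)]
    n = len(diffs)
    not_better = 0
    for _ in range(n_resamples):
        # Resample per-instance differences with replacement.
        sample = rng.choices(diffs, k=n)
        if sum(sample) / n <= 0:
            not_better += 1
    # Fraction of resamples in which guided did not beat general.
    p_value = not_better / n_resamples
    return p_value < alpha, p_value


def algorithm2_contaminated(labels):
    """Flag the partition if the judged completions contain at least one
    exact match or at least two near-exact matches."""
    exact = labels.count("exact")
    near = labels.count("near")
    return exact >= 1 or near >= 2
```

In use, `algorithm2_contaminated` would receive the 10 labels produced for one partition, while `algorithm1_contaminated` would receive two equally long lists of per-instance overlap scores, one per instruction type. Keeping the scoring and judging steps outside these functions makes the decision rules easy to test without calling any model.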

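Experimental results

Research questions

RQ1: When the pretraining data is unknown, can instance-level guided instructions reveal contamination signals in LLMs?
RQ2: Can partition-level heuristics based on overlap scores or GPT-4 ICL reliably identify contaminated dataset partitions?
RQ3: According to the proposed method, which datasets show evidence of contamination in current LLMs (GPT-4, GPT-3.5)?
RQ4: How do automatic signals compare with human judgment in detecting contamination across train and test splits?
RQ5: What practical limitations and sources of data contamination does this study highlight?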

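Key findings

The best method (Algorithm 2 with GPT-4 ICL) matched human labels in 14/14 settings for GPT-4 and 13/14 settings for GPT-3.5, an accuracy of 92%–100%.
GPT-4 showed contamination signals on the AG News, WNLI, and XSum datasets.
Algorithm 1's performance varied by metric and model; BLEURT and ROUGE-L gave mixed results.
ChatGPT-Cheat? labeled many partitions as suspicious but rarely agreed with the strict contamination judgments, underscoring the need for reasoning from the instance level to the partition level.
Human evaluation confirmed that GPT-4 had been exposed to train/test splits of AG News, WNLI, and XSum. For GPT-3.5, exposure was limited to the XSum test split.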
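
The 92%–100% range follows from the setting counts reported above. The abstract mentions seven datasets with train and test/validation partitions, which gives 14 settings per model. The short sketch below reproduces the arithmetic; the helper name `accuracy_percent` is hypothetical.

```python
def accuracy_percent(correct, total):
    """Share of settings where the method agreed with human labels, in percent."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * correct / total


# Seven datasets x two partitions = 14 settings per model.
settings_per_model = 7 * 2
gpt4_accuracy = accuracy_percent(14, settings_per_model)
gpt35_accuracy = accuracy_percent(13, settings_per_model)
print(f"GPT-4: {gpt4_accuracy:.2f}%  GPT-3.5: {gpt35_accuracy:.2f}%")
```

A single disagreement out of 14 settings accounts for the lower end of the range.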

This review was generated by AI and checked by a human editor.