QUICK REVIEW

[Paper Review] Dated Data: Tracing Knowledge Cutoffs in Large Language Models

Jeffrey Cheng, Marc Marone|arXiv (Cornell University)|Mar 19, 2024

Topic Modeling | Citations: 7

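One-line summary

This paper defines resource-level effective cutoffs for LLMs, shows that they often diverge from the reported cutoffs, and analyzes causes such as deduplication and temporal misalignment in CommonCrawl data.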

ABSTRACT

Released Large Language Models (LLMs) are often paired with a claimed knowledge cutoff date, or the dates at which training data was gathered. Such information is crucial for applications where the LLM must provide up to date information. However, this statement only scratches the surface: do all resources in the training data share the same knowledge cutoff date? Does the model's demonstrated knowledge for these subsets closely align to their cutoff dates? In this work, we define the notion of an effective cutoff. This is distinct from the LLM designer reported cutoff and applies separately to sub-resources and topics. We propose a simple approach to estimate effective cutoffs on the resource-level temporal alignment of an LLM by probing across versions of the data. Using this analysis, we find that effective cutoffs often differ from reported cutoffs. To understand the root cause of this observation, we conduct a direct large-scale analysis on open pre-training datasets. Our analysis reveals two reasons for these inconsistencies: (1) temporal biases of CommonCrawl data due to non-trivial amounts of old data in new dumps and (2) complications in LLM deduplication schemes involving semantic duplicates and lexical near-duplicates. Overall, our results show that knowledge cutoffs are not as simple as they have seemed and that care must be taken both by LLM dataset curators as well as practitioners who seek to use information from these models.

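Motivation and objectives

Define the notion of an effective cutoff for LLMs, one that applies separately to sub-resources and topics.
Develop a simple method that estimates resource-level effective cutoffs by probing across versions of the data.
Evaluate, across multiple models and datasets, whether effective cutoffs agree with reported cutoffs.
Investigate root causes in the pre-training data and deduplication pipelines.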

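Proposed method

Collect time-spanning resource sets: WikiSpan (Wikipedia versions from 2016–2023) and NewsSpan (NYT articles from 2016–2020).
Probe each LLM by measuring perplexity on the documents in each time bucket and computing a normalized relative perplexity.
Aggregate the perplexities with median-based smoothing and min-max scale them to 0–1 so that resource-level cutoffs can be compared across models.
Mine the pre-training data with BM25 indexing and edit-distance matching to identify near-duplicates and copies that could affect the perplexity trends.
Analyze three major data sources, C4, the Pile, and RefinedWeb, and their role in deduplication and temporal alignment.
Examine model families (Pile-based, FalconRW, C4-derived) to interpret the behavior of their effective cutoffs.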
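The probing steps above (relative perplexity, median-based aggregation and smoothing, min-max scaling) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. Several details are assumptions, since the review does not spell them out: normalization is done by subtracting each document's mean perplexity, documents are aggregated per bucket with a median, the smoothing is a sliding median with a hypothetical default window of 3, and the effective cutoff is read off as the bucket with the lowest scaled perplexity. All function names are hypothetical.

```python
from statistics import mean, median


def relative_perplexity(ppl_by_doc):
    """Center each document's series on its own mean (assumed normalization)."""
    centered = {}
    for doc, series in ppl_by_doc.items():
        doc_mean = mean(series)
        centered[doc] = [p - doc_mean for p in series]
    return centered


def aggregate_by_bucket(rel_by_doc):
    """Take the median across documents within each time bucket."""
    return [median(bucket) for bucket in zip(*rel_by_doc.values())]


def median_smooth(values, window=3):
    """Sliding median; windows are truncated at the ends of the series."""
    half = window // 2
    return [
        median(values[max(0, i - half): i + half + 1])
        for i in range(len(values))
    ]


def min_max(values):
    """Scale values to the range 0-1; a flat series maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def effective_cutoff(buckets, ppl_by_doc, window=3):
    """Return the bucket with the lowest scaled perplexity, plus the curve.

    Assumption: the effective cutoff is the minimum of the scaled curve.
    """
    rel = relative_perplexity(ppl_by_doc)
    curve = min_max(median_smooth(aggregate_by_bucket(rel), window))
    return buckets[curve.index(min(curve))], curve
```

Here `ppl_by_doc` maps a document identifier to its perplexities, one per time bucket and in the same order as `buckets`. The perplexities themselves would come from scoring each dated version of a document with the model under study. Because the curve is scaled to 0–1, curves from models with very different absolute perplexities can be plotted on the same axis.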

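Experimental results

Research questions

RQ1: What is the effective cutoff of a given resource in an LLM, and how does it relate to the cutoff the model claims?
RQ2: Do resource-level effective cutoffs agree with reported cutoffs across different models and data sources?
RQ3: Do data-processing factors, such as deduplication and CommonCrawl timing, explain the divergence between effective and reported cutoffs?
RQ4: How do the time-spanning datasets (WikiSpan and NewsSpan) reflect the temporal reach of models trained on large web dumps?
RQ5: How does model scale affect the detection of effective cutoffs?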

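Key findings

For many models, especially newer ones, the effective cutoff often differs from the reported cutoff.
Perplexity-based probing can identify the effective knowledge cutoff that corresponds to the version distribution of a resource.
The two main causes of the divergence are semantically close duplicates that deduplication fails to remove and the presence of old data in the CommonCrawl dumps used for training.
Because of temporal misalignment in CommonCrawl, new dumps can still contain large amounts of old material, which biases the effective cutoff toward earlier dates.
Pile-based models show sharper alignment with the Wikipedia dump dates, whereas other families diverge because of deduplication and data composition.
The scaling results show that increasing model size does not resolve the divergence in effective cutoffs; larger models reflect the same data dynamics.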
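To make the near-duplicate finding concrete, the following sketch shows how edit-distance matching can flag older versions of a document that survive exact-match deduplication. It covers only the edit-distance step. The BM25 retrieval step that the review mentions for narrowing down candidates is omitted, and the function names and the 0.2 threshold are hypothetical choices for illustration, not values taken from the paper.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance divided by the longer length, so the result is in 0-1."""
    longest = max(len(a), len(b))
    return 0.0 if longest == 0 else levenshtein(a, b) / longest


def find_near_duplicates(query, candidates, threshold=0.2):
    """Return (candidate, distance) pairs at or below the assumed threshold."""
    scored = ((c, normalized_distance(query, c)) for c in candidates)
    return [(c, d) for c, d in scored if d <= threshold]
```

Two versions of the same article that differ by a single updated figure have a small normalized distance and are flagged as near-duplicates, even though a hash-based exact deduplication pass would keep both. When many such older versions remain in the training data, they can pull the perplexity minimum, and hence the effective cutoff, toward earlier dates.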


This review was generated by AI and checked by a human editor.