QUICK REVIEW

[Paper Review] RedPajama: an Open Dataset for Training Large Language Models

Maurice Weber, Daniel Fu|arXiv (Cornell University)|Nov 19, 2024

Natural Language Processing Techniques|Cited by 17

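One-Sentence Summary

This paper releases RedPajama-V1 (an open reproduction of the LLaMA training data) and RedPajama-V2 (a massive web dataset with quality signals) to promote transparent and scalable open LLM development, and presents ablations showing that quality signals can improve model performance.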

ABSTRACT

Large language models are increasingly becoming a cornerstone technology in artificial intelligence, the sciences, and society as a whole, yet the optimal strategies for dataset composition and filtering remain largely elusive. Many of the top-performing models lack transparency in their dataset curation and model development processes, posing an obstacle to the development of fully open language models. In this paper, we identify three core data-related challenges that must be addressed to advance open-source language models. These include (1) transparency in model development, including the data curation process, (2) access to large quantities of high-quality data, and (3) availability of artifacts and metadata for dataset curation and analysis. To address these challenges, we release RedPajama-V1, an open reproduction of the LLaMA training dataset. In addition, we release RedPajama-V2, a massive web-only dataset consisting of raw, unfiltered text data together with quality signals and metadata. Together, the RedPajama datasets comprise over 100 trillion tokens spanning multiple domains and with their quality signals facilitate the filtering of data, aiming to inspire the development of numerous new datasets. To date, these datasets have already been used in the training of strong language models used in production, such as Snowflake Arctic, Salesforce's XGen and AI2's OLMo. To provide insight into the quality of RedPajama, we present a series of analyses and ablation studies with decoder-only language models with up to 1.6B parameters. Our findings demonstrate how quality signals for web data can be effectively leveraged to curate high-quality subsets of the dataset, underscoring the potential of RedPajama to advance the development of transparent and high-performing language models at scale.

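Motivation and Objectives

Show the need for transparent data curation in open-source LLMs and release the datasets openly.
Provide RedPajama-V1 as an open reproduction of the LLaMA training data, and RedPajama-V2 as a large-scale web-only dataset with quality signals.
Show how quality signals can be used to curate higher-quality data subsets and improve model performance.
Describe the RedPajama-INCITE models trained on the datasets and evaluate their performance against open baselines.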

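Proposed Method

Reproduce the LLaMA training corpus to create RedPajama-V1, with detailed documentation and processing steps.
Collect 84 Common Crawl snapshots (2014–2023) spanning 5 languages and attach 46 quality signals to each document to create RedPajama-V2.
Release quality signals covering natural language, repetitiveness, content-based, ML heuristic, and deduplication measures.
Train the RedPajama-INCITE models (3B and 7B) on Summit, with custom engineering to work around architecture and FP16 limitations.
Run ablations with decoder-only models (468M and 1.6B) to evaluate the effect of quality signals on downstream NLP benchmarks.
Compare RedPajama variants with open baselines using aggregate benchmark metrics.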
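The idea of attaching quality signals to each document and then filtering on them can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the `Document` record, the two signal names (`word_count`, `frac_duplicate_lines`), and the thresholds are hypothetical and are not the signals or cutoffs used in RedPajama-V2.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List

# Hypothetical signal names; the real RedPajama-V2 signals are not listed in this review.
Signals = Dict[str, float]
Rule = Callable[[Signals], bool]


@dataclass
class Document:
    text: str
    signals: Signals


def compute_signals(text: str) -> Signals:
    """Compute two toy quality signals for a raw text document."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    words = text.split()
    duplicate_lines = len(lines) - len(set(lines))
    return {
        "word_count": float(len(words)),
        "frac_duplicate_lines": duplicate_lines / len(lines) if lines else 0.0,
    }


def annotate(texts: Iterable[str]) -> List[Document]:
    """Keep the raw text and store the signals next to it, without filtering."""
    return [Document(text=t, signals=compute_signals(t)) for t in texts]


# Illustrative thresholds only (assumptions, not values from the paper).
RULES: List[Rule] = [
    lambda s: s["word_count"] >= 5,
    lambda s: s["frac_duplicate_lines"] <= 0.3,
]


def filter_documents(docs: Iterable[Document], rules: List[Rule]) -> List[Document]:
    """Keep only documents whose signals satisfy every rule."""
    return [d for d in docs if all(rule(d.signals) for rule in rules)]


if __name__ == "__main__":
    corpus = annotate([
        "the quick brown fox jumps\nover the lazy dog today",
        "buy now\nbuy now\nbuy now\nbuy now",
        "hi",
    ])
    for doc in filter_documents(corpus, RULES):
        print(repr(doc.text), doc.signals)
```

Storing the signals separately from the filtering step is the point of the sketch: the same annotated corpus can be filtered again with a different rule list, which is what makes ablations over filtering rules cheap to run.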

Figure 1: The ecosystem around the RedPajama datasets. RedPajama has provided pretraining data for multiple open-source LLMs, including OpenELM [36], OLMo [19], Snowflake's Arctic [54] and RedPajama-INCITE. SlimPajama is a cleaned and deduplicated version of RedPajama-V1.

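Experimental Results

Research Questions

RQ1: How can open LLM datasets be made more transparent and reproducible?
RQ2: How does applying different quality signals affect the quality of web-derived pretraining data and the resulting model performance?
RQ3: Can a very large open web dataset (RPv2) yield open LLMs that are competitive across standard benchmarks?
RQ4: What are the trade-offs and practical considerations when reproducing a large-scale training corpus on commodity hardware or limited HPC resources?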

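Key Findings

RPv1 closely reproduces the LLaMA training corpus and provides a reproducible open baseline.
RPv2 provides a massive web corpus with 46 quality signals per document, enabling principled filtering and ablations.
In ablations with 468M and 1.6B parameter models, quality signals can have a meaningful effect on downstream benchmark performance.
The RedPajama-INCITE models, trained on Summit, are competitive with open models of similar size in few-shot and zero-shot performance, and the instruct variants do particularly well on few-shot tasks.
The ablations show how different quality filtering rules affect average benchmark performance and perplexity.
RPv2's metadata-rich design facilitates rapid experimentation with high-quality data subsets.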
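Since the ablations are reported in terms of perplexity as well as benchmark averages, it may help to recall how perplexity is usually computed: as the exponential of the mean negative log-likelihood per token. The sketch below uses that standard definition; the function name and the input format (natural-log probabilities of each token) are assumptions, and the review does not describe the exact evaluation setup used in the paper.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity from per-token natural-log probabilities.

    Uses the standard definition exp(-mean(log p)). Lower is better.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


if __name__ == "__main__":
    # A model that assigns probability 1/4 to every token.
    log_probs = [math.log(0.25)] * 4
    print(perplexity(log_probs))
```

A model that spreads its probability uniformly over four choices at every step has a perplexity of about 4, which is why perplexity is often read as an effective branching factor.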

Figure 2: RedPajama-INCITE-Base 3B results on a subset of lm-evaluation-harness. The tasks were selected according to the selection made to evaluate Pythia [4] and GPT-J [59].


This review was generated by AI and checked by a human editor.