QUICK REVIEW

[Paper Review] WanJuan: A Comprehensive Multimodal Dataset for Advancing English and Chinese Large Models

Conghui He, Zhenjiang Jin|arXiv (Cornell University)|Aug 21, 2023

Topic Modeling | Citations: 8

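One-line summary

WanJuan is a large-scale bilingual (Chinese and English) multimodal dataset of more than 2TB, spanning text, image-text, and video data. It supports the training and evaluation of LLMs and MLLMs and has been filtered for safety and quality.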

ABSTRACT

The rise in popularity of ChatGPT and GPT-4 has significantly accelerated the development of large models, leading to the creation of numerous impressive large language models (LLMs) and multimodal large language models (MLLMs). These cutting-edge models owe their remarkable performance to high-quality data. However, the details of the training data used in leading paradigms are often kept confidential. This lack of transparency, coupled with the scarcity of open-source data, impedes further developments within the community. As a response, this paper presents "Wan Juan", a large-scale multimodal dataset composed of both Chinese and English data, collected from a wide range of web sources. The dataset incorporates text, image-text, and video modalities, with a total volume exceeding 2TB. It was utilized in the training of InternLM, a model that demonstrated significant advantages in multi-dimensional evaluations when compared to models of a similar scale. All data can be accessed at https://opendatalab.org.cn/WanJuan1.0.

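Motivation and objectives

Provide a large-scale multimodal training corpus in Chinese and English, collected from diverse web sources.
Ensure safety, high quality, and value alignment through algorithmic processing and manual verification.
Provide a unified JSON format, a download tool, and documentation so the data can be used directly for training and fine-tuning large models.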

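Proposed method

Collect text, image-text, and video data from diverse English and Chinese web sources.
Apply multi-stage cleaning and filtering to remove inappropriate content and low-quality data (pornography, violence, bias, and auto-generated content).
Screen the data with language detection, deduplication (MinHashLSH, n-gram), and quality and safety classifiers (FastText).
Process image-text data with site-specific parsing rules and article-body extraction (Wikipedia headers are retained).
Standardize the data into a unified JSON format and provide an easy-to-use download tool and documentation.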
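
The deduplication step names MinHashLSH and n-grams but gives no parameters, so the following is only a minimal sketch of the general technique, not the authors' pipeline. It assumes character 5-gram shingles, 64 hash functions, and 16 LSH bands; these values, the function names, and the use of salted BLAKE2b as the hash family are all illustrative assumptions.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

# Assumed parameters; the review does not state the values used for WanJuan.
NUM_PERM = 64   # number of hash functions in each signature
BANDS = 16      # LSH bands (64 / 16 = 4 rows per band)
NGRAM = 5       # character n-gram (shingle) length


def shingles(text, n=NGRAM):
    """Return the set of character n-grams of whitespace-normalized text."""
    text = " ".join(text.lower().split())
    if len(text) < n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def _hash(shingle, seed):
    """64-bit hash of a shingle; the seed selects one hash function."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text, num_perm=NUM_PERM):
    """MinHash signature: the minimum hash value under each hash function."""
    grams = shingles(text)
    return tuple(min(_hash(g, seed) for g in grams) for seed in range(num_perm))


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def candidate_pairs(docs, bands=BANDS):
    """Return document-ID pairs that share at least one LSH band bucket."""
    rows = NUM_PERM // bands
    buckets = defaultdict(set)
    for doc_id, text in docs.items():
        sig = minhash_signature(text)
        for b in range(bands):
            buckets[(b, sig[b * rows:(b + 1) * rows])].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


docs = {
    "a": "The quick brown fox jumps over the lazy dog.",
    "b": "The quick brown fox jumps over the lazy dog.",
    "c": "Completely unrelated sentence about datasets!",
}
pairs = candidate_pairs(docs)
```

Only documents that collide in some band are compared further, which avoids scoring every pair in a corpus of hundreds of millions of documents. A production pipeline would more likely use an existing MinHash LSH library than hand-rolled hashing.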

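Experimental results

Research questions

RQ1: What composition and scale should a bilingual multimodal corpus have to be suitable for training LLMs and MLLMs?
RQ2: How can large-scale bilingual data be cleaned and aligned for safety, quality, and value orientation?
RQ3: How do the different modalities (text, image-text, video) affect pre-training outcomes for English-Chinese models?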

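Key findings

The text portion contains more than 600 million documents and occupies over 1TB of storage (624M files and 1019.7 GB in total).
The interleaved image-text portion contains more than 22 million documents and exceeds 200GB (images are provided as URLs).
The video portion contains more than 1,000 videos and exceeds 900GB.
Safety, high quality, and value alignment are emphasized through algorithmic processing and manual verification.
A unified JSON processing format, a dataset download tool, and documentation are provided to support rapid model training.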
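
The review does not describe the layout of the unified JSON format, so the sketch below only illustrates how such records might be consumed. It assumes one JSON object per line (JSON Lines) and uses the hypothetical field names `id` and `content`; check the dataset documentation for the real schema.

```python
import io
import json
from typing import Iterator, TextIO

# Hypothetical field names; the actual WanJuan schema may differ.
REQUIRED_KEYS = ("id", "content")


def iter_records(stream: TextIO) -> Iterator[dict]:
    """Yield one record per non-empty line, checking that required keys exist."""
    for line_no, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            raise ValueError(f"line {line_no}: missing keys {missing}")
        yield record


# In-memory sample standing in for a downloaded file.
sample = io.StringIO(
    '{"id": "doc-1", "content": "Hello"}\n'
    "\n"
    '{"id": "doc-2", "content": "你好"}\n'
)
records = list(iter_records(sample))
```

Streaming line by line keeps memory use flat, which matters when a single shard can be many gigabytes. For a real file, the same function can be passed a handle opened with `encoding="utf-8"`.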

This review was generated by AI and checked by a human editor.