QUICK REVIEW

[Paper Review] A New Massive Multilingual Dataset for High-Performance Language Technologies

Ona de Gibert, Graeme Nail|arXiv (Cornell University)|Mar 20, 2024

Text and Document Classification Technologies | Citations: 5

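One-Line Summary

TL;DR: HPLT releases a large-scale set of multilingual resources: monolingual data for 75 languages (about 5.6 trillion tokens), English-centric parallel data for 18 language pairs (about 96 million sentence pairs), synthetic pivot data, and tools for processing web-crawl corpora.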

ABSTRACT

We present the HPLT (High Performance Language Technologies) language resources, a new massive multilingual dataset including both monolingual and bilingual corpora extracted from CommonCrawl and previously unused web crawls from the Internet Archive. We describe our methods for data acquisition, management and processing of large corpora, which rely on open-source software tools and high-performance computing. Our monolingual collection focuses on low- to medium-resourced languages and covers 75 languages and a total of ~5.6 trillion word tokens de-duplicated on the document level. Our English-centric parallel corpus is derived from its monolingual counterpart and covers 18 language pairs and more than 96 million aligned sentence pairs with roughly 1.4 billion English tokens. The HPLT language resources are one of the largest open text corpora ever released, providing a great resource for language modeling and machine translation training. We publicly release the corpora, the software, and the tools used in this work.

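Motivation and Objectives

Provide a scalable methodology for collecting, processing, and deduplicating large web-crawl corpora for multilingual NLP.
Release open, CC0-licensed monolingual and parallel corpora to support low- to medium-resourced languages.
Describe the data pipeline, quality control, and metadata that give the research community reproducibility and reusability.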

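Proposed Method

Data sources: Internet Archive crawls (IA WIDE15/16/17) and Common Crawl (CC40).
WARC-based text extraction and initial language detection with warc2text.
Sharding and pipeline orchestration on HPC (LUMI) for both monolingual and bilingual processing.
Monolingual processing pipeline (Monotextor): language identification (FastSpell/Hunspell), encoding fixes, and fluency scoring.
Bitext extraction pipeline (Bitextor) for parallel text alignment: translation into English (MarianNMT, including distillation from teacher models), followed by TF/IDF, Bleualign, Bifixer, and Bicleaner AI.
Near-duplicate detection and removal with MinHash; for reproducibility, the data is released both before and after deduplication.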
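
The MinHash step listed above can be illustrated with a minimal sketch. It is not the HPLT implementation: the word-trigram shingles, the 64 hash functions, the 0.8 threshold, and the function names (shingles, minhash_signature, estimated_jaccard, drop_near_duplicates) are all assumptions chosen for illustration, and the pairwise comparison would have to be replaced by an index (for example locality-sensitive hashing) at web scale.

```python
import hashlib


def shingles(text, n=3):
    """Word n-grams of a case- and whitespace-normalised document."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def minhash_signature(items, num_perm=64):
    """One minimum per salted hash function (num_perm is an assumed setting)."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Share of matching positions, an estimate of the Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def drop_near_duplicates(docs, threshold=0.8):
    """Keep a document only if it is not too similar to any kept document.

    Quadratic in the number of documents; illustration only.
    """
    kept, kept_sigs = [], []
    for doc in docs:
        sig = minhash_signature(shingles(doc))
        if all(estimated_jaccard(sig, other) < threshold for other in kept_sigs):
            kept.append(doc)
            kept_sigs.append(sig)
    return kept


if __name__ == "__main__":
    sample = [
        "The quick brown fox jumps over the lazy dog",
        "the quick  brown fox jumps over the lazy dog",
        "An entirely different sentence about multilingual corpora",
    ]
    for doc in drop_near_duplicates(sample):
        print(doc)
```

The second sample document differs from the first only in case and spacing, so both produce the same shingle set and the second one is dropped.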

Figure 1: General overview of the HPLT acquisition and processing pipeline.

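Experimental Results

Research Questions

RQ1: How can a web-scale multilingual corpus be built with rigorous metadata and manageability?
RQ2: What are the scale and characteristics of the monolingual and English-centric parallel resources available for 75 languages and 18 language pairs?
RQ3: To what extent does post-processing (deduplication, cleaning, and bitext processing) improve the usability of the data for language model and MT training?
RQ4: How far can synthetic multilingual pivoting (multiHPLT) extend language coverage beyond English-centric pairs?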

Key Findings

Language pair	Raw segments	Raw tokens	Filtered segments	Filtered tokens	Deduplicated segments	Deduplicated tokens
Norwegian (nn)	28 701 601	496 496 331	649 435	6 308 500	132 538	2 082 878
Bosnian* (bs)	26 998 901	521 626 621	1 426 670	12 439 348	240 012	2 705 525
Basque (eu)	20 830 243	400 262 771	3 087 453	31 739 210	610 687	9 964 617
Maltese (mt)	135 103 434	2 820 798 439	9 170 421	133 140 189	854 820	18 819 145
Gaelic (ga)	101 001 090	2 013 971 167	15 644 170	144 323 574	994 746	16 327 484
Galician (gl)	56 101 411	1 015 559 754	5 789 361	49 604 655	1 063 103	13 904 758
Macedonian (mk)	91 293 129	1 868 196 128	20 474 476	221 370 998	1 139 051	18 562 461
Albanian (sq)	253 098 546	5 819 014 143	16 729 596	144 732 656	1 655 958	25 831 054
Swahili (sw)	247 557 313	5 746 490 123	24 448 577	209 062 077	1 710 205	20 039 612
Icelandic (is)	170 419 019	3 266 074 902	28 149 571	262 486 823	2 148 854	29 493 241
Serbian* (sr)	754 277 462	14 249 438 714	60 482 286	586 909 655	4 643 025	67 063 293
Chinese (zh)	530 119 983	9 162 123 041	47 852 076	510 404 638	5 306 570	83 811 653
Estonian (et)	865 431 226	15 476 948 993	72 976 009	752 767 471	6 089 791	95 943 562
Catalan (ca)	402 492 626	8 034 120 323	88 434 510	882 436 335	8 905 889	141 859 163
Croatian* (hr)	895 785 142	16 565 285 999	128 145 132	1 165 895 906	9 310 275	138 360 666
Hindi (hi)	1 043 856 525	19 246 270 565	117 341 153	996 036 740	12 043 069	165 139 713
Arabic (ar)	1 545 148 805	33 199 212 426	277 864 501	2 307 727 128	14 645 128	239 377 462
Finnish (fi)	3 826 974 191	65 312 092 463	495 310 671	4 186 819 006	25 176 462	338 063 309
Total	-	-	-	-	-	-
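
The Total row is empty in the table above. As a rough cross-check against the abstract, the sketch below sums selected columns directly from the 18 rows. The numbers are copied from the table; the tuple layout and the names ROWS and column_total are illustrative assumptions, and the result is a reader's recomputation, not a figure reported by the authors.

```python
# (language code, raw segments, filtered segments,
#  deduplicated segments, deduplicated tokens), copied from the table above.
ROWS = [
    ("nn", 28_701_601, 649_435, 132_538, 2_082_878),
    ("bs", 26_998_901, 1_426_670, 240_012, 2_705_525),
    ("eu", 20_830_243, 3_087_453, 610_687, 9_964_617),
    ("mt", 135_103_434, 9_170_421, 854_820, 18_819_145),
    ("ga", 101_001_090, 15_644_170, 994_746, 16_327_484),
    ("gl", 56_101_411, 5_789_361, 1_063_103, 13_904_758),
    ("mk", 91_293_129, 20_474_476, 1_139_051, 18_562_461),
    ("sq", 253_098_546, 16_729_596, 1_655_958, 25_831_054),
    ("sw", 247_557_313, 24_448_577, 1_710_205, 20_039_612),
    ("is", 170_419_019, 28_149_571, 2_148_854, 29_493_241),
    ("sr", 754_277_462, 60_482_286, 4_643_025, 67_063_293),
    ("zh", 530_119_983, 47_852_076, 5_306_570, 83_811_653),
    ("et", 865_431_226, 72_976_009, 6_089_791, 95_943_562),
    ("ca", 402_492_626, 88_434_510, 8_905_889, 141_859_163),
    ("hr", 895_785_142, 128_145_132, 9_310_275, 138_360_666),
    ("hi", 1_043_856_525, 117_341_153, 12_043_069, 165_139_713),
    ("ar", 1_545_148_805, 277_864_501, 14_645_128, 239_377_462),
    ("fi", 3_826_974_191, 495_310_671, 25_176_462, 338_063_309),
]


def column_total(index):
    """Sum one numeric column of ROWS (index 1..4)."""
    return sum(row[index] for row in ROWS)


raw_segments = column_total(1)
filtered_segments = column_total(2)
dedup_segments = column_total(3)
dedup_tokens = column_total(4)

# Share of raw segments removed by filtering, before deduplication.
filter_reduction = 1 - filtered_segments / raw_segments

if __name__ == "__main__":
    print(f"deduplicated segments: {dedup_segments:,}")
    print(f"deduplicated tokens:   {dedup_tokens:,}")
    print(f"removed by filtering:  {filter_reduction:.1%}")
```

Summed this way, the deduplicated segments come to a little over 96 million and the deduplicated tokens to roughly 1.4 billion, in line with the figures quoted in the abstract.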

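The monolingual corpus (monoHPLT) covers 75 languages and about 5.6 trillion tokens; the deduplicated text is about 50.1 TB uncompressed.
The parallel corpus (biHPLT) covers 18 language pairs, with more than 96 million clean, deduplicated sentence pairs and roughly 1.4 billion English tokens.
The synthetic multilingual pivot data (multiHPLT), built by pivoting through English, yields 171 language pairs and about 157 million sentence pairs.
Overlap with the publicly available CCMatrix and ParaCrawl corpora is only about 3.35% and 15.72%, respectively, indicating substantial novelty.
The processing pipeline combines language identification, filtering, and quality control of bilingual candidates (Bicleaner AI); it cuts the volume of parallel data by about 90% between the raw and filtered stages.
On environmental cost, the full pipeline is estimated to have required about 5 million CPU hours and about 50 thousand GPU hours, set against the background of LUMI's renewable energy supply.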
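
The pivoting idea behind multiHPLT can be sketched as a join on the shared English side of two English-centric bitexts. This is only an illustration: exact string matching on the English sentence, the function name pivot_through_english, and the toy sentences are assumptions, not the authors' procedure. The last lines show one way to arrive at 171, by counting all unordered pairs among the 18 languages plus English.

```python
from collections import defaultdict
from itertools import combinations


def pivot_through_english(en_to_x, en_to_y):
    """Join two English-centric bitexts on identical English sentences.

    en_to_x and en_to_y are lists of (english, foreign) sentence pairs.
    Returns (x, y) pairs that share the same English side.
    """
    by_english = defaultdict(list)
    for english, x_sentence in en_to_x:
        by_english[english].append(x_sentence)

    pairs = []
    for english, y_sentence in en_to_y:
        for x_sentence in by_english.get(english, []):
            pairs.append((x_sentence, y_sentence))
    return pairs


# Toy data, invented for illustration only.
en_fi = [("Good morning.", "Hyvää huomenta."), ("Thank you.", "Kiitos.")]
en_et = [("Thank you.", "Aitäh."), ("Welcome.", "Tere tulemast.")]

fi_et = pivot_through_english(en_fi, en_et)

# 18 non-English languages plus English give 19 languages in total;
# the number of unordered language pairs among them is C(19, 2).
pair_count = len(list(combinations(range(19), 2)))

if __name__ == "__main__":
    print(fi_et)
    print(pair_count)
```

Only "Thank you." appears in both toy bitexts, so the join produces a single Finnish-Estonian pair.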

Figure 2: Size distribution for the monolingual corpora, organized by language family and language. The volume of texts ranges from 1.0 GB for text classified by CLD2 as Esperanto to 20.3 TB for English, accounting for 41% of the whole collection.

This review was generated by AI and checked by a human editor.