QUICK REVIEW

[Paper Review] Call for Papers -- The BabyLM Challenge: Sample-efficient pretraining on a developmentally plausible corpus

Alex Warstadt, Leshem Choshen|arXiv (Cornell University)|Jan 27, 2023

Natural Language Processing Techniques | Cited by 16

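One-line summary

A call for papers for the BabyLM Challenge, which targets small-scale pretraining on developmentally plausible data and offers three tracks that share a common evaluation pipeline.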

ABSTRACT

We present the call for papers for the BabyLM Challenge: Sample-efficient pretraining on a developmentally plausible corpus. This shared task is intended for participants with an interest in small scale language modeling, human language acquisition, low-resource NLP, and cognitive modeling. In partnership with CoNLL and CMCL, we provide a platform for approaches to pretraining with a limited-size corpus sourced from data inspired by the input to children. The task has three tracks, two of which restrict the training data to pre-released datasets of 10M and 100M words and are dedicated to explorations of approaches such as architectural variations, self-supervised objectives, or curriculum learning. The final track only restricts the amount of text used, allowing innovation in the choice of the data, its domain, and even its modality (i.e., data from sources other than text is welcome). We will release a shared evaluation pipeline which scores models on a variety of benchmarks and tasks, including targeted syntactic evaluations and natural language understanding.

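Motivation and objectives

Promote pretraining research that mimics human language input at limited data sizes.
Democratize pretraining research by making participation feasible even on a university budget.
Explore techniques that improve data efficiency and cognitive plausibility.
Provide a platform for bringing insights from cognitive science and linguistics into NLP.
Provide a transparent evaluation pipeline that includes benchmarks across multiple tasks.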

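Proposed approach

Three tracks (Strict, Strict-small, Loose) that either fix the training dataset or cap only the amount of data.
A developmentally plausible pretraining corpus of under 100M words, with an emphasis on transcribed speech.
A shared evaluation pipeline (scoring and downstream tasks) compatible with HuggingFace Transformers.
Baseline models based on existing LMs (OPT, RoBERTa, T5), trained on the fixed datasets.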
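
The word budgets of the two fixed-data tracks can be illustrated with a minimal sketch. It assumes that a "word" is a whitespace-delimited token, which the summary does not specify, and the names `TRACK_BUDGETS`, `count_words`, and `fits_track` are hypothetical helpers, not part of any official BabyLM tooling. The Loose track is left out because the summary gives no number for its cap.

```python
from typing import Iterable

# Word budgets stated for the two fixed-data tracks (10M and 100M words).
TRACK_BUDGETS = {"strict-small": 10_000_000, "strict": 100_000_000}


def count_words(lines: Iterable[str]) -> int:
    """Count whitespace-delimited tokens (an assumed definition of 'word')."""
    return sum(len(line.split()) for line in lines)


def fits_track(lines: Iterable[str], track: str) -> bool:
    """Return True if the corpus stays within the track's word budget."""
    return count_words(lines) <= TRACK_BUDGETS[track]


# Tiny made-up corpus, for illustration only.
corpus = ["look at the doggy", "do you want more juice ?"]
print(count_words(corpus), fits_track(corpus, "strict-small"))  # 10 True
```

A real check would stream the corpus files line by line rather than hold them in a list, but the budget comparison stays the same.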

Figure 1: Data Scale: Modern Language Models are trained on data multiple orders of magnitude larger than the amount available to a typical human child. Image based off Fig. 1 from Warstadt and Bowman (2022).

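Experimental results

Research questions

RQ1: How well does language model pretraining perform when the data is limited to a human-like scale (10M vs. 100M words)?
RQ2: Which architectures, training objectives, and curriculum strategies improve data efficiency under developmentally plausible data constraints?
RQ3: How well do models trained on restricted data perform on targeted syntactic evaluations and natural language understanding?
RQ4: What are the trade-offs between resources and efficiency under strict versus loose data restrictions?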
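
Targeted syntactic evaluations such as those in RQ3 are commonly framed as minimal-pair comparisons: a model passes an item if it assigns a higher score to the acceptable sentence than to its unacceptable counterpart. The sketch below illustrates that scoring rule only. The summary does not say which benchmarks the shared pipeline uses, so the minimal-pair framing is an assumption, and the add-one-smoothed unigram scorer is a toy stand-in for a real language model (it cannot capture agreement and is there only to make the example runnable). All function names and sentences are hypothetical.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Fit an add-one-smoothed unigram model; a toy stand-in for a real LM."""
    counts = Counter(w for s in sentences for w in s.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot for unseen words

    def log_prob(sentence):
        # Sum of smoothed per-word log probabilities.
        return sum(
            math.log((counts[w] + 1) / (total + vocab)) for w in sentence.split()
        )

    return log_prob


def minimal_pair_accuracy(log_prob, pairs):
    """Fraction of (acceptable, unacceptable) pairs where the acceptable sentence scores higher."""
    hits = sum(log_prob(good) > log_prob(bad) for good, bad in pairs)
    return hits / len(pairs)


# Made-up training sentences and one made-up minimal pair.
train = ["the dogs bark", "the dogs bark loudly", "the dog barks"]
pairs = [("the dogs bark", "the dogs barks")]
score = train_unigram(train)
print(minimal_pair_accuracy(score, pairs))  # 1.0
```

Swapping `train_unigram` for a function that returns sentence log-probabilities from a pretrained model would leave `minimal_pair_accuracy` unchanged.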

Key findings

Dataset	Domain	# Words (Strict-small)	# Words (Strict)	Proportion
CHILDES (MacWhinney, 2000)	Child-directed speech	0.44M	4.21M	5%
British National Corpus (BNC), dialogue portion	Dialogue	0.86M	8.16M	8%
Children’s Book Test (Hill et al., 2016)	Children’s books	0.57M	5.55M	6%
Children’s Stories Text Corpus	Children’s books	0.34M	3.22M	3%
Standardized Project Gutenberg Corpus (Gerlach and Font-Clos, 2018)	Written English	0.99M	9.46M	10%
OpenSubtitles (Lison and Tiedemann, 2016)	Movie subtitles	3.09M	31.28M	31%
QCRI Educational Domain Corpus (QED; Abdelali et al., 2014)	Educational video subtitles	1.04M	10.24M	11%
Wikipedia	Wikipedia (English)	0.99M	10.08M	10%
Simple Wikipedia	Wikipedia (Simple English)	1.52M	14.66M	15%
Switchboard Dialog Act Corpus (Stolcke et al., 2000)	Dialogue	0.12M	1.18M	1%
Total	–	9.96M	98.04M	100%
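
As a sanity check on the table, the per-dataset word counts can be summed and compared with the listed totals and proportions. This is a sketch that uses only the figures in the table; computing each share from the Strict column is an assumption, since the table does not say which column the proportions are based on. The dataset names are shortened for readability.

```python
# (dataset, Strict-small words in millions, Strict words in millions, listed %)
ROWS = [
    ("CHILDES", 0.44, 4.21, 5),
    ("BNC (dialogue portion)", 0.86, 8.16, 8),
    ("Children's Book Test", 0.57, 5.55, 6),
    ("Children's Stories Text Corpus", 0.34, 3.22, 3),
    ("Standardized Project Gutenberg Corpus", 0.99, 9.46, 10),
    ("OpenSubtitles", 3.09, 31.28, 31),
    ("QED", 1.04, 10.24, 11),
    ("Wikipedia", 0.99, 10.08, 10),
    ("Simple Wikipedia", 1.52, 14.66, 15),
    ("Switchboard", 0.12, 1.18, 1),
]

# Column totals, rounded to the table's two-decimal precision.
small_total = round(sum(r[1] for r in ROWS), 2)
strict_total = round(sum(r[2] for r in ROWS), 2)
print(small_total, strict_total)  # 9.96 98.04

# Share of each dataset in the Strict column versus the listed percentage.
for name, _, strict, listed in ROWS:
    share = 100 * strict / strict_total
    print(f"{name}: {share:.1f}% computed vs {listed}% listed")
```

The totals reproduce the last row of the table. The computed shares stay within one percentage point of the listed values but do not always round to them, which suggests the listed percentages were adjusted so that they sum to 100%.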

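A three-track structure (Strict, Strict-small, Loose) with clearly defined data constraints and a shared evaluation pipeline.
A fixed dataset of roughly 100M words spanning diverse domains, including child-directed speech, dialogue, children's books, Wikipedia, and subtitles.
The 10M-word Strict-small track and the 100M-word Strict track fix the training data, while the Loose track restricts only the amount of text; together they allow exploration of data-efficient pretraining methods.
Baseline models (from the OPT, RoBERTa, and T5 families) are provided as a naive starting point for comparison.
Evaluation emphasizes efficiency, cognitive plausibility, and downstream NLP tasks, with a widely accessible Colab-based pipeline.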

This review was generated by AI and checked by a human editor.