QUICK REVIEW

[Paper Review] BERTIN: Efficient Pre-Training of a Spanish Language Model using Perplexity Sampling

Javier de la Rosa, Eduardo G. Ponferrada|arXiv (Cornell University)|Jul 14, 2022

Topic Modeling | Citations: 33

ONE-LINE SUMMARY

TL;DR: BERTIN uses perplexity-based sampling to build a 50M-document subset of mC4-es, efficiently pre-trains a Spanish RoBERTa-base model on it, and shows competitive MLM and downstream-task performance.

ABSTRACT

The pre-training of large language models usually requires massive amounts of resources, both in terms of computation and data. Frequently used web sources such as Common Crawl might contain enough noise to make this pre-training sub-optimal. In this work, we experiment with different sampling methods from the Spanish version of mC4, and present a novel data-centric technique which we name $\textit{perplexity sampling}$ that enables the pre-training of language models in roughly half the amount of steps and using one fifth of the data. The resulting models are comparable to the current state-of-the-art, and even achieve better results for certain tasks. Our work is proof of the versatility of Transformers, and paves the way for small teams to train their models on a limited budget. Our models are available at this $\href{https://huggingface.co/bertin-project}{URL}$.

MOTIVATION AND OBJECTIVES

Investigate how much data is needed to train a high-quality monolingual Spanish language model.
Explore document sampling methods to improve data efficiency during pre-training.
Assess how data quality and sampling impact training time and final model performance.
Release datasets and code to enable replication and further research.

PROPOSED METHOD

Compute a perplexity score for each document in the Spanish subset of mC4 (mC4-es) using a 5-gram KenLM model trained on Spanish Wikipedia.
Define two sampling functions (Stepwise and Gaussian) that oversample the central perplexity range and undersample texts with very low or very high perplexity.
Compare against random sampling as a baseline, training RoBERTa-base-style masked language models (MLM) with 128- and 512-token sequences for ~250k steps.
Use identical training hyperparameters to a prior RoBERTa setup, with staged sequence-length extension and TPUv3-8 hardware.
Evaluate models on downstream Spanish tasks (POS, NER, PAWS-X, XNLI) and report MLM accuracy for different sequence lengths.
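
The two sampling functions above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `width`, `p_center`, and `p_tail` parameters with their default values, and the toy log-normal perplexities are all hypothetical. In practice the perplexity scores would come from the KenLM model.

```python
import math
import random
import statistics


def gaussian_keep_prob(ppl, q1, q2, q3, width=1.0):
    """Keep-probability that peaks at the median perplexity (q2) and decays
    away from it. `width` is a hypothetical knob scaling the bell curve."""
    spread = width * (q3 - q1)      # interquartile range sets the scale
    z = (ppl - q2) / spread
    return math.exp(-0.5 * z * z)   # exactly 1.0 at the median


def stepwise_keep_prob(ppl, q1, q3, p_center=0.8, p_tail=0.2):
    """Constant keep-probability per region: high between the first and
    third quartiles, low outside. Both probabilities are assumed values."""
    return p_center if q1 <= ppl <= q3 else p_tail


def subsample(perplexities, keep_prob, seed=0):
    """Return indices of documents kept by an independent coin flip each."""
    rng = random.Random(seed)
    return [i for i, ppl in enumerate(perplexities) if rng.random() < keep_prob(ppl)]


# Toy perplexity scores (assumption: skewed, strictly positive values).
toy_rng = random.Random(42)
toy_ppl = [toy_rng.lognormvariate(5.0, 1.0) for _ in range(10_000)]

# Quartiles of the score distribution drive both sampling functions.
q1, q2, q3 = statistics.quantiles(toy_ppl, n=4)

kept_gauss = subsample(toy_ppl, lambda p: gaussian_keep_prob(p, q1, q2, q3))
kept_step = subsample(toy_ppl, lambda p: stepwise_keep_prob(p, q1, q3))
print(len(kept_gauss), len(kept_step))
```

Either function keeps documents near the median with the highest probability and thins out the tails. Random sampling, the baseline, corresponds to a constant keep-probability for every document.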

EXPERIMENTAL RESULTS

RESEARCH QUESTIONS

RQ1: How much data is enough to train a well-performing monolingual Spanish language model?
RQ2: How should documents be selected to enable more efficient training when data is abundant?
RQ3: How does data quality affect training time and model performance?

KEY FINDINGS

Method	MLM@128	MLM@512
Random	65.20	59.07
Stepwise	65.34	67.44
Gaussian	66.08	68.73

Gaussian perplexity sampling generally yields the most consistent and strongest performance across tasks.
Both perplexity-based sampling methods outperform random sampling in several downstream tasks, with Gaussian-512 achieving strong results.
MLM accuracy is 66.08 for Gaussian-128 and 68.73 for Gaussian-512, so accuracy is higher at a sequence length of 512 than at 128.
Compared to baselines, Gaussian and Stepwise sampling achieve competitive or superior results on NER and PAWS-X in certain configurations.
Training a RoBERTa-base Spanish model on ~50M documents (~200GB after subsampling from 1TB) can yield competitive results within roughly a week on TPUv3-8.
The study releases datasets and code to reproduce the perplexity-based sampling and model training.

This review was generated by AI and checked by a human editor.