QUICK REVIEW

[Paper Review] WikiBERT models: deep transfer learning for many languages

Sampo Pyysalo, Jenna Kanerva|arXiv (Cornell University)|Jun 2, 2020

Natural Language Processing Techniques | References: 20 | Citations: 26

One-line summary

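This paper presents a fully automated pipeline for building WikiBERT, a set of 42 language-specific BERT models for low- and medium-resource languages, trained exclusively on Wikipedia data. Evaluated on the Universal Dependencies parsing benchmark, the WikiBERT models outperform mBERT on average (LAS 86.6% vs. 86.1%), with a marked gain for Finnish and a drop for Belarusian. The results suggest that the gains over mBERT may be largest when roughly 100 million to 1 billion tokens of pre-training data are available.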

ABSTRACT

Deep neural language models such as BERT have enabled substantial recent advances in many natural language processing tasks. Due to the effort and computational cost involved in their pre-training, language-specific models are typically introduced only for a small number of high-resource languages such as English. While multilingual models covering large numbers of languages are available, recent work suggests monolingual training can produce better models, and our understanding of the tradeoffs between mono- and multilingual training is incomplete. In this paper, we introduce a simple, fully automated pipeline for creating language-specific BERT models from Wikipedia data and introduce 42 new such models, most for languages up to now lacking dedicated deep neural language models. We assess the merits of these models using the state-of-the-art UDify parser on Universal Dependencies data, contrasting performance with results using the multilingual BERT model. We find that UDify using WikiBERT models outperforms the parser using mBERT on average, with the language-specific models showing substantially improved performance for some languages, yet limited improvement or a decrease in performance for others. We also present preliminary results as first steps toward an understanding of the conditions under which language-specific models are most beneficial. All of the methods and models introduced in this work are available under open licenses from https://github.com/turkunlp/wikibert.

Motivation and objectives

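Address the lack of high-quality, language-specific BERT models for many low- and medium-resource languages.
Build a fully automated, scalable pipeline that produces such models from Wikipedia data alone.
Evaluate these models against the mBERT baseline on a multilingual dependency parsing benchmark.
Clarify the conditions under which monolingual pre-training outperforms multilingual pre-training.
Release the models and the pipeline under open licenses to support NLP research and development.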

Proposed method

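The authors built a fully automated pipeline that extracts, preprocesses, and tokenizes text from the Wikipedia dumps of 309 languages.
Language-specific BERT models were pre-trained exclusively on Wikipedia text, excluding extinct languages and languages no longer in use.
Pre-training used the standard BERT objectives: masked language modeling and next sentence prediction.
For fine-tuning and evaluation, the UDify dependency parser was trained and tested on Universal Dependencies treebanks.
LAS (labeled attachment score) served as the primary metric, and initialization with mBERT was compared against initialization with WikiBERT across 42 languages.
The analysis focused on how the relative change in performance correlates with pre-training data size and with language family relationships.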
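
As a rough illustration of the primary metric, the sketch below computes LAS (and the related unlabeled score, UAS) for one sentence. It is a simplified sketch that assumes gold tokenization and exact label matching; the function name `attachment_scores` and the toy sentence are hypothetical and not taken from the paper or from UDify, whose official evaluation also handles token alignment and label subtypes.

```python
from typing import List, Tuple

# Each token is represented as (head_index, dependency_label).
# Head index 0 denotes the artificial root of the sentence.
Token = Tuple[int, str]


def attachment_scores(gold: List[Token], pred: List[Token]) -> Tuple[float, float]:
    """Return (UAS, LAS) as fractions in [0, 1] for two aligned token sequences.

    UAS counts tokens whose predicted head matches the gold head.
    LAS counts tokens whose predicted head AND label both match.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and predicted sequences must have the same length")
    if not gold:
        raise ValueError("cannot score an empty sequence")
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    full_hits = sum(g == p for g, p in zip(gold, pred))
    return head_hits / len(gold), full_hits / len(gold)


# Toy sentence: "She reads long books"
gold = [(2, "nsubj"), (0, "root"), (4, "amod"), (2, "obj")]
# Hypothetical parser output: wrong head for "long", wrong label for "books".
pred = [(2, "nsubj"), (0, "root"), (2, "amod"), (2, "obl")]

uas, las = attachment_scores(gold, pred)
print(f"UAS = {uas:.2f}, LAS = {las:.2f}")
```

In this toy case three of four heads are correct but only two tokens have both the correct head and the correct label, so LAS is lower than UAS.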

Experimental results

Research questions

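RQ1: Does training language-specific BERT models on Wikipedia data yield better performance than a multilingual model such as mBERT?
RQ2: How does the amount of pre-training data affect the performance gains of language-specific models?
RQ3: How do language relatedness and language family membership affect the relative performance of monolingual and multilingual models?
RQ4: Do particular linguistic features or thresholds (e.g., data size, typological features) determine when a monolingual model outperforms a multilingual one?
RQ5: Can a fully automated pipeline reliably produce high-quality language-specific BERT models across a wide range of languages?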

Key findings

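With UDify initialized from WikiBERT, the average LAS reached 86.6%, slightly above the 86.1% obtained with mBERT.
Replacing mBERT with WikiBERT gave a relative error reduction of about 4% on average, indicating a clear performance improvement.
Finnish showed the largest gain, with LAS error falling by more than 10% relative to mBERT.
Belarusian showed the largest drop, suggesting that monolingual pre-training is not guaranteed to help, even for a language closely related to a high-resource language.
A performance "sweet spot" was observed at roughly 100 million to 1 billion tokens of pre-training data, where the gains over mBERT were most pronounced.
For English (a Germanic language with abundant data), mBERT and WikiBERT performed almost identically, indicating that monolingual pre-training offered no advantage in this high-resource setting.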
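
The relative error reduction quoted above can be checked with a short calculation. This is a minimal sketch that applies the usual definition (error = 100 - LAS) to the two average scores reported in this review; the function name `relative_error_reduction` is hypothetical, and the paper may instead average per-language reductions, which need not give exactly the same figure.

```python
def relative_error_reduction(baseline_las: float, new_las: float) -> float:
    """Relative reduction in LAS error, as a fraction of the baseline error.

    Scores are percentages (0-100); the error is defined as 100 - LAS.
    """
    baseline_error = 100.0 - baseline_las
    new_error = 100.0 - new_las
    if baseline_error <= 0:
        raise ValueError("baseline error must be positive")
    return (baseline_error - new_error) / baseline_error


# Average LAS reported in this review: mBERT 86.1%, WikiBERT 86.6%.
rer = relative_error_reduction(86.1, 86.6)
print(f"Relative error reduction: {rer:.1%}")
```

The absolute gain is only 0.5 points, but because the baseline error is 13.9 points, the relative reduction comes to roughly 3.6%, which is consistent with the "about 4%" figure.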


This review was generated by AI and checked by a human editor.