QUICK REVIEW

[Paper Review] Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping

Jesse Dodge, Gabriel Ilharco|arXiv (Cornell University)|Feb 15, 2020

Topic Modeling | References: 29 | Citations: 216

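One-Sentence Summary

This paper shows that the random seeds for weight initialization and data order cause large variance in BERT fine-tuning, demonstrates the gains from running multiple trials, introduces an early stopping approach, and releases 2,100 fine-tuning runs on GLUE tasks.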

ABSTRACT

Fine-tuning pretrained contextual word embedding models to supervised downstream tasks has become commonplace in natural language processing. This process, however, is often brittle: even with the same hyperparameter values, distinct random seeds can lead to substantially different results. To better understand this phenomenon, we experiment with four datasets from the GLUE benchmark, fine-tuning BERT hundreds of times on each while varying only the random seeds. We find substantial performance increases compared to previously reported results, and we quantify how the performance of the best-found model varies as a function of the number of fine-tuning trials. Further, we examine two factors influenced by the choice of random seed: weight initialization and training data order. We find that both contribute comparably to the variance of out-of-sample performance, and that some weight initializations perform well across all tasks explored. On small datasets, we observe that many fine-tuning trials diverge part of the way through training, and we offer best practices for practitioners to stop training less promising runs early. We publicly release all of our experimental data, including training and validation scores for 2,100 trials, to encourage further analysis of training dynamics during fine-tuning.

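Motivation and Objectives

Understand how random seeds affect the fine-tuning performance of pretrained language models.
Quantify how much weight initialization and data order each contribute to performance variance.
Evaluate whether multiple fine-tuning trials yield notable gains over a single trial.
Propose an early stopping strategy that reduces compute cost while maintaining performance.
Release the fine-tuning data to encourage analysis of training dynamics.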

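Proposed Method

Fine-tune BERT-large on four GLUE tasks while varying only the random seeds that control the weight initialization (WI) of the final layer and the training data order (DO).
Train each task for 3 epochs with standard hyperparameters and report validation performance for every seed combination.
Analyze the variance using a grid of seeds that separates WI from DO, and compute the expected maximum performance as a function of the number of trials.
Use ANOVA to test whether the best and worst WI/DO seeds differ in mean performance.
Propose and evaluate a simple early stopping algorithm that halts less promising trials partway through training to save compute.
Release the complete dataset of 2,100 fine-tuning runs, including training loss and validation performance.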
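
The idea of decoupling the two seeds can be illustrated with a minimal sketch. It is not the authors' code: the function names, the toy layer size, the standard deviation of 0.02, and the use of Python's standard `random` module in place of a deep learning framework are all assumptions made for illustration. The point is only that each source of randomness gets its own generator, so one can be changed while the other stays fixed.

```python
import random

def init_classifier_weights(wi_seed, hidden_size=4, num_labels=2):
    """Draw toy final-layer weights from a generator driven only by the WI seed.

    hidden_size, num_labels, and the 0.02 standard deviation are illustrative
    assumptions, not values taken from the review.
    """
    rng = random.Random(wi_seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(hidden_size)]
            for _ in range(num_labels)]

def training_order(do_seed, num_examples=8):
    """Shuffle example indices with a generator driven only by the DO seed."""
    rng = random.Random(do_seed)
    order = list(range(num_examples))
    rng.shuffle(order)
    return order

def seed_grid(wi_seeds, do_seeds):
    """Enumerate every (WI, DO) pair so each factor can be varied in isolation."""
    return [(wi, do) for wi in wi_seeds for do in do_seeds]

# A 3 x 3 grid: nine hypothetical fine-tuning runs.
grid = seed_grid(range(3), range(3))
for wi, do in grid:
    weights = init_classifier_weights(wi)  # depends only on wi
    order = training_order(do)             # depends only on do
```

Because the two generators are independent, runs that share a WI seed start from identical final-layer weights regardless of the DO seed, which is what allows the variance to be attributed to one factor or the other.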

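Experimental Results

Research Questions

RQ1: How much of the variance in fine-tuning performance is attributable to the random seeds that control weight initialization and data order?
RQ2: Do some WI and DO seeds consistently outperform others across tasks, and do any seeds generalize across datasets?
RQ3: How large is the benefit of running multiple fine-tuning trials on GLUE tasks, measured by the maximum achievable validation performance?
RQ4: Can an early stopping strategy reduce computation while limiting the loss in final performance?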
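
To make RQ3 concrete, the sketch below estimates the expected best validation score among k trials drawn with replacement from n observed scores. This is one common estimator for this quantity; whether it matches the paper's exact computation is an assumption, and the function name `expected_max` and the example scores are hypothetical.

```python
def expected_max(scores, k):
    """Expected maximum of k draws (with replacement) from the observed scores.

    With the scores sorted ascending as v_1 <= ... <= v_n, the probability
    that the maximum of k draws is the i-th smallest value is
    (i/n)^k - ((i-1)/n)^k.
    """
    n = len(scores)
    ordered = sorted(scores)
    total = 0.0
    for i, value in enumerate(ordered, start=1):
        total += value * ((i / n) ** k - ((i - 1) / n) ** k)
    return total

# Made-up validation accuracies, not results from the paper.
toy_scores = [0.5, 0.7, 0.9]
curve = [expected_max(toy_scores, k) for k in range(1, 6)]
```

With k = 1 the estimate equals the mean of the observed scores, and it rises toward the best observed score as k grows, which is the shape of curve used to report performance as a function of the number of trials.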

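Key Findings

Multiple fine-tuning trials with different seeds yield notable gains over single-trial results on four GLUE tasks.
Weight initialization and data order contribute comparably to performance variance, and some seeds perform consistently well across tasks.
Some weight initialization seeds work well on multiple tasks, suggesting that generally favorable WI seeds exist.
Early stopping reduces compute cost while achieving similar or improved expected performance across budgets.
The best performance found over many trials substantially exceeds previously published results using the same model and settings on several tasks.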
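
The early stopping finding can be sketched as follows, assuming a simple scheme in which many trials are started, all are evaluated at an early checkpoint, and only the top few are trained to completion. The review does not spell out the algorithm, so this scheme, the function name `early_stop_select`, and the toy learning curves are assumptions rather than the authors' implementation.

```python
def early_stop_select(curves, check_idx, num_continued):
    """Keep the trials that look best at an early checkpoint.

    curves maps a trial id to its validation score at each checkpoint.
    Returns the best surviving trial, its final score, and the total cost
    measured in checkpoint-intervals of training.
    """
    # Rank all trials by their score at the early checkpoint.
    ranked = sorted(curves, key=lambda t: curves[t][check_idx], reverse=True)
    survivors = ranked[:num_continued]
    # Among the survivors, pick the one with the best final score.
    best = max(survivors, key=lambda t: curves[t][-1])
    total_ckpts = len(next(iter(curves.values())))
    # Every trial pays for the early segment; only survivors pay for the rest.
    cost = len(curves) * (check_idx + 1) + num_continued * (total_ckpts - check_idx - 1)
    return best, curves[best][-1], cost

# Made-up learning curves for four hypothetical trials.
toy_curves = {
    "a": [0.5, 0.6, 0.7],
    "b": [0.4, 0.4, 0.4],
    "c": [0.6, 0.7, 0.65],
    "d": [0.3, 0.5, 0.8],
}
best_trial, best_score, cost = early_stop_select(toy_curves, check_idx=0, num_continued=2)
full_cost = len(toy_curves) * 3  # cost of training every trial to completion
```

In this toy example the procedure spends 8 units instead of 12 and returns trial "a" with a final score of 0.7, while the late-blooming trial "d" is discarded. That is the trade-off such a strategy makes: lower compute in exchange for occasionally missing a run that would have improved later.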

This review was generated by AI and checked by a human editor.