QUICK REVIEW

[Paper Review] On the Stability of Fine-tuning BERT: Misconceptions, Explanations, and Strong Baselines

Marius Mosbach, Maksym Andriushchenko|arXiv (Cornell University)|Jun 8, 2020

Nuclear reactor physics and engineering | References: 41 | Citations: 211

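One-line summary

This paper shows that the instability of fine-tuning BERT-based models stems mainly from optimization difficulties (vanishing gradients) and from variance in generalization, not from catastrophic forgetting or small datasets, and it introduces a simple, strong baseline that substantially improves stability.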

ABSTRACT

Fine-tuning pre-trained transformer-based language models such as BERT has become a common practice dominating leaderboards across various NLP benchmarks. Despite the strong empirical performance of fine-tuned models, fine-tuning is an unstable process: training the same model with multiple random seeds can result in a large variance of the task performance. Previous literature (Devlin et al., 2019; Lee et al., 2020; Dodge et al., 2020) identified two potential reasons for the observed instability: catastrophic forgetting and small size of the fine-tuning datasets. In this paper, we show that both hypotheses fail to explain the fine-tuning instability. We analyze BERT, RoBERTa, and ALBERT, fine-tuned on commonly used datasets from the GLUE benchmark, and show that the observed instability is caused by optimization difficulties that lead to vanishing gradients. Additionally, we show that the remaining variance of the downstream task performance can be attributed to differences in generalization where fine-tuned models with the same training loss exhibit noticeably different test performance. Based on our analysis, we present a simple but strong baseline that makes fine-tuning BERT-based models significantly more stable than the previously proposed approaches. Code to reproduce our results is available online: https://github.com/uds-lsv/bert-stable-fine-tuning.

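Motivation and objectives

Investigate why fine-tuning BERT-based models is unstable across random seeds.
Test whether the commonly cited hypotheses (catastrophic forgetting, insufficient data) actually cause the instability.
Decompose the instability into an optimization component and a generalization component.
Propose a simple, robust fine-tuning baseline that improves both stability and performance.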

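Proposed method

Analyze the fine-tuning stability of BERT, RoBERTa, and ALBERT on GLUE tasks.
Inspect gradients to identify the optimization problems behind failed runs.
Evaluate the effect of ADAM bias correction and warmup.
Evaluate how training for more iterations (longer training) affects stability.
Propose and validate a simple baseline fine-tuning setup that combines bias correction with extended training.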
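
The warmup-like effect of bias correction mentioned above can be illustrated in a few lines of Python. This is a minimal sketch, not the authors' code: it assumes the common Adam defaults beta1 = 0.9 and beta2 = 0.999, ignores the epsilon term, and the helper names `bias_correction_factor` and `effective_lr` are hypothetical. Under those assumptions, Adam with bias correction behaves like uncorrected Adam whose learning rate is scaled by sqrt(1 - beta2^t) / (1 - beta1^t) at step t.

```python
import math


def bias_correction_factor(step, beta1=0.9, beta2=0.999):
    """Scale that bias correction applies to the step size at a 1-indexed step.

    Derived from m_hat = m / (1 - beta1**t) and v_hat = v / (1 - beta2**t),
    ignoring epsilon. The beta values are assumed defaults, not taken from
    the review.
    """
    if step < 1:
        raise ValueError("step must be >= 1")
    return math.sqrt(1.0 - beta2 ** step) / (1.0 - beta1 ** step)


def effective_lr(base_lr, step, beta1=0.9, beta2=0.999):
    """Learning rate that uncorrected Adam would need to match corrected Adam."""
    return base_lr * bias_correction_factor(step, beta1, beta2)


# Print the effective learning rate at a few early and late steps.
for step in (1, 10, 100, 1000, 10000):
    print(step, effective_lr(2e-5, step))
```

With these assumed defaults the factor is about 0.32 at the first step and stays below 1 for the first few thousand steps, so bias correction shrinks the early updates in much the same way a warmup schedule would.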

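Experimental results

Research questions

RQ1: What causes the instability observed when fine-tuning BERT-based models?
RQ2: Are catastrophic forgetting and small dataset size the main causes of the instability?
RQ3: How do optimization dynamics (e.g., vanishing gradients) and generalization contribute to the instability?
RQ4: Can a simple baseline improve fine-tuning stability across architectures and datasets?

Key findings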

Approach	RTE std	RTE mean	RTE max	MRPC std	MRPC mean	MRPC max	CoLA std	CoLA mean	CoLA max
Devlin et al. (2019)	4.5	50.9	67.5	3.9	84.0	91.2	25.6	45.6	64.6
Lee et al. (2020)	7.9	65.3	74.4	3.8	87.8	91.8	20.9	51.9	64.0
Ours	2.7	67.3	71.1	0.8	90.3	91.7	1.8	62.1	65.3
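
Each row of the table summarizes one approach by the standard deviation, mean, and maximum of a task score over runs with different random seeds. A minimal sketch of that summary follows. The helper name `summarize_seeds` and the example scores are hypothetical, and the use of the population standard deviation is an assumption, since the review does not say which variant was used.

```python
from statistics import mean, pstdev


def summarize_seeds(scores):
    """Return std, mean, and max of per-seed scores.

    Uses the population standard deviation (an assumption); swap in
    statistics.stdev for the sample standard deviation.
    """
    if not scores:
        raise ValueError("need at least one score")
    return {"std": pstdev(scores), "mean": mean(scores), "max": max(scores)}


# Hypothetical per-seed accuracies, for illustration only.
hypothetical_scores = [66.1, 68.2, 64.9, 70.0, 67.3]
print(summarize_seeds(hypothetical_scores))
```

A low standard deviation together with a high mean is what indicates a stable recipe; a high maximum alone can come from a single lucky seed.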

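The instability is explained by optimization difficulties (vanishing gradients) and by late-stage variance in generalization, rather than by catastrophic forgetting or small datasets.
In failed runs the gradients vanish in the lower layers, whereas in successful runs the gradients stay strong throughout training.
Adam's bias correction, with its warmup-like effect, markedly improves stability, especially for BERT and ALBERT; RoBERTa also benefits, but to a lesser degree.
Training for more iterations, until the training loss is close to zero, makes performance on the development set more consistent.
A simple baseline (AdamW with bias correction, a learning rate of 2e-5, and 20 epochs) markedly reduces the variance across seeds and reaches competitive mean and maximum performance on RTE, MRPC, and CoLA.
The results extend beyond BERT and generalize to RoBERTa and ALBERT.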
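
The baseline settings listed above can be written down as a small configuration sketch. The learning rate, epoch count, and bias-correction flag come from the review; the batch size, the 10% linear warmup followed by linear decay, the example dataset size, and all names (`BaselineConfig`, `total_steps`, `scheduled_lr`) are assumptions added for illustration and are not confirmed by this page.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class BaselineConfig:
    learning_rate: float = 2e-5   # from the review
    num_epochs: int = 20          # from the review
    bias_correction: bool = True  # from the review
    batch_size: int = 16          # assumption
    warmup_fraction: float = 0.1  # assumption


def total_steps(cfg, num_examples):
    """Number of optimizer steps over all epochs."""
    return cfg.num_epochs * math.ceil(num_examples / cfg.batch_size)


def scheduled_lr(cfg, step, n_steps):
    """Linear warmup to the peak rate, then linear decay to zero (0-indexed step)."""
    warmup = max(1, int(cfg.warmup_fraction * n_steps))
    if step < warmup:
        return cfg.learning_rate * (step + 1) / warmup
    remaining = max(1, n_steps - warmup)
    return cfg.learning_rate * max(0.0, (n_steps - step) / remaining)


cfg = BaselineConfig()
n_steps = total_steps(cfg, num_examples=2500)  # hypothetical dataset size
print(n_steps, scheduled_lr(cfg, 0, n_steps), scheduled_lr(cfg, n_steps // 2, n_steps))
```

The point of the sketch is the step budget: 20 epochs on a small dataset yields far more updates than a typical 3-epoch run, which is what lets the training loss approach zero.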


This review was generated by AI and checked by a human editor.